The History of Educational Measurement: Key Advancements in Theory, Policy, and Practice [1 ed.] 0367370956, 9780367370954

English Pages 398 Year 2021

Table of contents :
Contents
List of illustrations
List of contributors
Preface
Acknowledgements
Part I: Testing Movements
1 Early Efforts • Luz Bay and Terry Ackerman
2 Development and Evolution of the SAT and ACT • Michelle Croft and Jonathan J. Beard
3 The History of Norm- and Criterion-Referenced Testing • Kurt F. Geisinger
4 The Role of the Federal Government in Shaping Educational Testing Policy and Practice • Michael B. Bunch
5 Historical Milestones in the Assessment of English Learners • Jamal Abedi and Cecilia Sanchez
6 Evolving Notions of Fairness in Testing in the United States • Stephen G. Sireci and Jennifer Randall
7 A Century of Testing Controversies • Rebecca Zwick
Part II: Measurement Theory and Practice
8 A History of Classical Test Theory • Brian E. Clauser
9 The Evolution of the Concept of Validity • Michael Kane and Brent Bridgeman
10 Generalizability Theory • Robert L. Brennan
11 Item Response Theory: A Historical Perspective and Brief Introduction to Applications • Richard M. Luecht and Ronald K. Hambleton
12 A History of Scaling and its Relationship to Measurement • Derek C. Briggs
13 A History of Bayesian Inference in Educational Measurement • Roy Levy and Robert J. Mislevy
14 History of Test Equating Methods and Practices Through 1985 • Michael J. Kolen
15 A History of Rasch Measurement Theory • George Engelhard Jr. and Stefanie A. Wind
Index

THE HISTORY OF EDUCATIONAL MEASUREMENT

The History of Educational Measurement collects essays on the most important topics in educational testing, measurement, and psychometrics. Authored by the field’s top scholars, this book offers unique historical viewpoints, from origins to modern applications, of formal testing programs and mental measurement theories. Topics as varied as large-scale testing, validity, item-response theory, federal involvement, and notable assessment controversies complete a survey of the field’s greatest challenges and most important achievements. Graduate students, researchers, industry professionals, and other stakeholders will find this volume relevant for years to come.

Brian E. Clauser is Vice President of the Center for Advanced Assessment at the National Board of Medical Examiners, USA. He is a member of the Board of Editors for Applied Measurement in Education and Journal of Educational Measurement.

Michael B. Bunch is Senior Advisor to Measurement Incorporated, USA, having served the company as Vice President and Senior Vice President of Research and Development since 1982.

THE HISTORY OF EDUCATIONAL MEASUREMENT Key Advancements in Theory, Policy, and Practice

Edited by Brian E. Clauser and Michael B. Bunch

First published 2022
by Routledge
605 Third Avenue, New York, NY 10158

and by Routledge
2 Park Square, Milton Park, Abingdon, Oxon OX14 4RN

Routledge is an imprint of the Taylor & Francis Group, an informa business

© 2022 Taylor & Francis

The right of Brian E. Clauser and Michael B. Bunch to be identified as the authors of the editorial material, and of the authors for their individual chapters, has been asserted in accordance with sections 77 and 78 of the Copyright, Designs and Patents Act 1988.

All rights reserved. No part of this book may be reprinted or reproduced or utilised in any form or by any electronic, mechanical, or other means, now known or hereafter invented, including photocopying and recording, or in any information storage or retrieval system, without permission in writing from the publishers.

Trademark notice: Product or corporate names may be trademarks or registered trademarks, and are used only for identification and explanation without intent to infringe.

Library of Congress Cataloging-in-Publication Data
A catalog record for this title has been requested

ISBN: 978-0-367-37095-4 (hbk)
ISBN: 978-0-367-41575-4 (pbk)
ISBN: 978-0-367-81531-8 (ebk)

Typeset in Bembo by Taylor & Francis Books

ILLUSTRATIONS

Figures

1.1 An example of a typical hornbook (left) and a version of the New England Primer (right)
1.2 An example of woodcut figures that accompanied rhymed phrases designed to help children learn the alphabet
1.3 The first written arithmetic test and result
1.4 The first written grammar test and result
1.5 An example of grade 8 final exam from 1859 Salina, Kansas
2.1 Summary of changes to the SAT
11.1 Normal-ogive probabilities at γ = (–1.25, –.75, –.25, .25, .75, 1.25) with corresponding ICCs
11.2 Comparisons of the three-parameter normal-ogive and logistic functions with and without D = 1.702
11.3 IRT as part of a comprehensive item-bank calibration and scoring enterprise
11.4 Item and test information functions for a 10-item test (3PL-calibrated)
11.5 Test assembly of multiple forms for three different target TIFs
11.6 50-Item CATs for five examinees (Item Bank: I = 600 3PL-Calibrated Items)
11.7 Examples of five MST panel configurations
12.1 A conceptual framework for theory and methods of scaling based on Torgerson (1958)
12.2 Hypothetical results from a comparison of weight with magnitudes xc and xt over replications. The scale of the x-axis is that of sensation intensity, not physical magnitude
12.3 Galton’s survey items for mental visualization
13.1 Graphical representation of the right-hand of de Finetti’s theorem, depicting conditional independence of the xs given θ. Alternatively, a graphical representation of the core structure of many measurement models. Reproduction of Figure 3.5 from Levy & Mislevy (2016)
15.1 Frequency of citations with theme of Rasch measurement theory (Web of Science, September 2019)
15.2 Three traditions of measurement: Test-Score, Scaling, and Structural Traditions
15.3 Concept map for Rasch measurement theory

Tables

5.1 Correlation between state assessment scores and Raven test score
5.2 Raven CPM mean scores for EL and non-ELs
12.1 Statistical scale with descriptive reference points for illumination of visualized mental image. Based on Galton, 1883, p. 93
15.1 General form of the operating characteristic function for defining a family of unidimensional Rasch measurement models
15.2 Log-odds format for family of Rasch models
15.3 Selection of key books on Rasch measurement theory (1960 through 2020)

Box

15.1 Five requirements of invariant measurement (Engelhard, 2013)

CONTRIBUTORS

Brian E. Clauser received his doctorate from the University of Massachusetts, Amherst. Since 1992 he has worked at the National Board of Medical Examiners, where he is currently vice president for the Center for Advanced Assessment. Dr. Clauser has published more than 100 journal papers and book chapters on issues related to differential item functioning, performance assessment, automated scoring of complex assessments, standard setting, applications of generalizability theory, test validity, and the history of psychometrics. He is a fellow of the American Educational Research Association, a past editor of the Journal of Educational Measurement, the current editor of the NCME book series, and the 2018 recipient of the NCME Career Contribution award.

Michael B. Bunch is a senior advisor to Measurement Incorporated, having served as senior vice president and in various other capacities from 1982 to 2020. Previously, he was a senior professional at NTS Research Corporation and research psychologist at ACT. Dr. Bunch received his Ph.D. in psychology (Measurement and Human Differences) from the University of Georgia in 1976. He has served as reviewer for Buros Mental Measurements Yearbook, Journal of Educational Measurement, and other national and international testing journals. He is co-author, with Gregory J. Cizek, of Standard Setting: Establishing and Evaluating Performance Standards on Tests.

Jamal Abedi is a professor of educational measurement at the University of California, Davis. Dr. Abedi's research interests include studies in the areas of psychometrics and test development. His recent work includes studies on the validity of assessment, accommodation, and classification for English learners (ELs) and ELs with disabilities. Abedi serves on assessment advisory boards for a number of states and assessment consortia as an expert in testing ELs. Abedi is the recipient of the 2003 Outstanding Contribution Relating Research to Practice award by the American Educational Research Association (AERA), the 2008 Lifetime Achievement Award by the California Educational Research Association, the 2013 Outstanding Contribution to Educational Assessment award by the National Association of Test Directors, the 2014 UC Davis Distinguished Scholarly Public Service Award, the 2015 UC Davis School of Education Outstanding Faculty award and the 2016 National AERA E.F. Lindquist Award. He holds a master’s degree in psychology and a doctoral degree in psychometrics from Vanderbilt University.

Terry Ackerman is currently a Distinguished Visiting Professor in the Department of Psychological and Quantitative Foundations at the University of Iowa. He received his BS degree from the University of Wisconsin-Madison (1972), and MS (1979) and Ph.D. (1984) from the University of Wisconsin-Milwaukee. He worked as a Psychometrician at ACT (1984–1989), a Professor at the University of Illinois-Champaign (1990–2000), Professor, Chair, and Associate Dean of Research at the University of North Carolina-Greensboro, and E.F. Lindquist Research Chair at ACT (2016–18). He served as president of NCME from 2009–2010 and president of the Psychometric Society (2015–16). He has been a member of several technical advisory committees including ETS’ GRE Technical Advisory Committee, Defensive Advisory Committee, Measured Progress Technical Advisory Committee, the American Institute of Certified Public Accountants, the College Board Research Advisory Committee, and ETS’ Graduate Education Advisory Committee.

Luz Bay is currently a Senior Psychometrician at The College Board. Prior to joining The College Board, Dr. Bay was a psychometrician and assistant vice president in charge of all data analyses, reporting, and quality assurance for K–12 assessment programs at Measured Progress where, among other duties, she led the team to assist the National Assessment Governing Board (NAGB) in setting achievement levels for the National Assessment of Educational Progress (NAEP). She successfully automated two popular standard setting methods: Body of Work and Bookmarking. Dr. Bay has been a member of the National Council of Measurement in Education (NCME) Board of Directors (2015–2018) and the Psychometric Society Board of Trustees (2007–2015). Dr. Bay received her Ph.D. in Educational Measurement and Statistics and M.S. in Mathematics from Southern Illinois University at Carbondale. She has a B.S. in Mathematics from the University of the Philippines at Los Baños.

Jonathan J. Beard is a Psychometrician at The College Board. He works primarily with standard-setting processes for the SAT and AP exams, as well as working with states for support of SAT contracts. Some additional research topics include the effect of AP exams on postsecondary outcomes and the dimensionality of the SAT exam structure.

Robert L. Brennan is E. F. Lindquist Chair in Measurement and Testing, Emeritus, and the Founding Director of the Center for Advanced Studies in Measurement and Assessment (CASMA), in the College of Education of The University of Iowa. Dr. Brennan is the author or co-author of numerous journal articles and several books including Generalizability Theory (2001) and Test Equating, Scaling, and Linking Methods and Practices (2014). Also, he is the editor of the fourth edition of Educational Measurement (2006). He has served as Vice-President of Division D of the American Educational Research Association (AERA), and President of the National Council on Measurement in Education (NCME). He is the co-recipient of the 2000 NCME Award for Career Contributions to Educational Measurement, recipient of the 2004 E. F. Lindquist Award for Contributions to the Field of Educational Measurement sponsored by the AERA and ACT, and recipient of the 2011 Career Achievement Award from the Association of Test Publishers.

Brent Bridgeman received a Ph.D. in educational psychology from the University of Wisconsin-Madison in 1972. After three years as an assistant professor at the University of Virginia he came to Educational Testing Service where he eventually earned the title of Distinguished Presidential Appointee. His current interests are in applied validity and fairness research. He is especially interested in essay assessments as scored by both humans and machines, better ways of communicating predictive validity results to the public, and the impact of test time limits on the validity and fairness of test scores.

Derek C. Briggs is a professor in the Research and Evaluation Methodology program in the School of Education at the University of Colorado Boulder where he also directs the Center for Assessment Design Research and Evaluation. He has recently completed work on the book Credos and Controversies: Historical and Conceptual Foundations of Measurement in the Human Sciences. When he is not pondering the nature of measurement, Dr. Briggs’s research focuses upon advancing methods for the measurement and evaluation of student learning. He has special interests in the use of learning progressions as a method for facilitating student-level inferences about growth, and in the use of statistical models to support causal inferences about educational interventions. Dr. Briggs works with states and other entities to provide technical advice on the design and use of large-scale student assessments. He is the past editor of the journal Educational Measurement: Issues and Practice and was elected president of the National Council on Measurement in Education for 2021–22.

Michelle Croft is a research and data analyst at the Iowa City Community School District. She was previously a Principal Research Scientist at ACT, where she specialized in K–12 policy research and education law. Her recent projects have included the use of college entrance exams under ESSA; student data privacy; testing opt-outs; and test security.

George Engelhard, Jr. joined the faculty at The University of Georgia in the fall of 2013. He is Professor Emeritus at Emory University (1985 to 2013). He received his Ph.D. in 1985 from The University of Chicago (MESA Program – measurement, evaluation, and statistical analysis). He is a fellow of the American Educational Research Association. Dr. Engelhard is co-editor of four books, and the author or co-author of over 175 journal articles, book chapters, and monographs. In 2015, he received the first Qiyas Award for Excellence in International Educational Assessment recognizing his contributions to the field of education based on his book (Invariant Measurement: Using Rasch Models in the Social, Behavioral, and Health Sciences), as well as his program of research that focuses on the improvement of educational measurement at the local, national and international levels. He recently published Invariant Measurement with Raters and Rating Scales: Rasch Models for Rater-mediated Assessments (with Stefanie A. Wind). His book Rasch Models for Solving Measurement Problems: Invariant Measurement in the Social Sciences (with Jue Wang) is forthcoming.

Kurt F. Geisinger is Director of the Buros Center on Testing and the W. C. Meierhenry Distinguished University Professor at the University of Nebraska. He is 2018–2020 President of the International Test Commission, 2019–2020 President of the Quantitative and Qualitative Methodology division of the American Psychological Association (APA), and a vice president of the International Association of Applied Psychology. He has edited numerous textbooks on testing, The ITC International Handbook of Testing and Assessment and several editions of the Mental Measurements Yearbooks and other Buros Institute publications. Dr. Geisinger has served in various leadership roles in the American Psychological Association and is a fellow of six APA divisions. He has also served as a psychology department chair, dean, and academic vice president/provost.

Ronald K. Hambleton is a Distinguished Professor Emeritus and former Executive Director of the Center for Educational Assessment at the University of Massachusetts Amherst where he has been a member of the faculty for 41 years. He is co-author of several textbooks including Fundamentals of Item Response Theory and editor or co-editor of several books including Handbook of Modern Item Response Theory, International Perspectives on Academic Assessment, Computer-Based Testing and the Internet, and Adaptation of Educational and Psychological Tests for Cross-Cultural Assessment, and he is the author of more than 600 articles in measurement and statistics. He has been the recipient of numerous national and international awards and is a past president of the National Council on Measurement in Education, the International Test Commission, Division 5 of the American Psychological Association, and Division 2 of the International Association of Applied Psychology.

Michael Kane has been the Messick Chair in Validity at the Educational Testing Service since 2009. He served as Director of Research at the National Conference of Bar Examiners from 2001 to 2009, and as a professor in the School of Education at the University of Wisconsin from 1991 to 2001. Prior to 1991, Dr. Kane served as VP for research and development and as a senior research scientist at American College Testing (ACT), as Director of Test Development at the National League for Nursing, as a professor of education at SUNY, Stony Brook, and as Director of Placement and Proficiency Testing at the University of Illinois, Urbana-Champaign. His main research interests are validity theory and practice, generalizability theory, licensure and certification testing, and standard setting. Dr. Kane holds a Ph.D. in education and an M.S. in statistics from Stanford University and a B.S. and M.A. in physics from Manhattan College and SUNY, Stony Brook, respectively.

Michael J. Kolen is a Professor Emeritus in Educational Measurement and Statistics at the University of Iowa, where he was a Professor from 1997–2017. He served on the faculty at Hofstra University from 1979–1981, and he worked at ACT from 1981–1997, including being Director of the Measurement Research Department from 1990–1997. He co-authored three editions of the book Test Equating, Scaling, and Linking: Methods and Practices. He published numerous articles and book chapters on test equating and scaling and related topics. Dr. Kolen was President of the National Council on Measurement in Education (NCME), is past editor of the Journal of Educational Measurement, and is Founding Editor of the NCME Book Series. He is a Fellow of Division 5 of the American Psychological Association and a Fellow of the American Educational Research Association. He served on the 2014 Joint Committee on the Standards for Educational and Psychological Testing. He received the 1997 NCME Award for Outstanding Technical Contribution to the Field of Educational Measurement, the NCME 2020 Annual Award, and the 2008 NCME Award for Career Contributions to Educational Measurement.

Roy Levy is a Professor in the T. Denny Sanford School of Social & Family Dynamics at Arizona State University, specializing in Measurement and Statistical Analysis. His scholarly interests include methodological investigations and applications in psychometrics and statistical modeling, focusing on item response theory, structural equation modeling, Bayesian networks, and Bayesian approaches to inference and modeling.

Richard M. Luecht is a Professor of Educational Research Methodology at the UNC-Greensboro where he teaches graduate courses in applied statistics and advanced measurement. His research includes technology integration in assessment, advanced psychometric modeling and estimation, and the application of engineering design principles for formative assessment (i.e., assessment engineering or AE). He has designed numerous algorithms and software programs for automated test assembly and devised a computerized adaptive multistage testing framework used by several large-scale testing programs. Dr. Luecht is also a technical consultant and advisor for many state Department of Education testing agencies and large-scale testing organizations.

Robert J. Mislevy is the Frederic M. Lord Chair in Measurement and Statistics at ETS and Professor Emeritus of Measurement and Statistics at the University of Maryland. Dr. Mislevy's research applies developments in technology, statistical methods, and cognitive science to practical problems in assessment. His projects include an evidence-centered framework for assessment design (ECD), the plausible values methodology used in the National Assessment for Educational Progress, and collaborations with the Cisco Networking Academy on simulation-based assessment and with GlassLab on game-based assessment. His recent books are Sociocognitive Foundations of Education Measurement, Bayesian Psychometric Modeling (with Roy Levy), and Bayesian Inference Networks for Educational Assessment (with Russell Almond, Linda Steinberg, Duanli Yan, and David Williamson). He is a member of the National Academy of Education, a past-president of the Psychometric Society, a recipient of AERA’s Lindquist award for career contributions and NCME’s Career Contributions Award, and he has received the NCME’s Annual Award for Outstanding Contribution to Educational Measurement four times.

Jennifer Randall is an Associate Professor and Director of Evaluation Services for the Center for Educational Assessment at the University of Massachusetts Amherst. She earned her BA (1996) and MAT (1999) from Duke University and Ph.D. in Education from Emory University (2007). Prior to her graduate studies, Dr. Randall taught pre-school and high school social studies for several years. She is currently interested in the differential negative impact of both large and small-scale assessments on historically marginalized populations in the US and abroad; and the ways in which a culturally responsive approach to assessment can mitigate these negative outcomes. She currently sits on multiple technical advisory committees/task forces including the National Assessment of Educational Progress (NAEP) and the Massachusetts Accountability and Assistance Advisory Council. She has also served as the Associate Editor for the Journal of Educational Measurement, Co-Chair for the 2013 Northeastern Research Association (NERA) and 2015 National Council on Measurement in Education (NCME) annual conferences, Chair of the Diversity in Testing Committee (NCME), and Chair of the Research on Evaluation Special Interest Group (AERA). She teaches courses in measurement theory, classroom assessment, and research methods.

Cecilia Sanchez is a Research Project Coordinator at UC Davis School of Education. After graduating from UC Santa Barbara in 2016, Sanchez became a research assistant for Advanced Research & Data Analysis Center (ARDAC). While at ARDAC she recruited schools across the US for multiple projects; collected data through interviews, surveys, and assessments; and managed data input. Since then Sanchez has become a leading researcher on Abedi’s team. Her current tasks include those listed at ARDAC with the addition of data analysis, project management, and editor for reports and publications. Sanchez also has direct experience working with children with disabilities, specifically autism and other developmental disabilities.

Stephen G. Sireci is Distinguished University Professor and Director of the Center for Educational Assessment in the College of Education at the University of Massachusetts Amherst. He is known for his research in evaluating test fairness, particularly issues related to content validity, test bias, cross-lingual assessment, standard setting, and computerized-adaptive testing. He has authored/coauthored over 130 publications, and is the co-architect of the multistage-adaptive Massachusetts Adult Proficiency Tests. He is a Fellow of the American Educational Research Association, and of Division 5 of the American Psychological Association; and Past President of the National Council on Measurement in Education.

Stefanie A. Wind is an Assistant Professor of Educational Measurement at the University of Alabama. Her primary research interests include the exploration of methodological issues in the field of educational measurement, with emphases on methods related to rater-mediated assessments, rating scales, latent trait models (i.e., Rasch models and item response theory models), and nonparametric item response theory, as well as applications of these methods to substantive areas related to education. Dr. Wind received the Alicia Cascallar early career scholar award from the National Council on Measurement in Education, the Exemplary Paper Award from the Classroom Observation SIG of AERA, and the Georg William Rasch Early Career Scholar award from the Rasch SIG of AERA.

Rebecca Zwick is a Distinguished Presidential Appointee in the Psychometrics, Statistics, and Data Sciences area at Educational Testing Service and Professor Emerita at the Gevirtz Graduate School of Education at UC Santa Barbara. Her recent research has focused on test fairness issues and on college admissions. She is the author of more than 100 publications in educational measurement and statistics and education policy, including Who Gets In? Strategies for Fair and Effective College Admissions (2017) and Fair Game? The Use of Standardized Admissions Tests in Higher Education (2002). She also served as editor of Rethinking the SAT: The Future of Standardized Testing in University Admissions (2004). She received her Ph.D. in Quantitative Methods in Education at UC Berkeley, completed a postdoctoral year at the Thurstone Psychometric Laboratory at the University of North Carolina at Chapel Hill, and obtained an M.S. in Statistics at Rutgers University. She is a fellow of the American Educational Research Association and the American Statistical Association and served as President of the National Council on Measurement in Education in 2018–2019.

Karen M. Alexander, M. Ed., is a Ph. D. graduate student in Educational Psychology at the University of Nebraska-Lincoln studying Quantitative, Qualitative and Psychometric Methods. Prior to pursuing her doctorate she worked in K–12 education, both in public and private schools. She has experience in K–12 school accreditation as well as in faculty training and evaluation. Her research interests include psychometrics, applying machine learning to social science, program evaluation, and standardized testing. Peter Baldwin received his doctorate from the University of Massachusetts, Amherst, where he was also a senior research fellow at the College of Education’s Center for Educational Assessment. During this time, Dr. Baldwin was the coarchitect of the multistage-adaptive Massachusetts Adult Proficiency Test. Since 2007, he has worked at the National Board of Medical Examiners, where he is currently a senior measurement scientist. His primary research interests include psychometric methods and natural language processing. In 2015, Dr. Baldwin received the Alicia Cascallar early career scholar award from the National Council on Measurement in Education. Michael Beck is an expert in the development and use of standardized educational assessments. While not personally acquainted with Alfred Binet, his professional career began only shortly after Binet’s contributions ended. Beck was a longtime psychometrician and manager for The Psychological Corporation and Questar Assessment and has operated his own assessment corporation for over two decades. He has provided assessment consultation, performance standard setting, and development activities for high-stakes tests in more than 30 states. He has conducted test development activities for all major test publishers, nearly 50 textbook publishers, and a variety of federal, military, and industrial clients. 
A frequent writer and speaker on assessment issues, he has made professional presentations in 47 states and at over 75 national educational conferences. His published research and assessment-policy writings have appeared in over 20 professional journals and edited books. He is currently the president of BETA, LLC. Randy E. Bennett is Norman O. Frederiksen Chair in Assessment Innovation in the Research & Development Division at Educational Testing Service. Dr. Bennett's work has focused on integrating advances in cognitive science, technology, and educational measurement to create approaches to assessment that have positive impact on teaching and learning. For a decade, he directed an integrated research initiative titled Cognitively-Based Assessment of, for, and as Learning (CBAL). Dr. Bennett is immediate past president of the International Association for Educational Assessment (IAEA) (2016–2019) and past president of the National Council on Measurement in Education (NCME) (2017–2018). He is a Fellow of the American Educational Research Association, winner of the NCME

List of contributors xvii

Bradley Hanson Contributions to Educational Measurement Award from the National Council on Measurement in Education, and winner of the Distinguished Alumni Award from Teachers College, Columbia University. Daniel Bolt is Nancy C. Hoefs-Bascom Professor of Educational Psychology at the University of Wisconsin, Madison. Dr. Bolt’s primary research interests are in multidimensional item response theory, especially applications to test validation, assessment of individual differences (such as response styles), the modeling of student growth, and novel item formats in computer-based assessments. Wayne J. Camara is the Distinguished Research Scientist for Innovation at the Law School Admissions Council. Previously, he served as Senior Vice President of Research at ACT and Vice President of Research at College Board. Dr. Camara is a fellow of APS, AERA and SIOP, and three divisions of APA. He is past president of the National Council for Measurement in Education, past Vice President of AERA Division D, past president of APA’s Division of Evaluation, Measurement & Statistics, past chair of the Association of Test Publishers, and an associate editor or on the editorial board of journals in education and industrial psychology. He is currently on the council of the International Test Commission (ITC). He has served as technical advisor for US assessment programs for the military (DOD), law school (LSAC), medical school (AAMC), accountants (AICPA), student athletes (NCAA), and various state and employment testing programs. Wayne has served as the project director for the 1999 Standards for Educational and Psychological Testing, chair of the Management Committee for the 2014 revision, and on committees developing testing standards for ISO, SIOP, ITC, APA, and the Code of Fair Testing Practices. He is recipient of career awards from SIOP and ATP. Jerome Clauser earned his doctorate at the University of Massachusetts Amherst, where he studied under the direction of Ron Hambleton. 
Since 2013, he has worked for the American Board of Internal Medicine. He currently serves as ABIM’s Senior Director of Research & Innovations and is responsible for the organization’s measurement research agenda. In addition, he serves on the NCME Publications Committee and the Editorial Board for Applied Measurement in Education. Craig Deville received his Ph.D. in Educational Measurement and Evaluation from The Ohio State University. He spent the first years of his career working in the area of licensure and certification testing. He then turned to educational testing, working at the Iowa Testing Programs, University of Iowa, and Measurement Incorporated (MI). Craig recently retired as Director of Psychometric Services at MI. Throughout his career Craig has been interested and active in the field of language testing. He has served on the boards of language testing organizations and journals.


Ron Dietel was the Director of Communications of the UCLA CRESST from 1992 until his retirement in 2014. His work involved preparing monthly CRESST technical research papers and helping other senior researchers by editing their research papers. In addition to writing several educational films, Dr. Dietel has authored Get Smart! Nine Sure Ways to Help Your Child Succeed in School (Wiley/Jossey Bass, 2006) and written for publications such as School Administrator magazine. He has also spoken frequently on educational topics at conferences such as the American Educational Research Association and National Council on Measurement in Education. Neil J. Dorans received his Ph.D. in quantitative psychology from the University of Illinois and worked for 42 years at Educational Testing Service, where he focused on the linking and scaling of test scores, and methods for assessing fairness at the item and score level. He proposed a procedure for assessing differential item functioning in the early 1980s. Dr. Dorans was the architect for the recentered SAT scales introduced in the mid-1990s. Dr. Dorans co-edited Fairness in Educational Assessment and Measurement. He also was the lead editor for Linking and Aligning Scores and Scales, and for Looking Back: Proceedings of a Conference in Honor of Paul W. Holland, and he has published numerous journal articles, technical reports, and book chapters on differential item functioning, score equating and score linking, context effects, and item response theory. Dr. Dorans received the ETS Measurement Statistician Award in 2003, the National Council on Measurement in Education’s Career Contributions Award, and the Association of Test Publishers Career Achievement Award. William P. Fisher, Jr. currently holds positions as Senior Scientist at the Research Institute of Sweden in Gothenburg, and as Research Associate with the BEAR Center in the Graduate School of Education at the University of California, Berkeley.
He consults independently via Living Capital Metrics LLC, and as a partner with INNORBIS.com. He co-edited a 1994 special issue of the International Journal of Educational Research (with Benjamin Wright) and co-edited a 2017 volume honoring Benjamin Wright's career (with Mark Wilson). Deborah J. Harris is currently Visiting Professor in Educational Measurement and Statistics at The University of Iowa, where she conducts research, works with graduate students, and teaches. She is also editor of the National Council on Measurement in Education journal Educational Measurement: Issues and Practice. Formerly, Dr. Harris was Vice President of Psychometric Research at ACT, Inc. She has presented and published extensively, particularly in the area of comparability of test scores, including equating, concordance, vertical scaling, and test security. Suzanne Lane is a Professor of Research Methodology in the School of Education at the University of Pittsburgh. Her scholarly interests are in educational


measurement, with a focus on design, technical, validity and policy issues in testing. She is a co-editor of the Handbook of Test Development. She was the President of the National Council on Measurement in Education (2003–2004), Vice President of Division D-AERA (2000–2002), and member of the Joint Committee for revising the Standards for Educational and Psychological Testing (1993–1999). She has served on the Editorial Boards for the Journal of Educational Measurement, Applied Measurement in Education, Educational Assessment, Educational Researcher, and Educational Measurement: Issues and Practice. Dr. Lane has also been a member of technical advisory boards for organizations including ETS, The College Board, PARCC, NCSC, and numerous states. Andrew Maul is an Associate Professor in the Department of Education at the University of California, Santa Barbara. His work focuses on the conceptual and historical foundations of research methodology in the human sciences, and in particular on the theory and practice of measurement. His work integrates lines of inquiry traditionally associated with philosophy, psychology, history, and statistics, and aims to help improve the logical and ethical defensibility of methodological practices in the human sciences. He regularly teaches courses on the construction and validation of measuring instruments, item response theory, and the philosophy of measurement, as well as introductory and advanced research methods and applied statistics. Joseph McClintock, Vice President of Research and Development at Measurement Incorporated, has over 20 years of experience in psychological research, assessment, and project management. As the Vice President for Research and Development for Measurement Inc., he supervises a staff of more than 50 professionals and provides leadership in all aspects of test and item development.
Prior to joining MI, he was Director of Examination Services & Research for the American Board of Anesthesiology where he advised the Board of Directors on psychometric issues for objective and performance examinations. He has published peer-reviewed articles on certification performance, data forensics and erasure analyses. Tim Moses has done work and research in equating, scaling and linking methods. He completed his doctorate at University of Washington, was a psychometrician at Educational Testing Service, and has held positions as Chief Psychometrician and the Robert L. Brennan Chair of Psychometric Research at the College Board.

Anthony J. Nitko is a private consultant, and Professor Emeritus and former Chairperson of the Department of Psychology in Education at the University of Pittsburgh. Among his publications are the chapter, “Designing Tests that are Integrated with Instruction” in the Third Edition of Educational Measurement,


Assessment and Grading in Classrooms with Susan Brookhart (2008), and Educational Assessment of Students (8th edition) with Susan Brookhart (2018). Some of his work has been translated into Arabic, Japanese, and Turkish. He was Editor of the journal Educational Measurement: Issues and Practice. He was elected Fellow to the American Psychological Association, to the Board of Directors of the National Council on Measurement in Education, and as President of the latter. He received Fulbright awards to Malawi and to Barbados. He has served as a consultant to various government and private agencies in Bangladesh, Barbados, Botswana, Egypt, Ethiopia, Indonesia, Jamaica, Jordan, Liberia, Malawi, Maldives, Namibia, Oman, Saudi Arabia, Singapore, the United States, Vietnam, and Yemen. Corey Palermo is Executive Vice President & Chief Strategy Officer at Measurement Incorporated (MI), where he spearheads the development and execution of MI’s strategic initiatives and leads MI’s Performance Assessment Scoring department. Dr. Palermo’s research examines rater effects in large‐scale assessment contexts; automated scoring, in particular automated writing evaluation applications to improve the teaching and learning of writing; and teacher professional development in the context of large-scale assessment programs. His work has been published in leading peer-reviewed education journals such as Contemporary Educational Psychology, Journal of Educational Measurement, Journal of Writing Research, and Teaching and Teacher Education. Richard J. Shavelson is Professor of Education and Psychology, Dean of the Graduate School of Education and Senior Fellow in the Woods Environmental Institute (Emeritus) at Stanford University. 
He was president of AERA; a fellow of American Association for the Advancement of Science, AERA, American Psychological Association, and the American Psychological Society; a Humboldt Fellow; and member of National Academy of Education and International Academy of Education. His work focuses on performance assessment of undergraduate learning. His publications include Statistical Reasoning for the Behavioral Sciences, Generalizability Theory: A Primer, Scientific Research in Education; Assessing College Learning Responsibly: Accountability in a New Era. Sandip Sinharay is a principal research scientist at Educational Testing Service (ETS). He is the current editor of the Journal of Educational Measurement and was an editor for the Journal of Educational and Behavioral Statistics between 2010 and 2014. Over the course of his career, Dr. Sinharay has won multiple awards from the National Council on Measurement in Education including the Bradley Hanson Award, the Award for Outstanding Technical or Scientific Contribution to the Field of Educational Measurement in 2015 and 2009, and the Jason Millman Promising Measurement Scholar Award in 2006. He is the joint editor of two books, including the volume on psychometrics in the Handbook of Statistics


series, and has authored more than 100 research articles in peer-reviewed journals on educational and psychological measurement and statistics. Howard Wainer is a Fellow in both the American Statistical Association and the American Educational Research Association. He has received numerous awards including the Educational Testing Service's Senior Scientist Award, the 2006 National Council on Measurement in Education Award for Scientific Contribution to a Field of Educational Measurement, the National Council on Measurement in Education’s career achievement award in 2007, and the Samuel J. Messick Award for Distinguished Scientific Contributions from Division 5 of the American Psychological Association in 2009. Dr. Wainer was editor of the Journal of Educational and Behavioral Statistics and is a former Associate Editor of the Journal of the American Statistical Association. Michael Walker is Distinguished Presidential Appointee and Director of the Fairness and Equity Research Methodologies Institute at Educational Testing Service. Dr. Walker holds a Ph.D. in quantitative psychology and an M.S. in statistics, both earned at the University of Illinois, Urbana-Champaign. He also has an M.A. and a B.A. in psychology, both earned at Wake Forest University. With more than three decades of involvement in the field of testing, Dr. Walker is a recognized, published expert on all aspects of designing and maintaining testing programs. Current research focuses on issues in standardized testing: equating, concordance, subgroup differences, and essay reliability. Published work covers fair test design and use; maintaining and transitioning testing programs; test scaling and equating; test reliability; and use of constructed response items.

PREFACE

This is not a history book in the strictest sense of the term. None of the authors are historians. We are instead participants in much of the history covered in this book and as such are likely to have particular perspectives and even biases, which will become apparent. Nevertheless, we have undertaken to write this book because no true historian has come forward to do so. Some historians have written on assorted topics in educational testing (e.g., McLaughlin, 1974; Vinovskis, 1998; Reese, 2013) or more comprehensive histories of testing in general (e.g., DuBois, 1970; Sokal, 1987) but until now there has been no comprehensive history of educational testing in America that addresses the technical, social, political, and practical aspects of the subject from its origins to the present time. This book is divided into two sections with interconnected content: Testing Movements (Part I) and Measurement Theory and Practice (Part II). For each section, we have invited individuals with particular interest in—and frequently extensive involvement in—the chapters they have written. While both of the editors have been in this field for a considerable number of years and thought we knew our own history, we learned quite a bit reading these chapters that we did not know before. We hope you will have the same experience.

Part I: Testing Movements

In this section, we hope to show how educational testing as we now know it evolved over the course of two centuries, in parallel with changes in the larger enterprise of education as well as those of society in general. Ideas that shaped how Americans saw themselves and the rest of the world (egalitarianism, social Darwinism, progressivism, the efficiency movement, and various reform movements) were clearly reflected in the various testing movements of their times.


A common theme in all testing movements over the past two centuries has been the steady progression toward greater standardization and consolidation. In 1845, Horace Mann’s standardized written exams replaced ad hoc recitations as a way to determine what students actually knew and could do. They also consolidated authority over the production and administration of tests and the actions taken in response to the results of those tests. Today, the Every Student Succeeds Act of 2015 and its many predecessors stretching back fifty years, while granting considerable authority to state and local education agencies, standardizes test content and administration procedures, dictates who will be tested, and consolidates authority over test use and interpretation at the highest levels. Specific testing movements have reflected the American Zeitgeist of their respective eras. The actions of Horace Mann and others in the middle of the 19th century reflected the impatience of a growing nation with the old ways and a desire to take its place on the world stage. Doing so would require a well-educated public, with assessment of the quality of that education under centralized authority. As we entered the 20th century, with more and more young people applying for admission to college, educators saw the need for standardized methods for deciding who should be admitted. As society’s mood shifted in the middle of the 20th century from sifting and sorting students to certifying their competence and mastery, test experts obliged by introducing (or perhaps we should say reintroducing, as such was the focus of Horace Mann’s tests in 1845) criterion-referenced measurement. With the introduction of criterion-referenced testing, a window on the operations of schools, districts, and even whole states opened up and gave way to federal legislation mandating testing of all students in specified grades over specified content.

Part II: Measurement Theory and Practice

The essays in the second part of this volume describe how measurement theory has evolved in parallel with the practice of testing. The typical student studying measurement might well come away with the idea that the various theoretical approaches to problem solving in assessment were developed as part of a coherent, and perhaps inevitable, framework. A careful reading of these chapters will do much to correct that impression. To the extent that a coherent framework exists it is not the result of inevitable evolution, but rather the result of the fact that over a century and a half many different researchers and theorists have asked very different kinds of questions and in the process contributed to the field in ways they may not have intended or anticipated. Francis Galton’s interest was in human inheritance. The correlation coefficient that he developed to study inheritance provided the framework that allowed Charles Spearman to produce the equations that are essential for classical test theory. Those equations are key to much of measurement theory, but ironically Spearman had little interest in the interpretation of individual test scores. He produced these equations as tools for correlational psychology which he needed


to facilitate his study of the nature of intelligence. From these unintended beginnings classical test theory had its start; the missing parts of the theory were then, very intentionally, filled in by researchers including Kelley, Kuder and Richardson, Lord, and Cronbach. In a way, the work of Galton and Spearman in England was paralleled by Fechner in Germany. There is no evidence that Fechner intended to develop a theoretical framework for scaling. His immediate interest was the study of human sensation. His studies led to the creation of psychophysics and by extension experimental psychology. Psychophysics may be of little interest to most readers interested in measurement theory, but Fechner’s work to understand human sensation (along with Galton’s efforts to measure differences in inherited characteristics) nonetheless represents the starting point for psychological scaling. This early work, motivated by very different intentions, has provided the basis by which we understand the nature of measurement and has over time evolved into the sophisticated methods currently in use for scaling and equating. Validity theory is yet another area where comprehensive frameworks—such as those presented by Messick (1989) and Kane (2013)—may disguise the fact that the tools we use to evaluate test scores have evolved over decades because of social pressures and changes in the focus of testing and the interests of measurement professionals. As this focus shifted from selection and prediction to achievement to assessment of personality and more recently to program evaluation for district and state educational systems, the tools needed to evaluate validity have evolved as well. Even the models we tend to think of as modern theories of measurement—item response theory and Bayesian inference—have deep historical roots. Bayesian inference has its beginning in the 18th century and has been over time widely accepted, vociferously rejected (by statisticians including R. A. 
Fisher) and more recently again widely accepted because of its practical utility. The historic roots of item response theory are equally important. These models are a response to the limitations of classical test theory: most notably the fact that classical test theory results are sample dependent. This in itself ties the models to a very long history that unavoidably shapes the theory. The models are also tied to the tradition of factor analysis and the work of sociologist Paul Lazarsfeld to extend the approaches to latent structure analysis. Finally, item response theory developed not once, but twice, as represented by the work of Georg Rasch and Allan Birnbaum. Those two lines of development have led researchers to emphasize different aspects of the theory and for a time led to very public disputes about the theory. We hope you will enjoy reading about how test theory and practice have evolved over the last century and a half and we hope that in the process you will gain a new perspective on test theory and practice. To borrow a statement that Wainer (2013, p. 118) has attributed to Aristotle, “We understand best those things we see grow from their very beginnings.” Each generation will necessarily


adapt and extend measurement theory and if practitioners understand how those theories came into being, they will better understand how to re-envision those theories for the challenges of their generation.

References

DuBois, P. H. (1970). A History of Psychological Testing. Boston: Allyn and Bacon.
Kane, M. T. (2013). Validating the interpretations and uses of test scores. Journal of Educational Measurement, 50(1), 1–73.
McLaughlin, M. W. (1974). Evaluation and reform: The Elementary and Secondary Education Act of 1965, Title I. Santa Monica, CA: Rand.
Messick, S. (1989). Validity. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp. 13–103). New York: American Council on Education and Macmillan.
Reese, W. J. (2013). Testing Wars in the Public Schools: A Forgotten History. Cambridge, MA: Harvard University Press.
Sokal, M. M. (Ed.) (1987). Psychological Testing and American Society. Rutgers, NJ: Rutgers University Press.
Vinovskis, M. A. (1998). Overseeing the nation’s report card: The creation and evolution of the National Assessment Governing Board. Paper prepared for the National Assessment Governing Board. Ann Arbor, MI: University of Michigan. Retrieved 5/7/19 from https://www.nagb.gov/assets/documents/publications/95222.pdf.
Wainer, H. (2013). Graphic Discoveries: A Trout in the Milk and Other Visual Adventures. Princeton, NJ: Princeton University Press.

ACKNOWLEDGEMENTS

I have had the privilege of being a witness to much of the history documented in this volume, working with and learning from many of the individuals who have made historic advances in the field of educational measurement in academia, industry, and educational settings at the federal, state, and local levels. I am grateful to them for allowing me to work alongside them to solve educational measurement problems. I am also grateful to the authors who contributed chapters to this volume and the reviewers who provided so many excellent comments. Many of them have been involved in this project since our original idea for a two-part symposium for the National Council on Measurement in Education in 2017 and have worked closely with me and Brian over the past four years. I am particularly grateful to Hank Scherich, who gave me a chance forty years ago to be a part of this great enterprise. He also encouraged me to share my experiences through papers, articles, workshops, and books. It has been one of the great pleasures of my life to work for him and to see Measurement Incorporated grow from a four-person operation in the basement of a podiatrist’s office to the industry leader it is today. Finally, I am grateful to my wife Kathryn, who has always believed in me and encouraged me to write. She has been my inspiration for over fifty years and a loving critic of my rough drafts.

Michael B. Bunch
Durham, NC

I think it is fair to say that I never would have had a career in educational measurement if I had not met Ron Hambleton. He introduced me to the field, welcomed me into his program at the University of Massachusetts, Amherst, and has been a mentor and friend ever since. I am more pleased than I can say that he


has made a contribution to this volume. Early in my career, I was also fortunate to meet Lee Cronbach. One of the things he taught me was that some very insightful people have been trying to solve the problems of testing and measurement for generations and no matter what new problem you are struggling with there is a good chance that someone has already given it more thought than you have. This led me to start collecting and reading the works of those previous generations, and inevitably to thinking about what was to become this book. I was again fortunate that in 2017 Michael Bunch invited me to participate in the history session he was organizing for the National Council on Measurement in Education. That invitation led to the collaboration that produced this volume. When I think of my career, more than anything else I think of the friendship and kindness that I have received from so many others. I discussed ideas for this book with many of my friends and mentors over the years and I am deeply gratified that so many of them were willing to contribute to this project. I want to express my sincere thanks to the authors and reviewers. Finally, I want to thank the National Board of Medical Examiners and my many friends and colleagues there and most importantly my wife, Suzanne, who shares my love of history and has supported me throughout this project and my life.

Brian E. Clauser
Media, PA

PART I

Testing Movements

1 EARLY EFFORTS Luz Bay and Terry Ackerman1

The rise of educational assessment in the United States is closely tied to attempts to create a formal educational system. In the 1600s peoples from Europe, including Puritans, Huguenots, Anabaptists, and Quakers, came to North America to escape religious persecution. Once they arrived, daily life was mostly a matter of survival. When education occurred, it was primarily at home. The largest influence on education in the American colonies was religion. This chapter describes the roots of the US educational system and the beginnings of student assessment. In the early 1600s all religious leaders regarded education of their young people as essential to ensuring that youth could read the Bible.

Chronological Perspective

To foster a better understanding of education and how primitive classrooms and schools in the early 17th century functioned, we offer the following chronology of technological developments:

1809: first chalkboard in a classroom
1865: quality pencils mass-produced
1870: inexpensive paper mass-produced for students to use in the classroom
1870: steel pens mass-produced, replacing inefficient quill pens
1870: “magic lanterns” used in schools to project images from glass plates
1936: first electronic computer
1960: first “modern” overhead projectors in classrooms
1967: first hand-held calculator prototype “Cal Tech” (created by Texas Instruments)
1969: first form of the internet, ARPANET


Historical Perspective of Education in the American Colonies

The first attempt to establish a school in the colonies was in 1619–20 by the London Company, with the goal of educating Powhatan Indian children in Christianity. This attempt was followed by the establishment of the East India School, which focused on educating white children only. In 1634, the Syms-Eaton Academy (aka Syms-Eaton Free School), America’s first free public school, was created in Hampton, Virginia by Benjamin Syms. The mission of the school was to teach white children from adjoining parishes along the Poquoson River (Heatwole, 1916; Armstrong, 1950). A year later, the Latin Grammar School was founded in Boston (Jeynes, 2007). The school followed the Latin school tradition that began in Europe in the 14th century. Instruction was focused on Latin, religion, and classical literature. Graduates were to become leaders in the church, local government, or the judicial system. Most graduates of Boston Latin School initially did not go on to college, since business and professions did not typically require college training. The Latin School admitted only male students and hired only male teachers until well into the 1800s (Cohen, 1974). The first Girls’ Latin School was founded in 1877. The Boston Latin School is still flourishing today. The first private institution of higher education, Harvard, was established in Newtown (now Cambridge), Massachusetts in 1636. When it first opened, it served nine students. In 1640, a Puritan minister, Henry Dunster, became the first president of Harvard and modeled the pedagogical format on that of Cambridge. In his first several years he taught all courses himself. Harvard graduated its first class in 1642 (Morison, 1930). The Massachusetts Bay colony continued to open schools in every town. One by one, villages founded schools, supporting them with a building, land, and on occasion public funding.
In 1647, the colony began to require by law secondary schools in the larger cities, as part of an effort to ensure the basic literacy and religious inculcation of all citizens. The importance of religious and moral training was even more apparent in legislation passed that year that was referred to as the “Old Deluder Satan Act.” The legislation affirmed that Satan intended for people to be ignorant, especially when it came to knowledge of the Bible. The law mandated that each community ensure that its youth were educated and able to read the Bible. It required that each community of 50 or more householders assign at least one person to teach all children in that community (Fraser, 2019). This teacher was expected to teach the children to read and write, and he would receive pay from the townspeople. It should be noted that women teachers did not begin to appear until the Revolutionary War, and then only because of the shortage of males due to the war and labor demands. The legislation further instructed towns of 100 or more households to establish grammar schools. These schools were designed to prepare children so that they would one day be capable of studying at a university (Fraser, 2019).


These laws were not focused so much on compulsory education as on learning. They reasserted the Puritan belief that the primary responsibility for educating children belonged to the parents. Puritans believed that even if the schools failed to perform their function, it was ultimately the responsibility of the parents to ensure that their children were properly educated (Cubberley, 1920). Current research still indicates that there is a strong connection between educational success and parental involvement (Kelly, 2020). Formal education in colonial America included reading, writing, simple mathematics, poetry, and prayers. Paper and textbooks were almost nonexistent, so the assessment of students primarily entailed recitation and subsequent memorization of their lessons. The three most-used formats of instruction were the Bible, a hornbook, and a primer. The hornbook was a carry-over from mid-fifteenth-century England, and early settlers brought it to America. It consisted of a sheet of paper that was nailed to a board and covered with transparent horn (or shellacked) to help preserve the writing on the paper (Figure 1.1). The board had a handle that a student could hold while reading. The handle was often perforated so it could attach to a student’s belt. Typically, hornbooks included the alphabet in capital and small letters, followed by combinations of vowels with

FIGURE 1.1

An example of a typical hornbook (left) and a version of the New England Primer (right)


consonants to form syllables, the Lord’s Prayer, and Roman numerals (Plimpton, 1916; Meriwether, 1907). The New England Primer, the first actual “book” for use in grammar schools, was published in the late 1680s by Benjamin Harris, an English bookseller and writer who claimed to be one of the first journalists in the colonies. The selections in The New England Primer varied somewhat over time, although there was standard content designed for beginning readers. Typical content included the alphabet, vowels, consonants, double letters, and syllabaries of two to six letters. The original Primer was a 90-page work and contained religious maxims, woodcuts, alphabetical rhymes, acronyms, and a catechism, including moral lessons. Examples of the woodcuts and rhymes to help students learn the alphabet are shown in Figure 1.2. Printing presses did not exist outside of Massachusetts until they appeared in St. Mary’s City (MD) and Philadelphia, in 1685 (Jeynes, 2007). Until the mid-nineteenth century, most teachers in America were young white men. There were, of course, some female teachers (e.g., women in cities who taught children the alphabet, or farm girls who taught groups of young children during a community’s short summer session). But when districts recruited schoolmasters to take charge of their winter session scholars, boys and girls of all ages, they commonly hired men because men were considered more qualified to

FIGURE 1.2

An example of woodcut figures that accompanied rhymed phrases designed to help children learn the alphabet

Early Efforts 7

teach and discipline, an essential ingredient to an efficient school. Few schoolmasters taught beyond their late 20s; most abandoned their teaching careers by age 25. The defection rate was extremely high: over 95% within five years of starting. Usually men taught while training for other careers like law or ministry. The wages were low, the work was seasonal, and there were often many better paying opportunities than teaching. Overwhelmingly, teachers felt that absolute adherence to fundamental teachings was the best way to pass on values held in common. If children were disobedient in any way, the teacher could yank them from their benches for the liberal application of the master’s whip to drive the devil from the child’s body. If children did something particularly egregious that interfered with their redemption, or if the schoolmaster was unusually strict, they could be required to sit for a time in yokes similar to those worn by oxen, while they reflected on their transgressions.

Tutoring was also a common form of education in the colonial period. Philosophers such as John Locke, Jean-Jacques Rousseau, and William Penn all strongly suggested that public schools were unhealthy and immoral. Penn suggested that it was better to have an “ingenious” tutor in his house than to expose his children to the vile impressions of other children in a school setting (Gathier, 2008). Many ministers also became tutors. Tutors or governesses had more authority over their students than teachers do today. They could spank or whip the students or sit them in the corner if they misbehaved. When a student talked too much, the tutor placed a whispering stick in the talkative student’s mouth. This stick, held in place by a strip of cloth, eliminated talking. Tutors also used dunce caps and nose pinchers to keep students in line.

Another interesting mode of education was sending children to live with another family to receive lodging and instruction.
In many cases parents sought to send their children to live with the more affluent so they could acquire not only an education but manners and habits such families could provide. It was also thought that parents could not discipline their children as well as a foster family could. It has been argued that this process was also for economic reasons as the student was also put to work for the acquiring family. Sometimes, to clarify the arrangement, formal agreements were written up. The greater the potential to learn more skills, the more parents would have to pay. That is, an education would be exchanged for child labor. By the early 1700s most children carried a book-sized writing slate with a wooden border. The slate was used to practice writing and penmanship. Typically, a student would scratch the slate with a slate “pencil” or cylinder of rock. Eventually, the slate pencil was replaced by soft chalk. Students were not able to retain any of their work to review or study later. The main pedagogical tool was memorization, and the main form of assessment was through recitation led by the teacher.

Upon the formation of the United States Government, education was taken up by the individual states – the civic purpose superseded the older religious aim. District schools and academies at first were dominant. Gradually graded town schools and public high schools developed. Definite steps were taken toward State direction of education at public expense, under Horace Mann’s influence. Following him came a marked expansion in the scope of public education.
(Department of Education, Commonwealth of Massachusetts, 1930)

Horace Mann, often referred to as the Father of Public Education, helped to design America’s first public education system, which resulted in the United States becoming one of the most highly educated societies in the world. His influence is still very apparent, with 73 public schools in 22 states and the District of Columbia bearing his name².  More importantly, his six principles of education shaped most of what public education in the United States is to the present day. Although controversial at the time, he believed that

1. citizens cannot maintain both ignorance and freedom;
2. this education should be paid for, controlled, and maintained by the public;
3. this education should be provided in schools that embrace children from varying backgrounds;
4. this education must be nonsectarian;
5. this education must be taught using tenets of a free society; and,
6. this education must be provided by well-trained, professional teachers.

It is also through Mann’s influence that educational assessment in the United States was transformed from oral to written examinations.

From School Exhibitions to Written Tests

One of the earliest formal types of assessment was declamation. This format dated back to ancient Rome and was an integral part of their education system. At the Latin School, students in the upper grades were required to give an oration or declamation in their English class. There was also Public Declamation, in which pupils from all grades or classes were welcomed to try out for the chance to declaim a memorized piece in front of an assembly of teachers and citizens. During Public Declamation, declaimers were scored on aspects such as “Memorization”, “Presentation”, and “Voice and Delivery”. Those who scored well in three of the first four public declamations were given a chance to declaim in front of alumni judges for awards in “Prize Declamation”.

In the summer of 1845, students in Boston grammar schools were ambushed, not with guns but with tests. A total of 530 students in Boston and surrounding communities were given the same questions to answer, in writing! This incident,

its precursor, and what happened after were documented in detail in the book Testing Wars in the Public Schools: A Forgotten History (Reese, 2013). Until that fateful day, written tests were unheard of in grammar schools in America. This new method, introduced at Harvard for admission only twelve years prior, was a European import aimed at assessing what students learned. For the first time ever, the highest grammar school classes in Boston and surrounding communities were given a common written test put together by reformers. One of the motivations for the introduction of standardized written tests was to hold powerful grammar school masters accountable for student learning. Before standardized tests were introduced, “[p]ublic performance was everything.” (Reese, 2013, p. 14). The Boston community funded its public schools very well and was very proud of its accomplishments. Every year members of the community would gather to see students in a parade and hear them recite poems and perform music. Students would also answer questions orally in subjects in which they received instruction. Teachers would test students’ memory in a recitation to show that they were proficient in the subject. These events were considered high stakes given that students’ performance on these exhibitions alone made observers “confident they knew whether a school has stagnated, improved, fallen behind, or exceeded expectations.” (Reese, 2013, p. 32) Strong student performance was a sufficient indicator of a well-functioning school while a memorization mishap raised questions about teacher quality. There were criticisms of these exhibitions as not being a sound basis for evaluating student learning, schools, and teachers. One of the fiercest critics of school exhibitions was Horace Mann. In 1837, Mann was appointed secretary of the newly created Massachusetts State Board of Education. He traveled abroad in 1843 to observe/study innovations in European schools. 
He concluded from that trip that Prussia’s schools were superior to America’s, as prominently indicated in the Seventh Annual Report of the Board of Education; Together with the Seventh Annual Report of the Secretary of the Board (Mann, 1844). In this well-publicized report, Mann called for reforms lest schools in Massachusetts deteriorate. The report, by applauding schoolmasters in Prussia, was also very critical of the grammar schoolmasters in Boston. The criticisms leveled at Boston schoolmasters covered both ineffective teaching practices and the use of corporal punishment in the classrooms. When his friend, Samuel Gridley Howe, was elected to the School Committee and later to the Examining Committee, Mann put in motion a new form of examination: a timed, standardized written test to be given to all first-class (14- and 15-year-old) pupils in Boston. According to Mann, the result of the test would replace vague memory of student performance with “positive information in black and white” about what students did and did not know (Reese, 2013, p. 131). He further believed that the results would indicate a decline in Boston’s public schools, thus supporting his broader reform agenda.

In the summer of 1845, without warning, the Examination Committee came to Boston’s grammar schools with pre-printed written tests. The examiners tested the best 20 to 30 students from each school for a total of 530 students out of Boston’s approximately 7,000 students. The students sat down to respond to short-answer questions “secretly crafted” by a handful of activists (Reese, 2013, p. 5). There were seven tests altogether with varying numbers of questions requiring short responses that were sampled from assigned textbooks. The number of questions per subject was as follows:

History: 30
Definitions: 28
Geography: 31
Arithmetic: 10
Grammar: 14
Philosophy: 20
Astronomy: 31

Students had one hour to write their responses for each test. The results were a disaster! Members of the examination committee spent the rest of the summer hand scoring 31,159 responses. The average score was 30%. Reports of test results “show beyond all doubt, that a large proportion of the scholars in our first classes, boys and girls of 14 or 15 years of age, when called upon to write simple sentences, to express their thoughts on common subjects, without the aid of a dictionary or a master, cannot write, without such errors in grammar, in spelling, and in punctuation.” (Travers, 1983, p. 91). Taxpayers were shocked. The test results were used to criticize the teachers and the quality of education the children were receiving. Newspapers published the results including ranked lists of the best and worst performing schools. School masters from some of the lowest performing schools were fired. Calls for reform became overshadowed by the narrative of school decline supported by the abysmal performance on the tests. But the quality of the tests was not good, either. The questions were very hard, and there seemed to be no rationale for some of them. A particularly puzzling question asked students to name the “rivers, gulfs, oceans, seas, and straits, to which a vessel must pass in going from Pittsburgh, in Pennsylvania, to Vienna in Austria.” In Figures 1.3 and 1.4 are images of the Arithmetic and Grammar tests and results printed in the Common School Journal v.7 (Mann, 1845). Although the results were presented with evidence of the examiners’ proclivity towards numerical summaries, it was apparent that they were not benefitting from the beauty and elegance of statistics. This young science proved to be a catalyst for the profound interest in test results in years to come. With the advent of the imposed testing, the first American testing war began. Reese (2013) wrote in a New York Times essay: “What transpired then still

FIGURE 1.3

The first written arithmetic test and result

FIGURE 1.4

The first written grammar test and result

sounds eerily familiar: cheating scandals, poor performance by minority groups, the narrowing of curriculum, the public shaming of teachers, the appeal of more sophisticated measures of assessment, the superior scores in other nations, all amounting to a constant drumbeat about school failure.” When results of the tests were reported, Mann (1845) claimed seven major advantages of written over oral examinations: (1) the same questions being given to students from all schools, it is possible to evaluate the students and their schools impartially (and, indeed, the Boston examiners were at least as interested in using the examination results to measure how well the various schools were fulfilling their mission as they were in assessing individual students—a common use of examinations that persists today); (2) written tests are fairer to students, who have a full hour to arrange their ideas rather than being forced, when a whole class is being examined orally, to display what they know in at most two minutes of questioning; (3) for the same reason, written examinations enable students to express their learning more thoroughly in response to a wider range of questions; (4) teachers are unable to interrupt or offer suggestions to examinees; (5) there is no possibility of favoritism; (6) the development of ideas and connecting of facts invited in more extensive written answers makes it easier to evaluate how competently the children have been taught than is possible with brief, factual oral responses; and (7) “a transcript, a sort of Daguerreotype likeness, as it were, of the state and condition of the pupils’ minds is taken and carried away, for general inspection,” and this almost photographic image, permanent because written, enables the establishment of objective standards for the accurate comparison of examinees and their schools. 
(Hanson, 1993)

The tests were given again in 1846 and in subsequent years, but by 1850, Boston had abandoned its strategy and reverted to nonstandardized exams that were mostly based on oral presentations (Travers, 1983, p. 92). But the floodgates were open. Within two decades written tests had clearly taken hold, not only in Boston, but across most of the United States. By 1866, for example, a school superintendent in Cleveland was preparing “thirty-four different sets of printed questions” (Reese, 2013, p. 174). This proliferation of testing continued. An example of an eighth-grade final exam given in 1895 in Salina, Kansas, is shown in Figure 1.5. Alongside the continued popularity of the new examination mode, an entire new publishing industry was born: guides, “keys,” and books filled with exam questions. With the pressure students feel when taking a written test, a logical antidote is preparation, even extreme preparation. Educators of the time were familiar with “special teachers,” or “crammers,” who prepped pupils to sit for exams in Oxford

8th Grade Final Exam: Salina, Kansas – 1895

Arithmetic (Time, 1.25 hours)
1. Name and define the Fundamental Rules of Arithmetic.
2. A wagon box is 2 ft. deep, 10 feet long, and 3 ft. wide. How many bushels of wheat will it hold?
3. If a load of wheat weighs 3942 lbs., what is it worth at 50 cts. per bu, deducting 1050 lbs. for tare?
4. District No. 33 has a valuation of $35,000. What is the necessary levy to carry on a school seven months at $50 per month, and have $104 for incidentals?
5. Find cost of 6720 lbs. coal at $6.00 per ton.
6. Find the interest of $512.60 for 8 months and 18 days at 7 percent.
7. What is the cost of 40 boards 12 inches wide and 16 ft. long at $.20 per inch?
8. Find bank discount on $300 for 90 days (no grace) at 10 percent.
9. What is the cost of a square farm at $15 per acre, the distance around which is 640 rods?
10. Write a Bank Check, a Promissory Note, and a Receipt.

Grammar (Time, one hour)
1. Give nine rules for the use of Capital Letters.
2. Name the Parts of Speech and define those that have no modifications.
3. Define Verse, Stanza and Paragraph.
4. What are the Principal Parts of a verb? Give Principal Parts of do, lie, lay and run.
5. Define Case, Illustrate each Case.
6. What is Punctuation? Give rules for principal marks of Punctuation.
7–10. Write a composition of about 150 words and show therein that you understand the practical use of the rules of grammar.

Geography (Time, one hour)
1. What is climate? Upon what does climate depend?
2. How do you account for the extremes of climate in Kansas?
3. Of what use are rivers? Of what use is the ocean?
4. Describe the mountains of N.A.
5. Name and describe the following: Monrovia, Odessa, Denver, Manitoba, Hecla, Yukon, St. Helena, Juan Fernandez, Aspinwall and Orinoco.
6. Name and locate the principal trade centers of the U.S.
7. Name all the republics of Europe and give capital of each.
8. Why is the Atlantic Coast colder than the Pacific in the same latitude?
9. Describe the process by which the water of the ocean returns to the sources of rivers.
10. Describe the movements of the earth. Give the inclination of the earth.

Orthography (Time, one hour)
1. What is meant by the following: Alphabet, phonetic-orthography, etymology, syllabication?
2. What are elementary sounds? How classified?
3. What are the following, and give examples of each: Trigraph, subvocals, diphthong, cognate letters, linguals?
4. Give four substitutes for caret ‘u’.
5. Give two rules for spelling words with final ‘e’. Name two exceptions under each rule.
6. Give two uses of silent letters in spelling. Illustrate each.
7. Define the following prefixes and use in connection with a word: Bi, dis, mis, pre, semi, post, non, inter, mono, super.
8. Mark diacritically and divide into syllables the following, and name the sign that indicates the sound: Card, ball, mercy, sir, odd, cell, rise, blood, fare, last.
9. Use the following correctly in sentences, Cite, site, sight, fane, fain, feign, vane, vain, vein, raze, raise, rays.
10. Write 10 words frequently mispronounced and indicate pronunciation by use of diacritical marks and by syllabication.

U.S. History (Time, 45 minutes)
1. Give the epochs into which U.S. History is divided.
2. Give an account of the discovery of America by Columbus.
3. Relate the causes and results of the Revolutionary War.
4. Show the territorial growth of the United States.
5. Tell what you can of the history of Kansas.
6. Describe three of the most prominent battles of the Rebellion.
7. Who were the following: Morse, Whitney, Fulton, Bell, Lincoln, Penn, and Howe?

This is the eighth-grade final exam from 1895 from Salina, Kansas. It was taken from the original document on file at the Smoky Valley Genealogical Society and Library in Salina, Kansas and reprinted by the Salina Journal.

Source: http://www.kubik.org/lighter/test.htm

FIGURE 1.5

An example of a grade 8 final exam from Salina, Kansas, 1895

and Cambridge in England (Reese, 2013, p. 222). American educators worried that hiring teachers whose primary, if not sole, task was preparing students to take the test would degrade teaching as a profession. This sentiment led to the adoption of the English model and spawned the test preparation industry.

By the 1870s, strong opposition to the written test had taken root. Test preparation, administration, scoring, and reporting of results were significant undertakings. Officials received an increasing number of complaints from parents whose children experienced nervous attacks as a result of taking the tests. Not surprisingly, teachers figured heavily in this opposition. Not only did written tests increase their workload, they also affected the curriculum and their pedagogy. Teachers also felt surveilled and controlled by this new “objective” measure of their efficiency and effectiveness. The prestigious Journal of Education had to confess in 1884 that “written examinations were the ‘greatest evil’ in the schools and created perpetual ‘warfare between the teacher and examiner’” (Reese, 2013, p. 199).

The New York Regents Examinations

Within two decades of the introduction of written examinations to American public schools, the longest continuously operational testing program in the country was established. On July 27, 1864, the Board of Regents of the state of New York passed an ordinance stating that

At the close of each academic term, a public examination shall be held of all scholars presumed to have completed preliminary studies. … To each scholar who sustains such examination, a certificate shall entitle the person holding it to admission into the academic class in any academy subject to the visitation of the Regents, without further examination.
(OSA: NYSED, 1987)

The first Regents examinations were administered to eighth graders in November 1865 as a high school entrance examination. The amount of state funding for academies was based on the number of students enrolled, and the state needed a way of determining the number of bona fide academy students (The University of the State of New York, 1965). The exams were generally essay questions designed to determine whether students were academically prepared. Building on the relative success of the high school entrance exam, there soon arose the examination program for high school graduation and college admission. In June 1878, the modern system of high school examination was administered for the first time. About 100 institutions participated. The five tests covered algebra, American history, elementary Latin, natural philosophy, and physical geography. In 1879, after evaluating the results of the first administration, the Board of Regents approved a series of 42 examinations for secondary schools to be given in November, February, and June of each year (OSA: NYSED, 1987).

A likely cause of the longevity of the Regents examination program is that it evolved as a collaborative effort among teachers, school superintendents, and state employees. As such, it was rooted in the existing curriculum and yet was designed to incorporate change. Because the program has been around so long, it is fascinating to see how the questions asked have changed over the years.

The oldest Regents exam that is accessible is the Physical Geography examination administered in June 1884 (NYSL: NYSED, 2019, September 20).³ There were limited directions given to the students other than that they would have two and a half hours to write their responses. Other directions dealt only with proper identification on submitted materials. Does this indicate that impersonation was the prevalent test administration irregularity in those days? There were 21 items on the test, each requiring a short answer, although some items had multiple parts. Each correct answer carried an allocated number of points, for a total of 48 points. It was stated clearly on the test that 36 points was the passing score, which is 75% of the total possible points. We found no information on how the Regents determined the passing score.

The Physical Geography examination administered in June 1895 is available from the New York State Library.⁴ The amount of time allocated was three hours. There were no directions pertinent to student identification; a proctor script may have been used instead. There were only 15 items on the test, and fully correct responses received 10 points each. One important feature of this test was the implementation of a self-adaptive element. Of the 15 test items, students were instructed to respond to only 10. The total score was 100, and the passing score again was 75. Other than the last question, which might not pass a modern bias and sensitivity review, the later test seems to be better thought out than its predecessor. The test developers of the Regents Exam likely learned many lessons about designing items and making tests that improved their instruments over time. Over the next dozen or so Physical Geography tests through 1940 (NYSL: NYSED, 2019, September 20), NYSED incorporated different item types, such as fill-in-the-blank and matching items, which resulted in more objective tests. The directions became more specific, and the use of multiple measures became apparent: scores on the test were combined with an indicator of performance in the laboratory portion to determine passing, which remained equivalent to 75%. An example of directions from a 1934 Physical Geography test is shown below:

Write at top of first page of answer paper (a) name of school where you have studied, (b) number of weeks and recitations a week in physical geography, with the total number of laboratory periods and the length of such periods. A paper lacking the statement of laboratory work will not be accepted at a standing of less than 75 credits.

The minimum time requirement is five recitations a week for a school year. An unprepared laboratory exercise of two periods counts in place of one recitation. At least 30 laboratory exercises are required.⁵

From Phrenology to Intelligence Testing

Concurrent with the advent of written examinations in American schools in the 1800s, the pseudo-sciences of craniology and phrenology were on the rise. Craniology is the study of the shape, size, and proportions of the skulls of different human races. Phrenology is the study of the same attributes as they relate to character and intelligence. The timing of these historic events may be coincidental, but some historians contend otherwise, as Horace Mann and Samuel Gridley Howe were proponents of phrenology. For Mann and Howe, “the natural laws and moral imperatives of phrenology justified a secular scientific curriculum and a ‘softer’ child-centered pedagogy as the means of correctly training a rational and virtuous citizenry” (Tomlinson, 2005, p. xiv). Tomlinson (2005) details how phrenology was used to justify the sweeping social reforms that Mann and Howe espoused.

On the other side of the pond,

[w]hen Alfred Binet (1857–1911), director of the psychology laboratory at the Sorbonne, first decided to study the measurement of intelligence, he turned naturally to the favored method of a waning century and to the work of his great countryman Paul Broca [(1824–1880)].
(Gould, 1981, p. 146)

Broca was a strong believer in the direct relationship between human intelligence and the sizes of brain and skull, a belief broadly held in those days (Rushton and Ankney, 2007). Following the precedent of Samuel Morton (1799–1851), an American physician and anatomy professor who became well known for his large collection of human skulls, Broca developed numerous techniques to study the form, structure, and topography of the brain and skull in order to identify and differentiate human races. Broca’s studies were cited by Charles Darwin (1871) in support of the theory of evolution in his book The Descent of Man (Rushton and Ankney, 2007).
Furthermore, Broca had originated methods of establishing the ratio of brain to skull, information that was later used by Italian physician and anthropologist Cesare Lombroso (1835–1909). Lombroso, in his book Criminal Man, published in 1897, “suggested that detectable physiognomic and cranial traits could identify people who were born to offend” (Sirgiovanni, 2017, p. 166).

Binet initially set out to measure intelligence by following Broca’s “medical” approach. He went from school to school making Broca’s recommended

measurements on the heads of pupils designated by teachers as their most and least intelligent. He found the differences much too small to be meaningful and was concerned that the source of even these minor differences was his own suggestibility. He suspended this line of study and came back to measuring intelligence through “psychological” methods. With his student Theodore Simon (1873–1961), Binet constructed “a set of tasks that might assess various aspects of reasoning more directly” (Gould, 1981, p. 149), and the Binet-Simon Scale was born. The scale was intended only to identify children in need of special help, a very humane goal. The harmful hereditarian interpretation was introduced by American psychologists when the test came to America (Gould, 1996).

Conclusion

Four hundred years ago, newcomers to this continent sought shelter and sustenance for themselves and then education for their children. Having established that enterprise, they sought to make sure that the individuals teaching their children and the institutions in which their children were being taught were performing up to a reasonable standard. To reach and maintain those standards, pioneers like Horace Mann and members of the New York State Board of Regents imposed standardized assessments and systems of administering them and using their results. Those early exams were certainly crude by today’s standards, but they were miles ahead of what they replaced. They opened the door for more sophisticated approaches over the next two centuries. They also laid the foundation for the introduction of ability testing as devised by Alfred Binet and others, an enterprise that would have a lasting impact on American life, as we shall see in subsequent chapters.

Notes

1 We would like to thank our respective spouses, Michael Nering and Deb Ackerman, for their immeasurable patience during the writing of this chapter, and Mike Beck for a thoughtful review and comments on an earlier draft of this chapter.
2 https://www.hmleague.org/horace-mann-schools/
3 https://nysl.ptfs.com/data/Library1/Library1/pdf/7590547_Physical-Geography-Jun-1884.pdf
4 https://nysl.ptfs.com/data/Library1/Library1/pdf/7590547_Physical-Geography-Jun-111895.pdf
5 https://nysl.ptfs.com/data/Library1/113643.PDF

References

Armstrong, F. M. (1950). The Syms-Eaton Free School: Benjamin Syms, 1634. Thomas Eaton, 1659. Houston, TX: Houston Print. and Pub. House.
Cohen, S. (1974). A History of Colonial Education, 1607–1776. New York, NY: John Wiley and Sons.
Cubberley, E. P. (1920). The History of Education. Cambridge, MA: Riverside Press.
Darwin, C. (1871). The Descent of Man, and Selection in Relation to Sex, two volumes. London: John Murray. Reprinted, J. Moore and A. Desmond (Eds.), London: Penguin Classics, 2004. Retrieved from http://darwin-online.org.uk/EditorialIntroductions/Freeman_TheDescentofMan.html.
Department of Education, Commonwealth of Massachusetts (1930). The Development of Education in Massachusetts, 1630–1930. In Selections from Archives and Special Collections, Bridgewater State University. Item 5. Available at: http://vc.bridgew.edu/selections/5.
Fraser, J. (2019). The School in the United States: A Documentary History, 4th edition. Abingdon, UK: Routledge.
Gathier, M. (2008). Homeschool: An American History. New York, NY: Macmillan.
Gould, S. J. (1981). The Mismeasure of Man. New York, NY: Norton.
Gould, S. J. (1996). The Mismeasure of Man (revised and expanded). New York, NY: Norton.
Hanson, A. F. (1994). Testing Testing: Social Consequences of the Examined Life. Berkeley, CA: University of California Press.
Harper, E. P. (2010). Dame Schools. In T. Hunt, T. Lasley, & C. D. Raisch (Eds.), Encyclopedia of Educational Reform and Dissent (pp. 259–260). Thousand Oaks, CA: SAGE Publications.
Heatwole, C. (1916). A History of Education in Virginia. New York, NY: Macmillan.
Jeynes, W. H. (2007). American Educational History: School, Society, and the Common Good. Thousand Oaks, CA: Sage Publications.
Johnson, C. S. (2009, December 31). History of New York State Regents Exams. Retrieved from http://files.eric.ed.gov/fulltext/ED507649.pdf.
Kelly, M. (2020, February 11). Parent Role in Education is Critical for Academic Success. Retrieved from https://www.thoughtco.com/parent-role-in-education-7902.
Mann, H. (1844). Seventh Annual Report of the Board of Education; Together With the Seventh Annual Report of the Secretary of the Board. Boston, MA: Dutton & Wentworth State Printers. Available at https://hdl.handle.net/2027/chi.18465022?urlappend=%3Bseq=6.
Mann, H. (Ed.) (1845). Common School Journal, volume 7, No. XIX–XXIII.
Meriwether, C. (1907). Our Colonial Curriculum 1607–1776. Washington, DC: Capital Publishing Co.
Morison, S. E. (1930). Builders of the Bay Colony [chapter entitled “Henry Dunster, President of Harvard”, pp. 183–216].
Nedelsky, L. (1954). Absolute grading standards for objective tests. Educational and Psychological Measurement, 14 (1), 3–19.
New York State Library: New York State Education Department (NYSL: NYSED). (2019, September 20). New York State Regents Exams (PDF Files). Retrieved from http://www.nysl.nysed.gov/regentsexams.htm.
Office of Student Assessment: New York State Education Department (OSA: NYSED). (1987, November 24). History of Regents Examinations: 1865 to 1987. Retrieved from http://www.p12.nysed.gov/assessment/hsgen/archive/rehistory.htm.
Plimpton, G. A. (1916). The hornbook and its use in America. Proceedings of the American Antiquarian Society, 26, 264–272.
Reese, W. J. (2013). Testing Wars in the Public Schools: A Forgotten History. Cambridge, MA: Harvard University Press.
Reese, W. J. (2013, April 20). The first race to the top. New York Times, p. SR8. Retrieved from https://www.nytimes.com/2013/04/21/opinion/sunday/the-first-testing-race-to-the-top.html.
Rushton, J. P., & Ankney, C. D. (2007). The evolution of brain size and intelligence. In S. M. Platek, J. P. Keenan, & T. K. Shackelford (Eds.), Evolutionary Cognitive Neuroscience (pp. 121–161). Cambridge, MA: MIT Press.
Sirgiovanni, E. (2017). Criminal heredity: the influence of Cesare Lombroso’s concept of the “born criminal” on contemporary neurogenetics and its forensic applications. Journal of History of Medicine, 29 (1), 165–188.
Tomlinson, S. (2005). Head Masters: Phrenology, Secular Education, and Nineteenth Century Social Thought. Tuscaloosa, AL: The University of Alabama Press.
Travers, R. M. W. (1983). How Research Has Changed American Schools: A History From 1840 to the Present. Kalamazoo, MI: Mythos Press.
The University of the State of New York. (1965). Regents Examinations 1865–1965: 100 Years of Quality Control in Education. Albany, NY: The University of the State of New York and the State Education Department.

2 DEVELOPMENT AND EVOLUTION OF THE SAT AND ACT

Michelle Croft and Jonathan J. Beard¹

College admissions tests were initially developed as a mechanism to standardize the college admissions process and provide students greater access to highly selective institutions. Over time and in response to societal changes, the testing programs have shifted to accommodate a broader range of college applicants as well as a broader range of uses of the test scores. As the origins of the SAT and the ACT test (hereinafter “ACT”) have been well documented, this chapter will provide a brief summary of the origins of the two assessments but primarily focus on these shifts in society and in the testing programs, as well as on access to postsecondary education and to the tests themselves. The rest of this chapter is organized using broad themes that will structure the discussion of college admissions tests and their place in the modern era: assessment origins; assessment transparency (including test preparation, test security, and truth in testing); detection and removal of cultural bias; the use of college admissions tests for K–12 federal accountability; and access for students with disabilities and English language learners. Last, we offer a brief discussion of lessons learned from the constraints and demands placed on the testing profession.

SAT Development

When the College Entrance Examination Board, the developer of the SAT, was founded in November 1900, there was effectively no uniformity in the college admissions process, in terms of either the subject matter required or the degree of proficiency needed to succeed in postsecondary education. It was difficult for students to know the standard that would be applied to them for college admissions. For college administrators, the variability among secondary schools and students led to the development of unique sets of exams in multiple
subjects for each institution. The College Board exams were designed to bring a standard of uniformity to the selection process. The content of the exam was to be uniform in terms of subject matter, the exams were to be uniformly administered in terms of date and exam timing across multiple locations, and all responses were to be scored in a uniform way to the same standard. The exam results would serve to provide some economy of force with respect to admissions decisions and clarity of expectations on the part of students. The production of the exams was somewhat secondary to the overarching goal of providing a means of communication between schools and colleges moving toward a more uniform secondary school curriculum (Angoff & Dyer, 1971, p. 1). However, without exams serving as a unifying aspect, such communication and uniformity were unlikely to be achieved. Although there was some tension among schools, colleges, and the Board regarding curricula and course content, the implementation of the exams reflected a genuine if ambitious desire to bring “law and order into an educational anarchy” with regard to the subject matter of postsecondary preparation and the degree to which the subject matter should be mastered (Angoff & Dyer, 1971, pp. 1–2).

The first “College Boards” were given in 1901 as all-essay responses in the subjects of English, French, German, Latin, Greek, history, mathematics, chemistry, and physics, given over the course of several days. In the next iteration, Spanish, botany, geography, and drawing were added. Scores were given as ratings: Excellent, Good, Doubtful, Poor, and Very Poor. Corresponding percentage scores were given as 100–90, 89–75, 74–60, 59–40, and less than 40. Although the exams were initially designed to test students’ factual subject knowledge, by 1925 there had been a pivot to an approach in which students would apply general knowledge and working principles to novel situations (Angoff & Dyer, 1971, p. 2).
The change prompted some pushback on the part of stakeholders out of concern that it would shift the emphasis from subject matter knowledge and mastery to more of a “superficial cleverness” (Angoff & Dyer, 1971, p. 2; see also Donlon, 1984; Lawrence et al., 2003; Zwick, 2002).

The first multiple-choice Scholastic Aptitude Test (SAT) was administered in 1926, to just over 8,000 examinees. The change was made, in part, for cost and time efficiency, as well as to remove subjective judgment from the scoring process (see, e.g., Trachsel, 1992). Scores were given on nine subtests: Definitions, Arithmetical Problems, Classification, Artificial Language, Antonyms, Number Series, Analogies, Logical Inference, and Paragraph Reading. Although the name and intent of the test were to “[distinguish] it from tests of achievement in school subjects” (Angoff & Dyer, 1971, p. 2), the developers cautioned that the test was not designed to measure general intelligence or “mental alertness.” Even at its humble outset, discussions of the college admissions test included warnings about placing too great an emphasis on test scores, stating that to do so “is as dangerous as the failure [to] properly evaluate any score or rank in conjunction with other measures” (Angoff & Dyer, 1971, p. 2). The SAT score was best regarded as a complement to the applicant’s educational experience.

Between 1929 and 1941, many of the most familiar features of the SAT were established with a few broad changes to the test:

1. In April 1929, the test was divided into two major sections: verbal and mathematical.
2. In 1937, a second administration of the SAT was offered in June.
3. In 1941, the SAT scale (200–800) was established using the April administration (about 11,000 examinees).
4. Also in 1941, the regular procedure of equating June scores to the April form was introduced.
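As a rough illustration of what setting a score scale on a reference administration and equating a later form to it involve, the sketch below uses simple linear transformations. It is a minimal sketch under stated assumptions, not the College Board's actual 1941 procedure, which the chapter does not describe: the target mean of 500 and standard deviation of 100 are assumed values for the reference group, the linear method stands in for whatever equating method was used, and the function names and raw scores are hypothetical.

```python
from statistics import mean, pstdev

SCALE_MIN, SCALE_MAX = 200, 800  # reporting range named in the text


def make_linear_scale(reference_raw, target_mean=500.0, target_sd=100.0):
    """Return a function mapping raw scores to the 200-800 scale.

    Assumption: the reference group is placed at target_mean/target_sd;
    these values are illustrative and not taken from the chapter.
    """
    ref_mean = mean(reference_raw)
    ref_sd = pstdev(reference_raw)

    def to_scale(raw):
        z = (raw - ref_mean) / ref_sd
        scaled = target_mean + target_sd * z
        # Truncate to the reporting range.
        return max(SCALE_MIN, min(SCALE_MAX, round(scaled)))

    return to_scale


def linear_equate(new_form_raw, reference_raw):
    """Return a function placing new-form raw scores on the reference form's raw metric.

    Linear equating matches the means and standard deviations of the two
    forms; it assumes the two groups of examinees are comparable.
    """
    new_mean, new_sd = mean(new_form_raw), pstdev(new_form_raw)
    ref_mean, ref_sd = mean(reference_raw), pstdev(reference_raw)

    def to_reference(raw):
        return ref_mean + ref_sd * (raw - new_mean) / new_sd

    return to_reference


# Hypothetical raw scores: a reference (April) form and a slightly harder (June) form.
april_raw = [20, 30, 40, 50, 60]
june_raw = [15, 25, 35, 45, 55]

to_scale = make_linear_scale(april_raw)
june_to_april = linear_equate(june_raw, april_raw)

# An average June raw score is first placed on the April raw metric, then scaled.
print(to_scale(june_to_april(35)))
```

In this toy data the June form runs five raw points lower than the April form, so an average June examinee receives the same scaled score as an average April examinee. That is the point of equating: scores from different forms become interchangeable on the reported scale.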

Other notable changes (e.g., to test parts and allotted time) occurred over the much longer period of 1941–2002. A summary of all the changes to the SAT is shown in Figure 2.1.²

FIGURE 2.1 Summary of changes to the SAT. [Chart not reproduced. Rows list item types. Verbal: antonyms (6-choice, 5-choice), parts of speech, analogies (select 4th term, select 2 terms, select 2nd pair), paragraph reading, reading comprehension, critical reading, double definition, sentence completion, definitions, classification, artificial language, logical inference, synonyms (2-choice). Math: problems (fill in, 6-choice, 5-choice), arithmetic word problem, data sufficiency, number series completion, quantitative comparison, student-produced response. Columns are the years 1926–1960, 1974, 1975, 1985, 1994, 1995, and 2002; an x in the original marks each year in which an item type appeared.]

Throughout the SAT’s evolution, it should be noted that early writings about the SAT indicate that something other than specific content mastery was part of the test. The best description regarding the intent of the SAT to complement a student’s educational experience is provided by Donlon & Angoff (1971):

The SAT was, in a sense, intended to provide some redress for possible errors and inconsistencies in secondary school records and in the old essay examinations that were tailored to specific curriculums. By stressing the direct measurement of basic abilities, the rationale was that it would offer an opportunity for a more balanced assessment of the student who had failed to achieve subject-matter mastery in keeping with his development of these basic abilities. (p. 15)

This line of thinking is understandable, as the SAT was part of College Board’s Admission Testing Program, consisting of the SAT and several achievement tests. The achievement tests were much more focused on content mastery within a very specific domain. Fremer and Chandler note that the usefulness of the SAT as an indicator of a student’s potential for college work depends in large measure on the fact that the SAT measures general ability as it has developed over the full range of experiences in a person’s life (1971, p. 147, emphasis added). In a similar vein, Coffman notes that the SAT was designed to identify students who possess the skills needed to do college-level work regardless of what they may have studied in high school (1971, p. 49).

At the outset, the goal of having a test composed of items that measured attainment in a given domain, without being overly contingent on the specific curriculum a student may have been exposed to, was appealing; developing item types that function well for that purpose would be considered a worthy endeavor. However, the goal of having items that functioned well psychometrically, and were not completely subject to differences in formal instruction, did not come without a cost. A natural tension seemed to exist between the traits that were being measured and the types of items that were included on previous versions of the SAT. On the one hand, the questions were written in such a unique way that a reasonable amount of content knowledge and problem-solving ability had to be brought to bear to answer them. On the other hand, because of their relative uniqueness in terms of how the questions were asked, these item types lent themselves to being somewhat coachable (Gladwell, 2001; Slack & Porter, 1980a; Jackson, 1980; Slack & Porter, 1980b; DerSimonian & Laird, 1983). However, it is important to note that having students become familiar with the nature of the test was fundamental to the early SAT. From 1926 through 1944, students were required to show a completed practice test in order to take the exam for admission purposes (Fremer & Chandler, 1971). After 1944, examinees no longer had to demonstrate completion of a practice test, and descriptive information and sample questions were made available to students.

As noted in Figure 2.1, many changes and modifications have been made to the SAT testing program. A number of the changes have moved the test further from “puzzle-solving” question types to ones that require knowledge and reasoning skills closely related to those experienced in school coursework (Lawrence et al., 2003). With each iteration, the test changed in a way such that the interpretations of the test scores could be more informative and ultimately more useful for test score users.
Some of the changes were in response to external demands or constraints, whereas other updates were based on changes in curricular emphasis. Lawrence et al. (2003) provide an exceptionally well-written summary of the changes made to the test up to that time, and the reasoning for why particular modifications were made.³

The latest redesign of the SAT in 2016 encompassed a comprehensive and deliberate move toward clear, specific, and transparent aspects of the SAT testing program (College Board, 2015b). Broadly, the redesign removed the required essay component that had been added in 2005, added subscores, reduced answer options from five to four, and ended the scoring practice in which fractions of points were subtracted for wrong answers (i.e., formula scoring; College Board, 2015b). Lastly, a distinctive element of the 2016 redesign was the change in test specifications to focus more on rigorous coursework and problem solving in a college-readiness context (College Board, 2015a). Lawrence et al. (2003) succinctly articulate the importance of change when necessary: any testing program must strive to be as fair as possible to examinees and as informative as possible for college admissions decisions. The latest changes to the SAT with respect to transparency, as well as near universal access to practice and review materials, meet both goals: fairness to students and usefulness to colleges.
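The formula-scoring practice that the 2016 redesign ended can be made concrete with a short sketch. It assumes the conventional rule in which each wrong answer on a k-option item costs 1/(k − 1) of a point, so that blind guessing has an expected gain of zero. The chapter says only that fractions of points were subtracted for wrong answers, and the function names and example counts below are hypothetical.

```python
from fractions import Fraction


def formula_score(num_right, num_wrong, num_options):
    """Raw score with a guessing penalty: R - W / (k - 1).

    Omitted items are neither rewarded nor penalized. The penalty
    1 / (k - 1) is the conventional choice and is an assumption here.
    """
    if num_options < 2:
        raise ValueError("an item needs at least two answer options")
    return Fraction(num_right) - Fraction(num_wrong, num_options - 1)


def rights_only_score(num_right):
    """Raw score when wrong answers carry no penalty."""
    return Fraction(num_right)


# Hypothetical examinee: 40 right, 8 wrong, 2 omitted on five-option items.
print(formula_score(40, 8, 5))
print(rights_only_score(40))

# Expected gain from one blind guess on a five-option item under formula scoring:
# a 1-in-5 chance of +1 point and a 4-in-5 chance of -1/4 point.
expected_guess = Fraction(1, 5) * 1 + Fraction(4, 5) * Fraction(-1, 4)
print(expected_guess)
```

Under rights-only scoring the same examinee keeps all 40 points, and guessing on an unknown item can only help, which is one reason dropping the penalty simplifies the advice given to test takers.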

ACT Development

While use of the SAT had become common at highly selective postsecondary institutions in the northeastern United States during the 1950s, many regions and institutions were without a standard college admissions test. This situation became especially problematic after World War II, when there was an increase in the number of postsecondary applicants, and these applicants were much more diverse than those attending elite institutions using the SAT (Lazerson, 1998). The increase was due in part to the growing US economy, an increase in the population generally, and additional federal money being spent on higher education that made it possible for more students to attend college. For instance, the G.I. Bill of 1944 (Servicemen’s Readjustment Act, Pub. L. 78–346) expanded access to higher education by providing veterans with financial aid. Similarly, the National Defense Education Act (NDEA), enacted in 1958 (Pub. L. 85–864), provided low-interest loans for higher education and eventually evolved into the National Direct Student Loan Program and the Perkins Loan Program. Further, the Great Society programs of the 1960s included additional federal initiatives to expand access to higher education. For example, the Economic Opportunity Act of 1964 (Pub. L. 88–452) authorized grants that became college access programs such as Upward Bound, which provides college admissions support to high school students from low-income families and from families in which neither parent holds a bachelor’s degree. The Economic Opportunity Act also funded Work/Study Programs to promote part-time employment of college students from low-income families in need of earnings to help cover college costs (Pub. L. 88–452, Sec. 121).

In response to the increase in college applicants, states started to create their own admissions tests, as the SAT was not marketed toward those institutions (ACT, 2009). The tests were of varied quality and made it difficult for out-of-state students interested in attending other states’ colleges to participate. In addition to help with admissions decisions, universities needed assistance with placement decisions once students enrolled. There was also a concern that aptitude tests (as the SAT was designated at the time) were not the appropriate tool for determining college admissions on a large scale. Instead, there should be a tool that could “gauge students’ readiness and ability to perform college-level activities” (ACT, 2009, p. 9).

For these reasons, the ACT was developed.⁴ It grew out of the Iowa Academic Meet, which was the first Iowa test program for high schools. The Meet included multiple rounds of testing to identify outstanding scholars, and the first round, the “Every-Pupil Test,” was administered to every high school student enrolled in the tested subject. The purpose of assessing all students was not only to identify outstanding scholars but also to raise the standards of instruction and to interest students in the subject matter (Lindquist, 1976). The Iowa Academic Meet eventually grew into the Iowa Tests of Basic Skills and the Iowa Test of Educational Development (ITED). The new tests were
designed to emphasize “the development of skills and generalized abilities, as opposed to rote learning of subject matter” and to require problem solving and critical thinking (ACT, 2009, pp. 2–3). The ACT was first administered in 1959. The pace of its development was supported by the ability to leverage the ITED and pre-test items during the administration of the ITED. The organization also benefited from the technical capability of a newly developed optical scanner, which had the ability to process tens of thousands of answer sheets a year. The first ACT included four sections: English, Mathematics, Social Studies, and Science. The Social Studies and Science tests emphasized reasoning and problem solving through the interpretation of readings in the relevant subjects. The test took three hours, with 45 minutes for each section, and the scores were reported on a 0-to-36 scale, with a mean composite score for college admissions purposes. The ACT grew quickly. In the 1959–60 school year, 132,963 students took the test (ACT, 2009); in 1962–63, 368,943 students took it; and 961,184 students took the test in 1967–68. Over the years, there were minor changes to the ACT. After reviews of textbooks for grades 7–12 and interviews with experts, there were multiple, more significant, changes to the test in 1989 (ACT, 2009): it was updated to provide subscores that could be helpful for course placement purposes; the Social Studies test was replaced by a Reading test that measured “pure” reading ability (ACT, 2009, p. 64); and the Science section was changed to focus on the scientific process. ACT continues to update the test based on changes in high school curriculum and what is needed for college. Currently, the ACT uses a National Curriculum Survey administered approximately every three to five years to identify the most current college-level coursework expectations and high school curriculum content (ACT, 2016).

Transparency: Test Preparation, Test Security, and Truth in Testing

Despite the growing number of students taking the SAT and ACT post-World War II, students and the general public did not know much about the tests. Information about scoring procedures and validity studies was not generally released (Robertson, 1980). Students were not privy to copies of their answers after the tests, nor were the tests themselves publicly released. The lack of information about the tests led to the rise of test-preparation companies, many of which were making false and misleading advertising claims (see Haney, Madaus, & Lyons, 1993; Zwick, 2002). Ultimately, the Federal Trade Commission conducted an inquiry related to SAT and Law School Admission Test preparation companies in 1978.

One outcome of the rise of test-preparation companies was the implementation of test security measures. Although some test prep companies published collections of items that were only purportedly included on the tests, others
published items taken directly from the tests, raising concerns about test security (Angoff & Dyer, 1971). As a result, College Board began to implement procedures to secure the SAT item pool (Angoff & Dyer, 1971). Some of these measures included the use of sealed, numbered item booklets and of procedures around shipping and storing the booklets. In instances where scores were questioned, students were offered the chance to retake the test to confirm the scores received originally. ACT introduced similar procedures at about the same time (ACT, 2009, p. 64). Since the introduction of test security measures, both organizations have continued to refine and adapt test administration policies with the goal of preventing test irregularities; however, as not all irregularities can be prevented, both organizations have adopted additional mechanisms to detect breaches if they do occur.

Another outcome was the introduction of truth-in-testing laws. Two states, New York and California, enacted legislation that required test providers to provide more information about the tests to the public.⁵ New York’s law was passed in 1979 and required providers to file with the New York Commissioner of Education “all test questions used in calculating the test subjects’ raw score” and “the corresponding acceptable answer to those questions,” as well as data about test takers (NY Educ. Code § 341, 342). College Board, along with the Graduate Management Admissions Council, the Test of English as a Foreign Language Policy Council, and the Educational Testing Service, filed a lawsuit in 1990 seeking a preliminary injunction, claiming that the law violated the Copyright Act of 1976 (College Entrance Examination Bd. v. Pataki, 889 F. Supp. 554, N.D.N.Y. 1995).⁶ The plaintiffs claimed that the disclosure provisions had forced them to reduce the yearly number of test dates offered in New York, even as they increased the number of dates offered in all other states, because offering fewer dates meant that they had to release fewer test forms each year. On a motion for reconsideration, the court specified the number of forms that must be released annually (College Entrance Examination Bd. v. Pataki, 893 F. Supp. 152, N.D.N.Y., July 26, 1995). The court decision was codified in statute in 1996 (NY Educ. Code § 342). The statute was later changed to require the release of either two-thirds of tests administered in the test year or, for the entities that were included in the lawsuit, a specific number of test forms depending on the number of test dates offered in the state.

California enacted a truth-in-testing law in 1978 (Cal. Educ. Code § 99150–99160). The original version of the law differed from New York’s in that test providers could file with the state educational commission either a completed sample test or a list of “representative” test questions and answers as opposed to actual test forms (Cal. Educ. Code § 99152; Robertson, 1980). Later, the law was changed to require release of a certain proportion of test forms (Cal. Educ. Code § 99157). In addition to release of test content (either complete or representative), both states required disclosures related to scoring methodology and reporting
(Robertson, 1980; The Educational Testing Act of 1979, 1980). California also required financial disclosures about fees and expenses for each testing year.

Although ACT was not a party to the 1990 lawsuit, it also reduced the number of its test dates in New York to comply with the law (ACT, 2009, p. 50). It was not until 2019 that a minor update to the law’s release requirement allowed ACT to offer an additional test date (New York S.B. 8639, 2018; ACT, 2018).

Additionally, in response to these laws, College Board and ACT both began releasing items and test forms not just to students in California and New York but nationally. College Board made a sample set of questions available to the public in 1978 (Valley, 1992), and a year after allowing New York SAT examinees to receive a copy of their exam, answer sheet, and correct answers in 1980, it extended the opportunity to examinees in other states (Fiske, 1981). Similarly, in December 1982, ACT began selecting national test dates when students could receive copies of their test questions and answers (ACT, 2009). After first publishing a booklet that included sample questions and general test-taking information in the early 1980s, ACT began to share items with various publishers for the creation of test preparation materials (ACT, 2009). In 1985, ACT offered a free publication, Preparing for the ACT Assessment, which included a full test form with answer key and a worksheet to help students analyze their strengths and weaknesses on each section of the test.

Both organizations have continued to offer free test prep materials and have expanded free test prep opportunities. For example, College Board partners with Khan Academy to offer personalized study plans based on students’ performance on past SAT results or diagnostic quizzes (College Board, n.d.), while ACT offers a free online learning tool and test practice program (ACT, n.d.).
The rise of test preparation materials issued by College Board and ACT has also drawn criticism. Some critics argued that previous versions of the SAT were not truly an aptitude test, because test prep companies had been coaching students on particular SAT items. College Board and other researchers conducted seven studies investigating short-term coaching and found only small and insignificant gains (Angoff & Dyer, 1971; Messick et al., 1980).⁷ ACT’s position was that, because the ACT tested students’ familiarity and facility with what they learned in high school, coaching could not provide the content knowledge necessary to make a difference in a student’s score. However, “familiarity with the test format, procedures, and test-taking skills,” along with reviewing content knowledge that had been learned several years before taking the test, did have an impact on a student’s score (ACT, 2009, p. 58). In any case, it is reasonable to conclude that, before truth-in-testing laws, test prep that exposed students to the type of content tested could have impacted scores, not because the students had greater knowledge or abilities, but simply because they had been exposed to the type of content to be tested.

Detection and Removal of Cultural Bias

The lack of transparency around what was tested may have also led to perceptions that the tests were culturally biased. Some of the criticism may have been due to lack of information about the tests, particularly as reports regarding validity were not widely available. Other criticism may have been in part due to score gaps noted between majority and minority students (Fallows, 1980; Weber, 1974; The Educational Testing Act of 1979, 1980). In order to address these concerns, throughout the 1970s and 1980s, both College Board and ACT refined their policies for reviewing items for cultural bias.

By 1980, College Board had replaced its informal policies related to detecting item bias with a formal policy that included specific guidelines for test development (see Valley, 1992; Dorans, 2013; and Zwick, 2002). The guidelines provided specific instructions to item writers to include material reflecting different cultural backgrounds as well as to avoid material that may be potentially offensive. In addition, several early approaches to quantifying item bias, such as Angoff’s delta plot method, were introduced for the SAT (see Angoff, 1972); by the late 1980s, and in response to a settlement in the lawsuit Golden Rule Life Insurance Company v. Illinois Insurance Director and Educational Testing Service, these approaches had been replaced with differential item functioning (DIF) analysis (see Faggen, 1987). Broadly, DIF analysis matches examinees of different subgroups (e.g., race/ethnicity or gender) on their level of knowledge and skills and then compares the performance of different groups on test items (Dorans & Kulick, 1986; Holland & Thayer, 1988; Dorans & Holland, 1992; and Dorans, 2013). If DIF is not present, the likelihood of answering a particular item correctly is determined by the student’s level of ability, not by any other characteristic (e.g., race/ethnicity).
College Board and ACT also conducted validity work to examine the predictive validity evidence for different racial/ethnic groups (see, e.g., Cleary, 1968; Maxey & Sawyer, 1981; and Dorans, 2013). ACT, which had conducted content and bias review panels since its inception, “engaging representatives from various minority groups to review the language and content of questions” (ACT, 2009, p. 58), added a second process in 1981 to detect potential bias using statistical reviews of data from operational items. This process, like College Board’s early approaches, was eventually replaced by DIF analysis.
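The matching-then-comparing logic of DIF analysis described above can be sketched with one widely used statistic, the Mantel-Haenszel common odds ratio. This is a minimal sketch under stated assumptions rather than either organization's operational procedure: the chapter does not say which DIF statistic each program computed, the −2.35 rescaling to a delta-style metric is a convention brought in from outside the chapter, and the function names and data are hypothetical.

```python
import math
from collections import defaultdict


def mantel_haenszel_odds_ratio(records):
    """Common odds ratio for one item across matched score levels.

    records: iterable of (group, matching_score, correct), where group is
    "reference" or "focal" and correct is True/False. Examinees are
    matched by stratifying on matching_score (e.g., total test score).
    """
    # counts[score][group] = [number correct, number incorrect]
    counts = defaultdict(lambda: {"reference": [0, 0], "focal": [0, 0]})
    for group, score, correct in records:
        counts[score][group][0 if correct else 1] += 1

    numerator = 0.0
    denominator = 0.0
    for cell in counts.values():
        ref_right, ref_wrong = cell["reference"]
        foc_right, foc_wrong = cell["focal"]
        total = ref_right + ref_wrong + foc_right + foc_wrong
        numerator += ref_right * foc_wrong / total
        denominator += ref_wrong * foc_right / total
    if denominator == 0:
        raise ValueError("odds ratio undefined for these data")
    return numerator / denominator


def mh_d_dif(odds_ratio):
    """Express the odds ratio on a delta-style metric; 0 means no DIF."""
    return -2.35 * math.log(odds_ratio)


def expand(group, score, n_right, n_wrong):
    """Build hypothetical response records for one group at one score level."""
    return [(group, score, True)] * n_right + [(group, score, False)] * n_wrong


# Hypothetical item A: at each score level both groups succeed at the same rate.
no_dif = (
    expand("reference", 10, 20, 20) + expand("focal", 10, 10, 10)
    + expand("reference", 20, 30, 10) + expand("focal", 20, 15, 5)
)
print(mantel_haenszel_odds_ratio(no_dif))

# Hypothetical item B: at the same score levels the focal group succeeds less often.
with_dif = (
    expand("reference", 10, 20, 20) + expand("focal", 10, 5, 15)
    + expand("reference", 20, 30, 10) + expand("focal", 20, 10, 10)
)
alpha = mantel_haenszel_odds_ratio(with_dif)
print(alpha, mh_d_dif(alpha))
```

An odds ratio near 1 (a delta-style value near 0) indicates that matched examinees from both groups have similar chances of answering correctly. For item B the ratio exceeds 1, so the rescaled value is negative, which would flag the item as harder for the focal group than for equally able reference-group examinees.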

Use Within High Schools: Standards, Relevance, and Accountability

As the tests were redesigned to be more reflective of the high school curriculum, more states were interested in using the tests as part of their accountability systems. The shift started during the 2001 No Child Left Behind (NCLB)
reauthorization of the Elementary and Secondary Education Act (ESEA) (Public Law 107–110) but increased after the introduction of the ESEA Waiver program in 2011 and the subsequent reauthorization of ESEA, the Every Student Succeeds Act (ESSA) in 2015. NCLB required states to adopt content standards and aligned assessments in grades 3–8 and once in high school for reading and mathematics, as well as science for certain grade spans (20 U.S.C. § 1111(b)(3)). NCLB gave states discretion in terms of the rigor of the state’s content standards, but the general emphasis was on proficiency (Hoff, 2002). During this time, three states—Illinois, Maine, and Michigan—opted to use the SAT or ACT as part of the state’s high school assessment system (Camara et al., 2019). With NCLB’s emphasis on proficiency levels, there was growing dissatisfaction with its implementation. By 2011, four out of five schools were not expected to meet proficiency goals within the next year, raising concerns that many schools could not meet the law’s goal of 100 percent proficient by 2014 (Duncan, 2011; Obama, 2011). Although ESEA was due for Congressional reauthorization, there was not much movement to do so. As a stopgap, the Obama administration created the ESEA Waivers, under which qualifying states would have greater flexibility in designing their own accountability systems to enable greater student proficiency in ELA, math, and science (USED, 2016a). One of the terms states needed to meet to receive a waiver was adoption of more rigorous college- and career-ready academic standards. The change to college and career readiness expectations was partially a recognition that state standards adopted under NCLB were not sufficiently rigorous (National Center for Education Statistics, 2007; Byrd Carmichael, Wilson, Porter-Magee, & Martino, 2010). 
It was also an acknowledgment of an effort led by the National Governors Association and the Council of Chief State School Officers to develop common, research- and evidence-based content standards that would help prepare students for college and careers (Common Core State Standards Initiative, n.d.). The Common Core State Standards working group included experts from both College Board and ACT to provide information on postsecondary readiness (National Governors Association & The Council of Chief State School Officers, 2009). At its peak, all but four states adopted the Common Core State Standards (Ujifusa, 2015a), and although a number of states would subsequently change their standards, the changes have largely been minor (Friedberg et al., 2018). Because of the ESEA Waivers and state adoption of the Common Core State Standards, the majority of states would eventually have similar standards that emphasized college and career readiness. In addition to changing academic standards, there was also mounting political pressure to reduce student testing time. After the Common Core State Standards were developed, the Obama administration funded the development of tests aligned to the standards (US Department of Education, 2010). Two state-led consortia emerged in response to the funding opportunity, and each eventually

developed its own test: the Partnership for Assessment of Readiness for College and Careers (PARCC) assessment and the Smarter Balanced Assessment Consortium (Smarter Balanced) assessment. Both of the tests had significantly longer administration times than previous state tests (Doorey & Polikoff, 2016, p. 29; Camara et al., 2019). In high school in particular, test length was an issue because students were also preparing for other tests, such as college admissions tests and Advanced Placement tests, that more directly affected their lives after high school. After the introduction of the two tests, there was a rise in the number of parents opting their children out of taking them, particularly at the high school level (see, e.g., Bennett, 2016; Croft, 2015; Golden & Webster, 2019). A number of states received letters from the US Department of Education about their low student participation rate in statewide testing, which could impact their ability to continue receiving federal funding (Ujifusa, 2015b). Therefore, by the time ESSA was enacted, states were looking for ways to decrease testing at all grade levels, but particularly in high school. In addition to issues related to student participation in testing, there were also changes to accountability systems more generally. For example, where NCLB had been rigid in terms of criteria for identifying schools in need of improvement and the consequences and supports applied to schools once they were so identified, ESSA gave states leeway in determining such criteria (Lyons, D’Brot, & Landl, 2017). Also, while student academic achievement remained a significant component of a state’s required accountability system, ESSA added other components, including a school quality and/or student success indicator.
Further, the law explicitly allowed districts, with state approval, to administer a locally selected, nationally recognized test such as the SAT or ACT in place of the state’s chosen high school academic achievement assessment (20 U.S.C. § 1111(b)(2)(h)). In some cases, the state’s chosen assessment was the SAT or ACT. Since 2001, a number of states had already been administering one of the tests to all of their eleventh graders during the school day and at no cost to the student, as a way to encourage more of them to apply to college. Under ESSA, some states chose to carry this practice forward as their academic achievement indicator to increase the quantity and quality of test participation (Camara et al., 2019; Marion, 2018) as well as to potentially improve college-going rates (Hurwitz et al., 2015; Hyman, 2017). In the 2018–2019 school year, a total of 30 states administered the SAT or ACT in eleventh grade as either a state requirement or as an optional test for which the state reimburses individual districts that choose to administer it (Croft, Vitale, & Guffy, 2019). Thirteen states used the SAT or ACT as the state’s academic achievement indicator under ESSA, and 26 states used the SAT and/or ACT under ESSA as an acceptable measure of postsecondary readiness within the school quality/student success indicator. The use of SAT and ACT scores for accountability has not been without controversy. Critics contend that the tests were not designed with accountability in mind and that they are not sufficiently aligned to state content standards to be

appropriate for such use (Marion & Domaleski, 2019; Achieve, 2018). However, others contend that there is sufficient validity evidence to support such use (Camara et al., 2019). As of this writing, the SAT and ACT have each substantially met the US Department of Education’s peer review requirements in at least one state (Brogan, 2018; Brogan, 2019). However, additional alignment evidence will still be needed before each test can fully meet requirements.

Access for Students with Disabilities and English Learners

Given the expansion of testing to all students within a state, it is increasingly important that the tests are accessible and accurately reflect what students know and can do. Federal law has historically pushed testing companies toward expanding access for students with disabilities by providing accommodations. Prior to federal involvement, College Board and ACT were already providing accommodations such as additional time, use of a typewriter, and Braille versions of the tests for students with physical disabilities (Donlon & Angoff, 1971). However, testing organizations cautioned those who used the scores to make admissions decisions not to place substantial importance on them, because not enough research had been done to support their comparability with the scores of examinees who had taken the test without the accommodations. Instead, the test users were encouraged to weigh accommodated students’ previous academic records more heavily in their decision making than those of non-accommodated students (Donlon & Angoff, 1971; Laing & Farmer, 1984). As the research base grew, this caution, also known as flagging, was given only when the accommodation included extended time (Lewin, 2000, 2002). Accommodations were not required by federal law until enactment of the Rehabilitation Act of 1973 (20 U.S.C. § 701 et seq.), which prohibited the use of tests that have an adverse impact on persons with disabilities, including learning disabilities⁸ (34 C.F.R. 104.42(b)(2)).
The Act required that when a test is administered to an applicant who has a handicap that impairs sensory, manual, or speaking skills, the test results accurately reflect the applicant’s aptitude or achievement level or whatever other factor the test purports to measure, rather than reflecting the applicant’s impaired sensory, manual, or speaking skills (except where those skills are factors that the test purports to measure) (34 C.F.R. 104.42(b)(3)(i)). The Rehabilitation Act did not directly require College Board and ACT to provide appropriate accommodations themselves but, rather, regulations enacted in 1980 required postsecondary institutions receiving federal funding to ensure that the admissions tests they used provided them (34 C.F.R. 104.3). In 1990, the

Americans with Disabilities Act (ADA) (42 U.S.C. § 12189; United States Department of Justice, 2014) further expanded opportunities for individuals with disabilities to access postsecondary education by continuing the accommodations requirements from the Rehabilitation Act. After passage of the ADA, there was growing concern among disability rights advocates about the practice of flagging test scores for nonstandard administrations. Critics contended that the practice “raises issues of stigma, privacy, and discrimination against disabled examinees” (Mayer, 1998). The criticism was not limited to the SAT and ACT. Eventually a lawsuit was filed against Educational Testing Service (ETS), the makers of the Graduate Management Admission Test (GMAT), when an examinee’s test scores were flagged as a nonstandard administration because of the examinee’s use of extended time and a trackball (Breimhorst v. Educational Testing Service, 2000). After a motion to dismiss was denied, ETS opted to settle the lawsuit and discontinue flagging GMAT scores as well as scores on the Graduate Record Examinations and the Test of English as a Foreign Language (Sireci, 2005). Although College Board was not involved in the GMAT lawsuit, it opted to convene a “Blue Ribbon Panel on Flagging” jointly with the Breimhorst plaintiff’s attorneys in the spring of 2002 (Gregg, Mather, Shaywitz, & Sireci, 2002; Sireci, 2005). The panel consisted of six members, two of whom were psychometricians, and a nonvoting chair. The panel reviewed more than a dozen studies but could not reach consensus. In the end, a four-member majority recommended that flagging be discontinued, and College Board accepted the recommendation (Sireci, 2005). Shortly thereafter, upon reviewing its own research, ACT announced that, in fall 2003, it would no longer flag scores of students who tested with extended time (Lewin, 2002). The laws and policies discussed here pertain only to students with disabilities.
But as more states began to use the SAT and ACT as part of their accountability systems, an increasing number of English learners were being required to take the tests. There was a concern that the scores of these examinees did not accurately reflect the students’ skills and knowledge but only their degree of English proficiency. However, because these students were not covered by the ADA, they were not offered accommodations. The matter of extending accommodations to English learners was complicated by the fact that being a “learner” is a temporary state and the examinee would eventually become proficient in English. As part of ESSA negotiated rulemaking, language was proposed in April 2016 related to accommodations both for English learners and students with disabilities (US Department of Education, 2016b). The “equal benefits” language required states to ensure that appropriate accommodations were provided to students with disabilities and English learners such that neither group would be denied the benefits of participation in the test: namely, a college reportable score (34 C.F.R. 200.3(b)(2)(i)). The regulations were published for public comment in July 2016 and were finalized in December

2016. Like the Rehabilitation Act, the regulations are not directly applicable to College Board and ACT; rather, in this case, they apply to any state seeking to use the scores as part of its accountability system. Soon after the regulations were released for public comment, ACT convened “a panel of external experts representing state education agencies, colleges, English learner and bilingual policy administrators from state departments of education, civil rights advocates, testing and measurement experts, and researchers” to help determine whether and what supports could be offered to English learners without violating the tested constructs and thus invalidating the college reportability of their scores (ACT, 2016; Moore, Huang, Huh, Li, & Camara, 2018). The panel recommended providing four testing supports for English learners: additional time; use of an approved word-to-word bilingual glossary; test instructions in the student’s native language; and testing in a non-distracting environment. Shortly after ACT announced the availability of English-learner supports, College Board also did so (College Board, 2016).

Discussion

Societal demands and shifts in policy helped establish the role of the SAT and ACT and have ultimately contributed to their ongoing evolution and improvement. The programs have expanded their scope, and the tests are used by a broader and more diverse range of students and postsecondary institutions than before. At the same time, the test content has become more aligned with high school curricula and the content needed for college. The testing companies also established mechanisms to check for bias and made changes to provide more transparency into the testing process through the release of test forms and broader sharing of general information about scoring and predictive validity. As education policy continues to evolve, we expect that there will continue to be shifts and refinements in the programs.

Notes

1 The authors would like to thank Michael Walker, Wayne Camara, and Dan Vitale for their thoughtful reviews of an earlier draft of this chapter.
2 More detailed information regarding changes and modifications can be found in Angoff, 1971; Valley, 1992; Dorans, 2002; and Lawrence et al., 2003, although the reader should note that there are some minor inconsistencies across the three, especially between Angoff and Lawrence et al. The latter also contains an excellent summary of the included items and testing time allotted (Lawrence et al., 2003, pp. 6, 10).
3 For example, the 1994 revision introduced calculators on certain multiple-choice items and on grid-in item types for the math section. The 1994 revision also removed antonyms. In addition to changes in item type, the Math and Verbal scales were recentered in 1995 (Dorans, 2002). Similarly, in 2005 changes were made to add an essay and remove analogies. Readers are encouraged to read the original Lawrence et al. (2003) report for detailed information.

4 For a detailed description of the history of ACT, please see ACT, 2009.
5 In addition, Colorado, Florida, Hawaii, Maryland, Ohio, Pennsylvania, and Texas proposed truth-in-testing bills, and there were two proposed federal truth-in-testing bills (Robertson, 1980).
6 The lawsuit was based on a similar lawsuit by the Association of American Medical Colleges (928 F.2d 519, 2nd Cir.), which was filed in 1979 and took nearly ten years of litigation before the court ruled for the plaintiff.
7 See Messick et al. (1980) for an excellent summary of additional studies.
8 In the 1985–1986 school year, 75 percent of the 5,000 students who received accommodations on the ACT had been classified as learning disabled (ACT, 2009, p. 60).

References

Achieve (2018, March). What Gets Tested Gets Taught: Cautions for using college admissions tests in state accountability systems. Achieve. https://www.achieve.org/files/CollegeAdmissionsExamBrief2018.pdf.
ACT (2009). ACT: The First Fifty Years, 1959–2009. ACT, Inc.
ACT (n.d.). Free ACT Test Prep. https://www.act.org/content/act/en/products-and-services/the-act/test-preparation/free-act-test-prep.html.
ACT (2016). ACT National Curriculum Survey 2016. ACT, Inc. http://www.act.org/content/dam/act/unsecured/documents/NCS_Report_Web.pdf.
ACT (2018, Nov. 6). ACT to add February ACT test date in New York state. ACT Newsroom & Blog. http://leadershipblog.act.org/2018/11/act-to-add-february-act-test-date-in.html.
The Americans with Disabilities Act of 1990, Pub. Law 101–336, 42 U.S.C. § 12101 et seq.
Angoff, W. H. (1972, September). A technique for the investigation of cultural differences. Paper presented at the meeting of the American Psychological Association, Honolulu, HI. https://files.eric.ed.gov/fulltext/ED069686.pdf.
Angoff, W. H., & Dyer, H. S. (1971). The Admissions Testing Program. In W. H. Angoff (Ed.), The College Board Admissions Testing Program: A technical report on research and development activities relating to the Scholastic Aptitude Test and Achievement Tests (pp. 1–13). College Entrance Examination Board.
Bennett, R. E. (2016, April). Opt out: An examination of issues. ETS Research Report Series. https://doi.org/10.1002/ets2.12101.
Breimhorst v. Educational Testing Service, C-99–3387 WHO (N.D. Cal. 2000).
Brogan, F. (2018). Letter to Superintendent Evers. https://www2.ed.gov/admins/lead/account/nclbfinalassess/wi7.pdf.
Brogan, F. (2019). Letter to Superintendent Rice. https://www2.ed.gov/admins/lead/account/nclbfinalassess/michigan9.pdf.
Byrd Carmichael, S., Wilson, S. W., Porter-Magee, K., & Martino, G. (2010, July 21). The State of State Standards—and the Common Core—in 2010. Thomas B. Fordham Institute. https://fordhaminstitute.org/national/research/state-state-standards-and-common-core-2010.
Cal Ed. Code § 99150–99160.
Camara, W., Mattern, K., Croft, M., Vispoel, S., & Nichols, P. (2019). Validity argument in support of the use of college admissions test scores for federal accountability. Educational Measurement: Issues and Practice, 38(4), 12–26.
Cleary, T. A. (1968). Test bias: Prediction of grades of Negro and White students in integrated colleges. Journal of Educational Measurement, 5(2), 115–124.

Coffman, W. E. (1971). The Achievement Tests. In W. H. Angoff (Ed.), The College Board Admissions Testing Program: A technical report on research and development activities relating to the Scholastic Aptitude Test and Achievement Tests (pp. 49–77). College Entrance Examination Board.
College Board (2015a). Compare SAT Specifications. College Board. https://collegereadiness.collegeboard.org/sat/inside-the-test/compare-old-new-specifications.
College Board (2015b). Test Specifications for the Redesigned SAT. College Board. https://collegereadiness.collegeboard.org/pdf/test-specifications-redesigned-sat-1.pdf.
College Board (2016, Dec. 5). Press release: College Board simplifies request process for test accommodations. https://www.collegeboard.org/node/24406.
College Board (n.d.). SAT practice on Khan Academy. https://collegereadiness.collegeboard.org/sat/practice/khan-academy.
College Entrance Examination Bd. v. Pataki, 893 F.Supp. 152 (N.D.N.Y. 1995).
College Entrance Examination Bd. v. Pataki, 889 F.Supp. 554 (N.D.N.Y. 1995).
Common Core State Standards Initiative (n.d.). Development process. http://www.corestandards.org/about-the-standards/development-process/.
Croft, M. (2015). Opt-outs: What is lost when students do not test. ACT. https://www.act.org/content/dam/act/unsecured/documents/5087_Issue_Brief_Opt_Outs_Web_Secured.pdf.
Croft, M., Vitale, D., & Guffy, G. (2019, April). Models of using college entrance examinations for accountability. Paper presented at the National Council on Measurement in Education Annual Meeting, Toronto, ON.
DerSimonian, R., & Laird, N. M. (1983). Evaluating the effect of coaching on SAT scores: A meta-analysis. Harvard Educational Review, 53(1), 1–15.
Donlon, T. F. (1984). The College Board Handbook for the Scholastic Aptitude Test and Achievement Tests. College Entrance Examination Board Publishing.
Donlon, T. F., & Angoff, W. H. (1971). The Scholastic Aptitude Test. In W. H. Angoff (Ed.), The College Board Admissions Testing Program: A technical report on research and development activities relating to the Scholastic Aptitude Test and Achievement Tests (pp. 15–47). College Entrance Examination Board.
Doorey, N., & Polikoff, M. (2016, February). Evaluating the Content and Quality of Next Generation Assessments. Thomas B. Fordham Institute. https://files.eric.ed.gov/fulltext/ED565742.pdf.
Dorans, N. J. (2002). Recentering and realigning the SAT score distributions: How and why. Journal of Educational Measurement, 39(1), 59–84.
Dorans, N. J. (2013). ETS contributions to the quantitative assessment of item, test, and score fairness. ETS R&D Scientific and Policy Contributions Series (ETS SPC-13–04). https://www.ets.org/Media/Research/pdf/RR-13-27.pdf.
Dorans, N. J., & Holland, P. W. (1992). DIF Detection and Description: Mantel-Haenszel and standardization. (ETS Research Report RR-92–10). Educational Testing Service.
Dorans, N. J., & Kulick, E. (1986). Demonstrating the utility of the standardized approach to assessing unexpected differential item performance on the Scholastic Aptitude Test. Journal of Educational Measurement, 23(4), 355–368.
Duncan, A. (2011, March 9). Winning the Future With Education: Responsibility, reform and results. Testimony before the House Committee on Education and the Workforce. https://www.ed.gov/news/speeches/winning-future-education-responsibility-reform-and-results.
Economic Opportunity Act of 1964, Pub. L. 88–452.
The Educational Testing Act of 1979 (1980): Hearings before the Subcommittee on Elementary, Secondary, and Vocational Education of the Committee on Education and Labor, House of Representatives, 96th Congress.

The Every Student Succeeds Act, 20 U.S.C. § 1111(b)(3).
Faggen, J. (1987). Golden Rule revisited: Introduction. Educational Measurement: Issues and Practice, 6(2), 5–8.
Fallows, J. (1980). The Tests and the “brightest”: How fair are The College Boards? The Atlantic Monthly, 245(2), 37–48.
Fiske, E. B. (1981, March 27). “Truth in testing” to be nationwide. New York Times, Section A, p. 1. https://www.nytimes.com/1981/03/27/nyregion/truth-in-testing-to-be-nationwide.html.
Fremer, J., & Chandler, M. O. (1971). Special Studies. In W. H. Angoff (Ed.), The College Board Admissions Testing Program: A technical report on research and development activities relating to the Scholastic Aptitude Test and Achievement Tests (pp. 147–178). College Entrance Examination Board.
Friedberg, S., Barone, D., Belding, J., Chen, A., Dixon, L., Fennell, F. S., Fisher, D., Frey, N., Howe, R., & Shanahan, T. (2018). The State of State Standards Post-Common Core. Thomas B. Fordham Institute. https://fordhaminstitute.org/national/research/state-state-standards-post-common-core.
Gladwell, M. (2001, Dec. 10). Examined life. The New Yorker. https://www.newyorker.com/magazine/2001/12/17/examined-life.
Golden, E., & Webster, M. (2019, Sept. 14). More Minnesota students opting out of state tests. Minnesota Star Tribune. http://www.startribune.com/more-minnesota-students-opting-out-of-state-tests/560350012/.
Gregg, N., Mather, N., Shaywitz, S., & Sireci, S. (2002). The Flagging Test Scores of Individuals With Disabilities Who Are Granted the Accommodation of Extended Time: A report of the majority opinion of the Blue Ribbon Panel on Flagging. https://dralegal.org/wp-content/uploads/2012/09/majorityreport.pdf.
Haney, W. M., Madaus, G. F., & Lyons, R. (1993). The Fractured Marketplace for Standardized Testing. Springer Netherlands.
Hoff, D. (2002, Oct. 9). States revise the meaning of “proficient.” Education Week. https://www.edweek.org/ew/articles/2002/10/09/06tests.h22.html.
Holland, P. W., & Thayer, D. T. (1988). Differential item performance and the Mantel-Haenszel procedure. In H. Wainer & H. Braun (Eds.), Test Validity (pp. 129–145). Erlbaum.
Hurwitz, M., Smith, J., Niu, S., & Howell, J. (2015). The Maine question: How is 4-year college enrollment affected by mandatory college entrance exams? Educational Evaluation and Policy Analysis, 37(1), 138–159. http://journals.sagepub.com/doi/full/10.3102/0162373714521866.
Hyman, J. (2017). ACT for all: The effect of mandatory college entrance exams on postsecondary attainment and choice. Education Finance and Policy, 12, 281–311. http://www.mitpressjournals.org/doi/full/10.1162/EDFP_a_00206.
Jackson, R. (1980). The Scholastic Aptitude Test: A response to Slack and Porter’s “critical appraisal.” Harvard Educational Review, 50(3), 382–391.
Laing, J., & Farmer, M. (1984). Use of the ACT assessment by examinees with disabilities. ACT Research Report Series no. 84. ACT. https://www.act.org/content/dam/act/unsecured/documents/ACT_RR84.pdf.
Lawrence, I. M., Rigol, G. W., Van Essen, T., & Jackson, C. A. (2003). A Historical Perspective on the Content of the SAT (ETS RR-03–10). College Entrance Exam Board.
Lazerson, M. (1998). The disappointments of success: Higher education after World War II. The Annals of the American Academy of Political and Social Science, 559, 64–76. www.jstor.org/stable/1049607.

Lewin, T. (2000, April 11). Disabled student is suing over test-score labeling. The New York Times, Section 1, p. 18.
Lewin, T. (2002, July 28). ACT ends flags on test scores of the disabled. The New York Times, Section 1, p. 16. https://www.nytimes.com/2002/07/28/us/act-ends-flags-on-test-scores-of-the-disabled.html.
Lindquist, E. F. (1976, Oct. 5). Interview by James Beilman with Dr. E. F. Lindquist. University of Iowa Oral History Project.
Lyons, S., D’Brot, J., & Landl, E. (2017). State Systems of Identification and Support Under ESSA: A focus on designing and revising systems of school identification. Washington, DC: Council of Chief State School Officers. https://ccsso.org/sites/default/files/2017-12/State%20Systems%20of%20ID%20and%20Support%20-%20Designing%20and%20Revising%20Systems_0.pdf.
Marion, S. (2018). What’s Wrong With High School Testing and What Can We Do About It? National Center for the Improvement of Educational Assessment. https://files.eric.ed.gov/fulltext/ED587495.pdf.
Marion, S., & Domaleski, C. (2019). An argument in search of evidence: A response to Camara et al. Educational Measurement: Issues and Practice, 38(4), 27–28.
Maxey, J., & Sawyer, R. (1981). Predictive validity of the ACT assessment for Afro-American/Black, Mexican-American/Chicano, and Caucasian-American/White students. ACT Research Bulletin, ACT Archive.
Mayer, K. S. (1998). Flagging nonstandard test scores in admissions to institutions of higher education. Stanford Law Review, 50, 469–522.
Messick, S., Alderman, D. L., Angoff, W., Jungeblut, A., Powers, D. E., Rock, D., Rubin, D. B., & Stroud, T. W. F. (1980). The effectiveness of coaching for the SAT: Review and Reanalysis of Research from the Fifties to the FTC. (ETS RR-80–08). https://www.ets.org/Media/Research/pdf/RR-80-08.pdf.
Moore, J., Huang, C., Huh, N., Li, T., & Camara, W. (2018). Testing supports for English learners: A literature review and preliminary ACT research findings (ACT Working Paper 2018–2011). ACT, Inc. https://eric.ed.gov/?id=ED593176.
National Center for Education Statistics (2007). Mapping 2005 State Proficiency Standards Onto the NAEP Scales (NCES 2007–2482). US Department of Education. National Center for Education Statistics.
National Defense Education Act (NDEA), Pub. L. 85–864 (1958).
National Governors Association and The Council of Chief State School Officers (2009). Common Core State Standards Initiative K–12 Standards Development Teams. http://www.corestandards.org/assets/CCSSI_K-12_dev-team.pdf.
An act to amend the education law, in relation to standardized test administration, New York S.B. 8639 (2018).
New York Educ. Code § 341, 342.
The No Child Left Behind (NCLB) reauthorization of the Elementary and Secondary Education Act (ESEA), Pub. L. 107–110 (2001).
Obama, B. (2011, March 14). Remarks by the President on education. Kenmore Middle School, Arlington, Virginia. Retrieved on January 30, 2020. https://obamawhitehouse.archives.gov/the-press-office/2011/03/14/remarks-president-education-arlington-virginia.
The Rehabilitation Act of 1973, Pub. Law 93–112, 20 U.S.C. § 701 et seq.
Robertson, D. F. (1980). Examining the examiners: The trend toward truth in testing. Journal of Law & Education, 9, 183–193.
Servicemen’s Readjustment Act, Pub. L. 78–346.

Sireci, S. G. (2005). Unlabeling the disabled: A perspective on flagging scores from accommodated test administrations. Educational Researcher, 34(1), 3–12. https://doi.org/10.3102/0013189X034001003.
Slack, W. V., & Porter, D. (1980a). The Scholastic Aptitude Test: A critical appraisal. Harvard Educational Review, 50(2), 154–175.
Slack, W. V., & Porter, D. (1980b). Training, validity, and the issue of aptitude: A reply to Jackson. Harvard Educational Review, 50(3), 392–401.
Trachsel, M. (1992). Institutionalizing Literacy: The historical role of college entrance examinations in English. Southern Illinois University Press.
Ujifusa, A. (2015a, June 29). Map: Tracking the Common Core State Standards. Education Week. https://www.edweek.org/ew/section/multimedia/map-states-academic-standards-common-core-or.html.
Ujifusa, A. (2015b, Dec. 23). Education department asks 13 states to address low test-participation rates. Education Week. http://blogs.edweek.org/edweek/campaign-k-12/2015/12/twelve_states_asked_to_address.html.
US Department of Education (2010). Overview information: Race to the top fund assessment program; Notice inviting applications for new awards for fiscal year (FY) 2010. Federal Register, 72 no. 68. April 9, 2010. https://www.govinfo.gov/content/pkg/FR-2010-04-09/pdf/2010-8176.pdf.
US Department of Education (2016b, April 19). Title I, Part A negotiated rulemaking. https://www2.ed.gov/policy/elsec/leg/essa/title1a-assessment-consensus-regulatory-lang.pdf.
US Department of Education (2016a, May 12). ESEA flexibility. https://www2.ed.gov/policy/elsec/guid/esea-flexibility/index.html.
United States Department of Justice (2014). ADA requirements: Testing accommodations. Technical assistance. https://www.ada.gov/regs2014/testing_accommodations.pdf.
Valley, J. R. (1992). The SAT: Four major modifications of the 1970–85 era. College Board.
Weber, G. (1974). Uses and abuses of standardized testing in schools (Occasional Papers, No. 22). Council for Basic Education.
Zwick, R. (2002). Fair Game? The use of standardized admissions tests in higher education. RoutledgeFalmer.

3 THE HISTORY OF NORM- AND CRITERION-REFERENCED TESTING¹
Kurt F. Geisinger

The difference between norm-referenced and criterion-referenced testing relates to the frame of reference used in interpreting scores. Norm- and criterion-referenced tests may be developed and used differently as well. These two terms are used almost universally in regard to educational testing (as opposed to industrial testing or clinical assessment). In fact, the usage of these terms is almost exclusively tied to achievement testing and the assessment of students or trainees; the terms appear only very rarely in industrial training contexts. Their two histories provide considerable understanding of their similarities and differences. Because formal use of the term norm-referenced testing began approximately 100 years ago and use of the term criterion-referenced testing began some 50 years ago, the history of norm-referenced testing is discussed first. It should be clear, however, that there were norm- and criterion-referenced tests before these terms became common. From this point on, the terms CRT and NRT are used to represent the two approaches to testing (DuBois, 1970).

Norm-referenced Testing

Early Beginnings: Assessment of Intelligence

Modern psychological and educational testing is often thought to have begun in the late 19th and early 20th centuries. There are several individuals who propelled assessment forward. One of these was Wilhelm Wundt, who founded the very first psychology laboratory at the University of Leipzig. Wundt became especially well known for his work with experimental control in which all aspects of a psychological experiment are controlled in order to show clearly that the independent variable affected the dependent variable. Controlling all aspects of test administration so that only individual differences of test takers were

responsible for differences in tested performance became a critical component of test standardization. Test standardization was widely adopted, including by one of Wundt’s doctoral students, James McKeen Cattell, an American, who later began his career working in Francis Galton’s laboratory in England. According to DuBois (1970), Cattell began building psychological measures while working with Galton in Cambridge, England, and continued when he returned to the U.S. a few years later. In 1890, he became professor of psychology at the University of Pennsylvania and moved to Columbia University three years later. The psychological measures he built during these early days of psychology, however, were mostly psychophysical, and included such measures as dynamometric pressure as well as vision tests, hearing tests, taste and smell measures, touch and temperature measures, and others (Cattell, 1890), akin to those studied by Galton. After Cattell went to Columbia, he began using somewhat more mental tests (Cattell & Farrand, 1896). Subsequently, one of his students (Wissler, 1901) produced a validity study correlating these measures with college grades, and the results were extremely disappointing. About the same time as Cattell’s early work, Alfred Binet and one of his colleagues, Victor Henri, also a student of Wundt, began working on developing measures of applied thinking ability. Binet had been charged with helping Parisian public schools deal with the diversity of students. Henri brought Wundt’s approach to test administration for the Binet and Henri tests (Binet & Henri, 1896). Binet and his colleagues decided that they needed to assess higher order or complex thinking processes and used methods that assessed memory span-type questions. Several researchers at the time found that brighter students were more able to perform these memory span tests successfully than students with developmental delays. 
During the early 1900s, Binet studied early work in the identification of individuals with retardation and found this work useful. Working with Simon, Binet proposed that students failing to succeed should be assessed using a medico-psychological examination prior to being placed in a special education program or school. The Binet and Simon (1905) scale was essentially the first successful test of intelligence, and, in effect, the first modern (although rudimentary) norm-referenced test. Its normative orientation was quite crude at that time. It included 30 tests or subtests, all of which were administered in a highly standardized manner. In 1908, Binet included the use of the term “mental age,” a score that could characterize a child’s performance. In the next few years, the German psychologist William Stern developed the concept of intelligence as the ratio of mental age divided by chronological age (1911), a value that was later multiplied by 100 by Lewis Terman to become the intelligence quotient. After the publication of the 1908 Binet test of intelligence, it was widely accepted as a measure (DuBois, 1970), and was translated and adapted to American uses by several psychologists, such as Henry Goddard (1908, 1910) and Terman (1906, 1916). Once Terman published the Stanford-Binet, he began using the intelligence quotient to separate children by age and by

“intellectual group,” which were defined as less than 90, 90–110, and above 110. This approach would appear to be another step toward full-fledged NRT.

Prior to World War I, all formal cognitive testing was individually administered. G. M. Whipple (1910) was one of the first to call publicly for the economy of group testing, while still recognizing advantages of individual administration. Arthur Otis, a Terman student at Stanford, was working on group tests of intelligence that could be administered to multiple individuals simultaneously. A team of many of the most important testing psychologists in the country worked together and developed the Army Alpha group intelligence test in less than 6 months (Yoakum & Yerkes, 1920; see also Bunch, Ch. 4, this volume). Without the successful development of these intelligence tests, it is not clear that group achievement testing would have followed.

A revision of the Stanford-Binet was published in 1937. To challenge the popular Stanford-Binet, David Wechsler developed the Wechsler-Bellevue test of adult intelligence (1939). While the most noteworthy change from the Stanford-Binet was the addition of performance tests, more critical to the present discussion was that the Wechsler-Bellevue no longer used the intelligence quotient based on mental and chronological age and instead used age-based standard scores with a mean of 100 and a standard deviation of 15, and hence, moved to a fully NRT approach.
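
The contrast between the ratio-based intelligence quotient and the age-based standard score can be illustrated with a short sketch. The function names, argument names, and numbers below are hypothetical and chosen only for illustration; they are not taken from any test manual.

```python
def ratio_iq(mental_age, chronological_age):
    """Ratio IQ: mental age divided by chronological age, multiplied by 100."""
    return 100.0 * mental_age / chronological_age


def deviation_iq(raw_score, age_group_mean, age_group_sd, mean=100.0, sd=15.0):
    """Age-based standard score: locate a raw score within its own age group,
    then express it on a scale with the given mean and standard deviation."""
    z = (raw_score - age_group_mean) / age_group_sd
    return mean + sd * z


# Hypothetical child: mental age 10, chronological age 8.
print(ratio_iq(10, 8))            # 125.0

# Hypothetical examinee one standard deviation above the age-group mean.
print(deviation_iq(60, 50, 10))   # 115.0
```

The second function depends on the mean and standard deviation of a norm group, which is why the standard-score approach is norm-referenced in a way the ratio is not.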

Achievement Testing

The first serious developer of achievement testing in the United States was probably Joseph Mayer Rice. A physician, he re-trained in educational research methods of the late 1880s. He studied such subjects as spelling, language, and arithmetic, often testing thousands of students himself. His goal was not assessment, but rather increasing the efficiency and efficacy of educational processes. To make recommendations toward that goal, however, he needed measurements. He studied how school systems taught spelling (1897), arithmetic (1902), and language (1903), and E. L. Thorndike later cited these studies as influential on his work in educational assessment. These assessments moved educational research toward standardized tests (DuBois, 1970).

Thorndike and his students at Columbia University began a systematic study of ways to best assess a variety of educational constructs, including handwriting and arithmetic. Thorndike’s (1904) volume introducing measurement in the psychological and social sciences laid out an academic discipline, testing, in its infancy, but with many features that continue to be important today. He discussed the pre-testing of items, units and scales of measurement, the calculation of means and variances, graphing the distribution of scores, probabilistic interpretations of scores, the calculation and interpretation of correlations as a way to assess the degree of relationships, and the notion of reliability of measurement. Clearly, it provides many of the fundamentals of norm-referenced testing.

Thorndike (1910) also developed an 11-point scale (graphometer) by using several dozen competent judges to evaluate differences in the legibility of handwriting for students in grades 5–8. In evaluating handwriting, judges could consider examples for each point along the scale and attempt to decide which value a student’s handwriting most closely matched. (This approach may actually be an example of an early criterion-referenced scale.) In fact, the methodology he used parallels the careful development of a behaviorally anchored rating scale as is used in applied work today (e.g., Schwab et al., 1975; Smith & Kendall, 1963). Thorndike (1914) later published a test of reading words and sentences. Daniel Starch (1916) published a textbook that summarized the state of educational measurement and catalyzed considerable future work in educational testing and assessment. In general, the book was broken down by various academic subjects (e.g., reading, English grammar, writing, spelling, arithmetic), and it presented actual published and experimental tests in those subjects. There was no evaluation of the instruments per se, but the book was used as a reference. Neither was any statistical, psychometric, or test theory-type information provided. About this time, the previously mentioned development of Army Alpha brought about the use of multiple-choice items or questions, which also spurred work on test development, group testing, and test scoring. Subsequent to the publication of Starch’s book, the work of Thorndike, and the development of the multiple-choice item, there was an explosion of standardized educational tests in many fields as documented by J. M. Lee (1936) in terms of student achievement in secondary school subjects. One of Thorndike’s students who had worked on the Army Alpha was Benjamin D. Wood, who demonstrated that using objective items led to more reliably scored tests and measures that were better able to assess more of a content domain (1923). 
Wood believed that education was much too disorganized, placed too much emphasis on the freedom of teachers to teach whatever they wanted, and that one important way to systematize education was through the use of achievement testing. In this way he foretold efforts like “No Child Left Behind” as attempts to establish accountability (Lemann, 1999). He became a leader in the development of standardized public and higher education achievement testing. Such tests were built after item pre-testing, the development of standardized test administration instructions, and the development of norms. They were among the first professionally developed norm-referenced achievement tests. In the 1920s K–12 student achievement tests became available through the California Test Bureau (CTB), the World Book Company (which published the Stanford Achievement Test) and the University of Iowa (which published the Iowa Every Pupil Examination) (Academic Excellence, n.d.; Swafford, 2007). CTB has undergone several corporate changes since its early years. The first tests published by CTB dealt with student mathematical achievement. These first tests by CTB were followed by the Progressive Achievement Tests, which were later named the California Achievement Tests. Today, CTB is a division of Data

Recognition Corporation (DRC, n.d.) and publishes the TerraNova 3, which is a nationally normed achievement test for students in grades K–12 (California Test Bureau, 2012).

After his success with the Stanford-Binet, Terman was heavily involved in developing the Stanford Achievement Tests (Kelley, Ruch, & Terman, 1922), a set of graded tests still in use today. A notable difference between the Stanford-Binet Intelligence tests and the Stanford Achievement Tests is that rather than norming the tests on each chronological age as does the Stanford-Binet, the Stanford Achievement Tests were normed on academic grade levels. Students received grade-equivalent scores rather than age-equivalent scores so that students and their parents could see how they compared to other students within a national norm group for that grade. These tests measure developed academic skills such as reading comprehension, mathematical skills, and science knowledge, from kindergarten to 12th grade. The Stanford Achievement Tests currently are published by Pearson Assessments, Inc. and are group-administered NRTs. Because of their norms, they can be clearly identified as NRTs, and are comparable to other such student achievement tests. The primary uses of such tests include seeing how students compare to each other in terms of their learning; making predictions of future learning; and selecting students for special programs, scholarships, and the like.

The Iowa Every Pupil Examination was primarily developed by E. F. Lindquist and was designed solely for use in Iowa (Swafford, 2007). In the mid-1930s, through the Bureau of Educational Research, the Iowa Testing Program materials became available nationwide. However, it was through a partnership with Houghton Mifflin in 1940 that more widespread use of the Iowa Testing Program materials was generated. These are now known as the Iowa Assessments, which are published by Riverside Insights (Swafford, 2007).
These early achievement tests were used by schools and school systems to provide information on student achievement as well as on classroom-, school-, and district-level achievement. This beginning of K–12 achievement tests led to the use of specialized tests, such as the Stanford-Binet, for placement purposes for students who would benefit from specialized education and for college admissions testing. Many states now build their own tests (or have them built by contractors), and the use of some published NRT achievement measures has generally been reduced.

NRTs that may have the greatest public recognition are those for college admissions, specifically the SAT and ACT (see also Croft & Beard, Ch. 2, this volume). The SAT, first known as the Scholastic Aptitude Test (and later the Scholastic Assessment Test), essentially evolved from the Army Alpha intelligence test developed during the First World War. A young psychologist, Carl Brigham, an assistant on the development of the Alpha, taught at Princeton. At Princeton, he worked to adapt Alpha to a college admissions test. It was first administered experimentally to a few thousand college applicants in 1926. About this time, the

Ivy League colleges began to take an interest in attracting students from public schools, not only from their traditional private preparatory school feeder schools. James Bryant Conant, the president of Harvard, assigned Henry Chauncey, an assistant dean at Harvard, to find a new college admissions test to use to accept these new students and to provide scholarships to the worthiest candidates. Chauncey met Carl Brigham and came back to Conant with the recommendation to use the SAT. Conant liked the idea of using a test of intelligence (Lemann, 1999). In the late 1930s Chauncey convinced other member schools of the College Board to use the SAT for scholarships, and then in 1942, to do so for admissions. In 1948 the Educational Testing Service was chartered as a not-for-profit testing organization by the American Council on Education, the Carnegie Foundation for the Advancement of Teaching, and the College Board (then the College Entrance Examination Board). Beginning in 1941 and continuing until 1995, normative scores on the measure were based on a 1941 fixed reference normative group; in the middle 1990s a new reference group was developed and scores were re-centered (Anastasi & Urbina, 1997).

The ACT was instituted in 1959 in Iowa City, Iowa to provide an alternative college admission measure. Although the SAT was built on an intelligence test model, the ACT was instead built as an achievement test. Both the SAT and the ACT today measure what many term “developed academic abilities.” Both tests provide students and other test users with scaled scores rather than raw scores. The SAT yields two scores, Verbal and Quantitative. The ACT has four subtests (English, Mathematics, Reading, and Science Reasoning). The most common interpretation for students and parents, however, is the national percentile rank.
For example, if a composite score of 24 on the ACT corresponds to a percentile rank of about 74, a student with that score performed better than some 74% of the students taking the test nationally. In this way, students and others can interpret their scores with relative ease.
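
A percentile rank of this kind can be sketched in a few lines. The norm group below is hypothetical (20 invented composite scores, not actual ACT data), and the definition used here, the percentage of the norm group scoring strictly below a given score, is only one of several conventions; some publishers count half of the tied scores as well.

```python
def percentile_rank(score, norm_scores):
    """Percentage of the norm group scoring strictly below `score`."""
    below = sum(1 for s in norm_scores if s < score)
    return 100.0 * below / len(norm_scores)


# Hypothetical norm group of 20 composite scores (illustration only).
norm_group = [14, 16, 17, 18, 19, 20, 20, 21, 21, 22,
              22, 23, 23, 23, 24, 25, 26, 28, 30, 33]

print(percentile_rank(24, norm_group))  # 70.0
```

In this invented group, 14 of the 20 scores fall below 24, so a score of 24 has a percentile rank of 70.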

Norms and Norm Samples

A critical piece of any norm-referenced test is the norm group upon which the norms of performance are based. There are generally two criteria against which all norm groups are judged: representativeness and size. Representativeness relates to how well the sample approximates the population. Typically, for tests in the United States, test publishers attempt to have the percentages of representation in their norm groups match those of the U.S. Census as closely as possible. If a test is appropriate for children at different ages or grades in school, the sample must be broken down into such segments and those samples also need to be representative. If the sample is representative, then a larger sample is desired in order to reduce error. If a norm group is not representative, however, making it large may not be very useful. Potential test users should determine how representative

the norm group used for a particular test is of the population for which it will be used and how well the norm sample represents its intended population.

There have been cases in which the norm groups used to justify a test were not appropriate. In the late 1960s, the Head Start program of early childhood education, aimed especially at pupils likely to have some educational difficulties, was evaluated. The program was initially evaluated using the Peabody Picture Vocabulary Test as part of the evaluative evidence. Although there were reasons why this instrument was selected, it had been normed and validated using almost exclusively white children whose parents were college professors in one city in the United States (Geisinger, 2005). Such a group differed substantially from the population for which the test was intended. In such a situation, one must question the appropriateness of its use. Another test, the Hiskey-Nebraska Test of Learning Aptitude (Hiskey, 1966), was developed as a measure to be used with children who were hearing impaired. It was initially normed on a group of children who were hearing but who were told to feign being deaf. It was subsequently normed on a more appropriate group. This first norming of the Hiskey-Nebraska occurred some 60 years ago and, hopefully, a study like that would not be performed in the 21st century. Nevertheless, the significance of these issues is such that they remain examples of what not to do. (These scenarios are also described in Geisinger, 2005.)

The Normal Curve

Many observers of educational and psychological testing believe that many or all test scores fall along a normal curve, a theoretical distribution that is symmetrical with many data points at and around the mean of the distribution and with decreasing frequencies as one approaches either of the distribution tails. The normal curve follows a mathematical formula. Because the normal curve has a defined shape, it also has fixed proportions of data points falling below specific sections of the curve. Such a distribution is quite useful in education and psychology because test score distributions that follow the normal curve can be quickly interpreted in terms of where they fall in the distribution. Magnusson (1967) stated that many human characteristics such as height, weight and body temperature are well represented by the normal curve. He also asserted that following a normal curve helped various measurements to be considered on an interval rather than an ordinal scale. Anastasi (1958) also noted that many other human characteristics seem to fall along a normal distribution: lung capacity, autonomic balance, perceptual speed and accuracy, and intelligence scores. Therefore, it seems helpful when educational test scores can be fit to a normal curve without excessive fitting. It is a relatively simple process to fit the scores to a normal curve. Nevertheless, while the scores resulting from many norm-referenced tests may follow a normal curve, they need not do so.
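
The fixed proportions under the normal curve can be checked numerically. The sketch below relies on `NormalDist` from Python's standard `statistics` module; the helper names and the example scale (mean 100, standard deviation 15) are assumptions made for illustration.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0.0, sigma=1.0)


def proportion_between(z_low, z_high):
    """Proportion of a normal distribution lying between two z-scores."""
    return standard_normal.cdf(z_high) - standard_normal.cdf(z_low)


def percentile_from_score(score, mean, sd):
    """Percentage of a normal distribution falling below `score`."""
    return 100.0 * NormalDist(mean, sd).cdf(score)


# Roughly 68% of scores lie within one standard deviation of the mean
# and roughly 95% within two.
print(round(proportion_between(-1, 1), 4))
print(round(proportion_between(-2, 2), 4))

# On a scale with mean 100 and SD 15, a score of 115 sits near the 84th percentile.
print(round(percentile_from_score(115, 100, 15), 1))
```

These proportions apply only to the extent that the observed score distribution actually approximates a normal curve.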

Developing NRTs

There are two overall methods for developing NRTs, classical test theory (CTT; cf. Clauser, Ch. 8, this volume) and item response theory (IRT; cf. Luecht & Hambleton, Ch. 11, this volume). A goal of any norm-referenced measure is to spread out the scores of those taking the test in order to differentiate among test takers. In CTT, two or three actions are necessary to do so (Davis, 1952; Green, 1954; Henryssen, 1971; Thurstone, 1932). When using CTT we want each individual test item to have close to its maximal possible variance, which occurs in the middle of the distribution. For dichotomously scored items such as multiple-choice items, their item difficulties should be near 0.5, so that about one half of the test takers get each item correct. For essay prompts, the averages should still fall in the middle or near the middle of the possible distribution, again to maximize the item’s variance. The second component in CTT is trying to select items that lead to maximal internal consistency reliability, which also increases the variance of a test’s overall scores. This method requires selecting the items with the highest correlations with the total test score to increase internal consistency in terms of Coefficient Alpha or KR-20. A test with high internal consistency reliability typically contains items that measure the same or highly similar constructs, which does not necessarily mean the instrument measures a single construct. A high reliability also leads to the highest differentiation among test takers. Two such correlations, known as item discrimination indices, are most common: the point-biserial and the biserial. These indices inform the test developer of the extent to which the item appears to validly differentiate people in terms of the underlying construct. While CTT has been used successfully for many years to develop NRTs, it has some shortcomings.
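
The CTT item statistics named above (item difficulty, the point-biserial item-total correlation, and KR-20) can be sketched as follows. The response matrix is hypothetical, the function names are illustrative, and population variances are assumed throughout. The point-biserial here is the uncorrected version, in which the item is included in the total score.

```python
from statistics import mean, pvariance


def item_difficulty(item_scores):
    """Proportion of examinees answering a 0/1 item correctly."""
    return mean(item_scores)


def point_biserial(item_scores, total_scores):
    """Pearson correlation between a 0/1 item and the total test score."""
    mx, my = mean(item_scores), mean(total_scores)
    cov = sum((x - mx) * (y - my) for x, y in zip(item_scores, total_scores))
    sx = sum((x - mx) ** 2 for x in item_scores) ** 0.5
    sy = sum((y - my) ** 2 for y in total_scores) ** 0.5
    return cov / (sx * sy)


def kr20(responses):
    """KR-20 for a matrix with one row per examinee and one 0/1 column per item."""
    k = len(responses[0])
    totals = [sum(row) for row in responses]
    # For a 0/1 item, the variance is p * (1 - p), largest when p = 0.5.
    sum_item_variances = sum(p * (1 - p) for p in (mean(col) for col in zip(*responses)))
    return (k / (k - 1)) * (1 - sum_item_variances / pvariance(totals))


# Hypothetical responses: 6 examinees (rows) by 4 items (columns).
responses = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
    [1, 1, 1, 0],
]
totals = [sum(row) for row in responses]
item3 = [row[2] for row in responses]

print(item_difficulty(item3))                   # 0.5
print(round(point_biserial(item3, totals), 3))
print(round(kr20(responses), 3))
```

In this invented data set the third item has a difficulty of 0.5, the value at which a dichotomous item's variance is greatest.
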
With CTT a person’s achievement is characterized by the total score on the test and the mean and standard deviation of the score distribution (Price, 2017). Thus, a person’s score becomes dependent on the sample used to create the norms. Many of today’s NRTs, such as the Graduate Record Examination, use IRT methods to create the test. IRT is a probabilistic method in which an individual’s test score signifies the probability of an examinee getting an item correct based upon her/his ability and the item level difficulty (Price, 2017). Therefore, a person’s score is expected to be no longer sample dependent. Test questions (items) are generated in such a way that they form a continuum of the construct being measured, from low to high difficulty, which signifies level of achievement. People with higher achievement are more likely to correctly answer items of greater difficulty (Price, 2017). There are various IRT models, and depending upon the model used, there continue to be indices of item difficulty and item discrimination, as well as chance probabilities of answering a question correctly. Another advantage of tests developed using IRT methods is that tests can be specific to an individual through

computer adaptive testing (CAT). Many modern standardized tests (e.g., the Graduate Record Examination) take advantage of this IRT feature by using computer adaptive tests.
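
To make the IRT ideas in this section concrete, the sketch below uses the three-parameter logistic model, one of the IRT models in common use. The parameters a, b, and c stand for item discrimination, item difficulty, and the chance probability of a correct answer. The item bank, its parameter values, and the rule of choosing the most informative item at the current ability estimate are illustrative assumptions, not a description of any operational testing program.

```python
import math


def prob_correct(theta, a=1.0, b=0.0, c=0.0):
    """Three-parameter logistic model: probability of a correct response
    for an examinee of ability theta."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))


def item_information(theta, a=1.0, b=0.0, c=0.0):
    """Fisher information of a 3PL item at ability theta."""
    p = prob_correct(theta, a, b, c)
    return a ** 2 * ((p - c) / (1.0 - c)) ** 2 * (1.0 - p) / p


def next_item(theta, item_bank):
    """Simplified adaptive step: pick the item that is most informative
    at the current ability estimate."""
    return max(item_bank, key=lambda name: item_information(theta, **item_bank[name]))


# Hypothetical item bank (parameter values invented for illustration).
item_bank = {
    "easy":   {"a": 1.0, "b": -1.5, "c": 0.2},
    "medium": {"a": 1.2, "b": 0.0,  "c": 0.2},
    "hard":   {"a": 1.0, "b": 1.5,  "c": 0.2},
}

for theta in (-2.0, 0.0, 2.0):
    print(theta, next_item(theta, item_bank))
```

With these invented parameters, a low ability estimate leads to the easy item, a middling estimate to the medium item, and a high estimate to the hard item, which is the sense in which an adaptive test is specific to the individual.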

Validation

In the early days of testing, it was often stated that tests are valid to the extent that they accomplish their purpose(s). Increased variability and maximal reliability were considered critical for norm-referenced measures because many norm-referenced measures were evaluated by the correlations that they had with other relevant variables, that is, their ability to predict future behavior. In 1896, Karl Pearson published the formula for the correlation coefficient, which became critical in determining whether tests were empirically valid. J. P. Guilford (1946) expressed the importance of criterion-related validity by stating, “a test is valid for anything with which it correlates” (p. 429), a comment that was endorsed by other early leaders of the testing community (e.g., Bingham, 1937; Kelley, 1927; Thurstone, 1932). What we now consider criterion-related validity or criterion-related evidence of validity was initially the sole type of validity. It was so important that Cronbach and Meehl (1955) stated that construct validity was appropriate for those situations where useful criteria to be predicted were not available.

More recently, the concept of validity has widened to include evaluation of the appropriateness, usefulness, and meaningfulness of test scores (Messick, 1998). Adding to Messick’s well recognized perspectives, Michael Kane (2006; Kane & Bridgeman, Ch. 9, this volume) proposes the use of different criteria to determine validity based upon the assumptions made and the inferences drawn from test scores, while acknowledging that content-related evidence is essential. An interpretive argument for the use of test scores is developed and analyzed to determine if the interpretive argument includes the evidence for the stated assumptions.

Scaling and Equating

Angoff (1971) presents what many consider the most important document on norms, scaling, and equating in the history of classically developed educational measures. This chapter acknowledges only a few of his points, and only briefly. Norm groups have already been addressed. Most teacher-made tests are scored simply, typically using raw scores that are easily understood by the recipients. If the test has 10 questions, scores are often 10, 20, 30…90, 100, representing the percentages correct. If the teacher wishes to give partial credit for some answers, greater delineations of scores are possible. In either case, it is clear what scores mean, and students and their guardians can interpret the scores directly. Standardized test constructors, however, have a more difficult job. For example, we know that earning 90% on a test composed of easy questions is not the same as earning 90% on a test composed of very difficult questions.

Almost every standardized, norm-referenced educational test employs a scale other than the raw score. Many of these scales follow specific numerical systems that become well known to test users. A great number of tests use the T score scale, with a mean of 50 and a standard deviation of 10. Many intelligence tests use a mean of 100 and a standard deviation of 15. Moving from the raw score scale to one of these scaled score systems is a linear transformation and does not normally change the shape of the distribution: a graph of the scaled scores would look identical to a graph of the raw scores, and only the numerical values attached to the scores change. Many test score distributions approximate the normal distribution even without a normalization transformation.

Equating processes are beyond the scope of this chapter (cf. Kolen, Ch. 14, this volume). However, equating is most used for multi-form tests where different test forms must yield scores that are equally meaningful. Equating is a process that allows psychometricians to set scores to carry the same meaning across forms. Most of the procedures used were developed on norm-referenced measures for the following reasons. When students take the SAT or the ACT on multiple occasions, we want their scores to be equivalent. Achievement tests are often equated. States in the United States are required to test students in English Language skills and Mathematics annually from 3rd through 8th grade, and once in high school, and to test science knowledge at three grades. Determining whether annual improvements occurred requires the use of equivalent scores.
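
The linear transformation from raw scores to a scaled score system described in this section can be sketched as follows. The raw scores are hypothetical, the function name is illustrative, and the population standard deviation of the score group is assumed as the unit.

```python
from statistics import mean, pstdev


def to_scaled(raw_scores, new_mean, new_sd):
    """Linearly transform raw scores to a scale with the given mean and SD."""
    m = mean(raw_scores)
    s = pstdev(raw_scores)
    return [new_mean + new_sd * (x - m) / s for x in raw_scores]


# Hypothetical raw scores (illustration only).
raw = [10, 20, 30, 40, 50]

t_scores = to_scaled(raw, 50, 10)      # T score scale
iq_style = to_scaled(raw, 100, 15)     # mean 100, SD 15 scale

print([round(t, 1) for t in t_scores])
print([round(q, 1) for q in iq_style])
```

Because the transformation is linear, the rank order of examinees and the relative spacing between their scores are unchanged; only the numbers attached to the scores differ.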

Criterion-Referenced Testing

While the use of criterion-referenced tests has been recorded as far back as ancient imperial China (Elman, 1991), Robert Glaser’s (1963) article is generally credited as being the first published paper that used the term, “criterion-referenced testing.” However, there were a number of precursors to his seminal article. These include the works of J. C. Flanagan (1951), Joseph Hammock (1960a, 1960b), and Robert Ebel (1962), for example. Flanagan’s contribution was the chapter on scores and norms that appeared in the first edition of the Educational Measurement series. While arguing through most of the chapter that using scaled scores based on a fixed reference group is the best way to interpret scores, he also acknowledged that raw scores may have a special meaning for specific kinds of tests. For example, he stated, “The raw score is a very fundamental piece of information and should not be relinquished in favor of some other type of score without good reason” (p. 705). Flanagan mentioned teacher-made tests as one such illustration. Ebel (1962) cited Flanagan while calling for content standard scores and agreed that raw scores can prove quite useful in cases where we are interpreting what a student can do rather than where s/he stands in regard to his/her peers. In fact, he stated,

It is unfortunate, I think, that some specialists in measuring educational achievement have seemed to imply that knowing how many of his (sic) peers

a student can excel is more important than knowing what he can do to excel them (pp. 17–18)

He desired a type of testing that informs the testers what examinees can do as well as how they compare to their peers and reported his belief that scaled scores based on norm groups are not useful in describing what students can actually do. Ebel concluded his article with the statement, “Our purpose has been to emphasize the need for and to demonstrate the possibility of test scores which report what the examinee can do. Content-meaning in test scores supplements but does not replace normative meaning” (p. 25).

Hammock’s (1960a) paper, abstracted in American Psychologist (Hammock, 1960b), is quite similar to Glaser’s classic article. Hammock was interested in instructional research and argued that norm-referenced measures built to differentiate students were not useful when conducting research to determine what instructional strategies and technologies work best to improve student learning. He held that tests built as criteria for selection research or to grade students are built with traditional item analysis techniques, presented above in this chapter, that use item difficulties/means and item discrimination indices that spread students out rather than assuring that students have learned the necessary knowledge and skills. Hammock also admitted that both CRTs and NRTs can have content validity (what we would now call content evidence of validity). For example, he makes a point later made by Popham and Husek (1969): an item given after instruction that all students answer correctly would never be placed on a NRT, because it does not discriminate among test takers, but can be extremely useful on what he calls a criterion measure. For the kind of criterion measure that Hammock (1960a) envisioned, items should be selected for maximum discrimination between two groups that are equated for aptitude variables but are maximally different in instructional performance.
Two such groups would be (a) a group of persons who are highly experienced and competent in the area of criterion performance and (b) the same group before they received any instruction or practice in the area of performance (pp. 5–6)

He suggested that the best possible item would be one that no one answers correctly prior to instruction and all answer correctly after instruction. He therefore identified the primary difference between developing what became NRTs and CRTs as the item analysis procedures used to construct the measures. Robert Glaser (1963) provided an example of the kind of situation of concern to both Hammock and Glaser.

The scores obtained from an achievement test provide primarily two kinds of information. One is the degree to which the student has attained criterion performance, for example whether he (sic) can satisfactorily prepare an experimental report or solve certain kinds of work problems in arithmetic. The second kind of information that an achievement test score provides is the relative ordering of individuals with respect to their test performance. (p. 519)

He continued by stating that it is the frame of reference of comparison that constitutes the difference. He then named this new kind of testing, “What I shall call criterion-referenced measures depend upon an absolute standard of quality, while what I term norm-referenced measures depend upon a relative standard” (p. 519). As Hammock had done, Glaser stated, “Along such a continuum of attainment, a student’s score on a criterion-referenced measure provides explicit information as to what the individual can or cannot do.” Glaser continued his brief argument by agreeing that NRTs are good for correlational research; however, he then identified the problem that still haunts CRTs today:

We need to behaviorally specify minimum levels of performance that describe the least amount of end-of-course competence the student is expected to attain, or that he (sic) needs in order to go on to the next course in a sequence. (p. 520)

He closed this classic article by identifying problems that continue to be manifested in CRTs: “Many of us are beginning to recognize that the problems of assessing existing levels of competence and achievement… require some additional considerations” (p. 521).

After this initial call for CRTs, research moved in numerous directions. The educational and psychometric communities were excited by the notion of CRTs, which emerged about the same time that interest in both mastery learning and instructional technology grew.
In fact, early in the history of CRTs, there was considerable input from instructional theorists and technologists. During the 1970s and 1980s there was an explosion of studies and articles clarifying the psychometric nature and issues involved in CRT. Initially, much conversation clarified the differences between NRTs and CRTs because so much had been known about NRTs for so long. Jason Millman (1974) provided an early review of the literature. By 1978, William M. Gray had identified 57 different definitions of criterion-referenced testing in the literature. Simply counting differing definitions was not particularly meaningful, however. Anthony Nitko (1980), concerned that early CRTs struggled with appropriate definitions that differed along significant dimensions or facets, provided a

simplified classification scheme for considering CRTs. One dimension along which all CRTs could be divided was the domain upon which the test was based. He identified CRTs as based upon well-defined, ill-defined, and undefined domains. Then he differentiated those with well-defined domains depending upon whether the learning assessed was ordinal or hierarchical, for example, where later learning depends on successful earlier learning, as is found in various learning hierarchies. (Other types of ordering were also possible.) Glaser and Nitko (1971) differentiated the two types of testing in the following manner:

As Popham and Husek (1969) indicated, the distinction between a norm-referenced test and a criterion-referenced test cannot be made by simple inspection of a particular instrument. The distinction is found by examining (a) the purpose for which the test was constructed, (b) the manner in which it was constructed, (c) the specificity of the information yielded about the domain of instructionally relevant tasks, (d) the generalizability of test performance information to the domain, and (e) the use to be made of the obtained test information. Since criterion-referenced tests are specifically designed to provide information that is directly interpretable in terms of specified performance standards, this means that performance standards must be established prior to test construction and that the purpose of testing is to assess an individual’s status with respect to these standards. (p. 654)

The fact that CRTs were used long before an accepted term was posited led to the need to formally differentiate between NRTs and CRTs. (Nitko, 1980, traced the history of CRT-type tests to 1864.) James Popham (1978) stated the following, “A criterion-referenced test is used to ascertain an individual’s status with respect to a well-defined behavioral domain” (p. 93).
Popham also called for objectives and skills to be carefully defined within the curriculum and instruction and items to be written to closely match those objectives. Within the framework that CRTs are designed to measure a person’s performance on tasks within a specific domain, Nitko (1980) gave useful guidelines for categorizing CRTs based upon the test, how the test was produced, and the context in which author’s discussed the test. CRTs had a role in innovations such as instructional technology and mastery learning. Hambleton, Swaminathan, Algina, and Coulson (1978) reported that they had found over 600 articles in the literature on the topic of CRTs, all since the term was named in 1963 by Glaser. Among the topics that were discussed in articles, chapters, and books were the technical requirements of those new tests, especially their reliability; how criterion-referenced tests should be built; how long criterion-referenced tests needed to be; and how to set performance standards. Each of these issues is described briefly in the following sections. The literature, however, is so voluminous, only a few articles are cited in each instance.

Reliability of Criterion-referenced Tests (CRTs)

One of the first topics that appeared regarding CRTs was that of reliability, especially internal consistency reliability. Test-retest and alternate-forms reliability could continue to be used much as they were with NRTs, as long as there was time between the two measurements, little or no instruction occurred between them, and, in the case of alternate-forms reliability, two or more forms were available. Reliability of CRTs, however, has two distinct foci. The first is the reliability of scores, much as it is in NRTs, except that scores, rather than being interpreted in relation to percentile ranks, are interpreted as “domain scores” representing how much of the domain a given student knows. The second relates to consistency of mastery status. This latter reliability, for example, could be interpreted from an alternate-form perspective as the percentage of students who are classified with the same status when taking two forms of a measure. Popham and Husek (1969) demonstrated that a well-built CRT might have scores close to 0 (few or no items correct) prior to instruction and close to perfect scores after instruction (see also Millman & Popham, 1974). In either case, score variance would be minimal and the internal consistency reliability would be zero or close to it.

Probably the first approach to the reliability of CRTs came from Livingston (1972). He treated the purpose of a CRT as discriminating each examinee’s estimated domain score from the passing or mastery score. In the standard reliability formula, he essentially substituted the passing score for the sample mean. Thus, a test taker who was far from the cut score on both testings would raise reliability, whereas a student who was below the passing score on one occasion and above it on the other would reduce it. Moreover, if the group mean is distant from the passing score, then the test reliability is likely to be high; if they are close, it is likely to be low. Livingston’s approach spurred considerable other work. For example, Brennan and Kane (1977) adapted this approach from a generalizability perspective, but their index did not achieve much use on its own; how far a score was from the cut score is less important than the likelihood that a person would achieve the same result on two testings (or in their responses throughout a given test). Hambleton and Novick (1973) suggested that the similarity of mastery decisions defined the reliability of CRTs and therefore proposed the proportion of consistent decisions made in two testings as an index of reliability. Swaminathan, Hambleton, and Algina (1974) then recommended that the Hambleton-Novick index be adjusted for chance agreements and recommended using kappa. Finally, in keeping with the original intent of CRTs, a goal is not just to identify consistency of mastery status, but also to provide it instructional objective by instructional objective. Swaminathan, Hambleton, and Algina (1975) provided a decision-theoretic approach. Subkoviak (1976, 1978) developed an internal consistency approach for estimating reliability, and Berk (1980, 1984b) discussed recommendations for estimating the reliability of CRTs.
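The three indices just described can be illustrated with a brief computational sketch. It is offered only as an illustration under stated assumptions: the function names are hypothetical, the Livingston-style coefficient uses the commonly cited form (r·σ² + (μ − C)²) / (σ² + (μ − C)²), with r a conventional reliability estimate and C the cut score, and the data in any call are invented.

```python
from statistics import mean, pvariance


def livingston_k2(reliability, scores, cut):
    """Livingston-style coefficient (assumed commonly cited form).

    Deviations are taken from the cut score rather than the sample mean,
    so the index rises as the group mean moves away from the cut.
    """
    var = pvariance(scores)                # observed-score variance
    shift = (mean(scores) - cut) ** 2      # squared distance of mean from cut
    return (reliability * var + shift) / (var + shift)


def agreement_and_kappa(form_a, form_b):
    """Decision consistency for two testings.

    form_a and form_b are equal-length lists of booleans
    (True = classified as master). Returns (p0, kappa), where p0 is the
    proportion of consistent decisions and kappa corrects p0 for chance.
    Kappa is undefined when chance agreement equals 1.
    """
    n = len(form_a)
    p0 = sum(a == b for a, b in zip(form_a, form_b)) / n
    pa = sum(form_a) / n                   # proportion of masters, first testing
    pb = sum(form_b) / n                   # proportion of masters, second testing
    chance = pa * pb + (1 - pa) * (1 - pb)
    kappa = (p0 - chance) / (1 - chance)
    return p0, kappa
```

With a cut score equal to the group mean, the first function simply returns the conventional reliability; moving the cut away from the mean raises the value, which mirrors the point made above about group means that are distant from the passing score.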

The Validity of CRTs

Edward Haertel (1985) and Ronald Hambleton (1984b) both described comprehensive efforts to validate criterion-referenced tests. In general, the processes are much like the careful job analyses and content validation studies done in industrial psychology. The objectives of the curriculum must be thoroughly studied, the behaviors resulting from successful instruction in the objectives identified, the items written to match the objectives carefully, the judgments of curriculum experts gathered on the degree to which the matches are effective, and empirical research done to assure that these aspects have all been accomplished. With the development of statewide K–12 achievement tests, validity is maintained by measuring statewide content objectives that are taught at each grade level through a curriculum based upon those objectives. Alignment studies, which are really studies to collect content evidence of validity, are often performed; in them, teachers and others match items to state standards to assure that a CRT measures the standards in the state curriculum. Because the score distributions of CRTs differ from those of NRTs, correlational research is generally less valuable for CRTs than for NRTs. There are examples where such studies may be useful, such as correlations between two CRTs built for similar curricula or correlations with teacher judgments.

Constructing Criterion-Referenced Tests

Considerable instructional literature has also developed regarding how CRTs should be constructed. Topics covered in these treatments include identifying objectives and writing items to match them (Popham, 1974, 1980, 1984), writing the items themselves (Davis & Diamond, 1974; Roid, 1984), and conducting item analyses (Berk, 1980, 1984a). Domains that represent the content to be covered instructionally and in testing must be extremely clear, with objectives specified for every aspect of the domain. Items must be written to match objectives more closely than is typically the case in norm-referenced measures. Item analysis includes both judgmental and empirical strategies. Judgmental strategies use scales on which curriculum experts evaluate the closeness of the item to the instructional objective. Empirical techniques may compare uninstructed vs. instructed groups, pre-instructed vs. post-instructed groups, or groups previously identified as masters and non-masters (Berk, 1984a). Differences in performance across these groups must be clear. Berk provides a number of indices that can be used as item discrimination values for the three comparisons above. Item difficulty levels should be considered, and uninstructed group values should be low so that differences from instructed groups can be identified correctly.
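The empirical group comparisons described above can be sketched in a few lines. The sketch assumes the simplest possible discrimination value, the difference in proportion correct between an instructed and an uninstructed group; it is used here as an illustrative stand-in and is not claimed to reproduce any particular index presented by Berk. The function names and the 0/1 response layout are hypothetical.

```python
def item_difficulty(responses):
    """Proportion correct per item.

    responses is a list of per-student lists of 0/1 item scores,
    all of the same length.
    """
    n_students = len(responses)
    return [sum(item) / n_students for item in zip(*responses)]


def discrimination_by_group(uninstructed, instructed):
    """Per-item difference in proportion correct (instructed minus uninstructed).

    Values near 1 suggest an item that is sensitive to instruction;
    values near 0 suggest an item that does not separate the groups.
    """
    p_un = item_difficulty(uninstructed)
    p_in = item_difficulty(instructed)
    return [round(after - before, 3) for before, after in zip(p_un, p_in)]
```

The same function could be applied to pre-instruction vs. post-instruction data or to non-masters vs. masters, since all three comparisons contrast a lower-performing with a higher-performing group.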

Addressing Test Length

Millman (1973) and Hambleton (1984a) addressed the issue of test length, which is somewhat different for CRTs than for NRTs. For an NRT, one must determine length based on the time available, the reliability needed given the stakes of the decisions to be made with the scores, and the bandwidth-fidelity issue. This last issue concerns whether it is better to assess more constructs in a less reliable fashion or fewer constructs in a more reliable manner. For a CRT, scores are less important than the mastery level determination. Therefore, in some cases, such as in computer-adaptive testing, decisions can be made with the level of precision needed. For example, if one test taker answers a number of very difficult questions correctly to start the test, the system may determine that this test taker has certainly achieved mastery. If another test taker, however, is near the cut score, that test taker will need more questions to achieve decision confidence. Millman presented a binomial model for making mastery level determinations; however, the assumptions needed for it may not always be met. Wilcox (1976) also used a binomial model but established an indifference zone above and below the passing score. Given a variable number of items for different candidates, each candidate would answer questions until his or her score fell outside the zone, at which point the candidate passed or failed depending on whether the score was above or below the zone. A variant on this method is also mentioned by Millman. Novick and Lewis (1974) suggested a Bayesian method to determine the probability that a given test score is higher than the passing or mastery score. In this case, prior knowledge of each test taker is normally required and is made a component of the equation. Such information may not be available or appropriate in some situations (e.g., licensure testing). One issue with many mastery tests is that they are fixed-form measures with a set number of items. Presently, on most such tests, one either passes or fails. It might be helpful in some high-stakes situations if the decision could be deferred until more information, such as a second testing, is available.
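The logic of a binomial model with an indifference zone can be sketched as follows. This is a simplified fixed-length illustration rather than a reproduction of Millman's or Wilcox's procedures: it assumes items are sampled at random from the domain, so that the number correct is binomial, and it searches for the shortest test and cut score that keep both kinds of misclassification below a chosen level at the edges of the zone. All names and default values are hypothetical.

```python
from math import comb


def pass_probability(n_items, cut, domain_score):
    """P(number correct >= cut) for a test taker whose true proportion
    of the domain known is domain_score, under a binomial model."""
    return sum(
        comb(n_items, k) * domain_score**k * (1 - domain_score) ** (n_items - k)
        for k in range(cut, n_items + 1)
    )


def shortest_test(zone_low, zone_high, max_error, max_items=200):
    """Smallest (n_items, cut) such that a test taker at the lower edge of
    the indifference zone passes, and one at the upper edge fails, with
    probability no greater than max_error. Returns None if no test of at
    most max_items items satisfies both conditions."""
    for n_items in range(1, max_items + 1):
        for cut in range(1, n_items + 1):
            false_pass = pass_probability(n_items, cut, zone_low)
            false_fail = 1 - pass_probability(n_items, cut, zone_high)
            if false_pass <= max_error and false_fail <= max_error:
                return n_items, cut
    return None
```

Narrowing the zone or lowering the tolerated error rate lengthens the required test, which is the trade-off the binomial approaches were designed to make explicit.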

Setting Performance Standards

Perhaps the most studied area of CRTs is setting performance standards. Some excellent books (e.g., Cizek, 2012; Cizek & Bunch, 2007; Kaftandjieva, 2010; and Zieky, Perie, & Livingston, 2008) describe many of the methods that have been developed, used, and studied (see also Zieky, 2012). It is well beyond the capacity of this chapter to consider these many techniques individually. Rather, some general approaches are touched upon and citations already mentioned permit others to learn more about these methods. Almost all the methods deal with setting passing scores on tests. They are not very behavioral in the sense that Glaser, Hammock, and Popham wanted them to be in the early days of CRTs; they do not generally provide objective-by-objective feedback. Some points should be noted for all or almost all such methods. All techniques involve judgment. Some have perceived the use of judgment to be somewhat arbitrary (e.g., Glass, 1978); others treat professional judgment in the same manner that they do a diagnosis from a physician or an opinion from an accountant, as professional. Nevertheless, all judgments can be incorrect to an extent. Judges must weigh the costs of making bad decisions. Some licensing situations make this clear. None of us would like to see airline pilots or surgeons passing a test to become licensed as a result of a Type One statistical error. On the other hand, a major U.S. city in the 1950s needed to hire several thousand teachers weeks before the school year started; they needed to relax standards to hire teachers so that there were competent adults in each classroom. Some methods involve reviewing test items and tests, others involve comparing the quality of products or the historical performances of people. Still others, not discussed here, are based on profiles of tested performance, and others on groups of people. The methods briefly mentioned below are provided simply to give readers a sense of how the methods work; hundreds of variations of techniques exist, and a similar number of research studies have examined them. Most of the methods involve judges who are competent to make evaluations of various types. There is literature on who such judges should be (Loomis, 2012; Raymond & Reid, 2001). Some normative data are often quite useful in determining cut scores as teachers and others sometimes have thoughts on what percentage of students or license applicants are likely to meet standards.

Test-based judgments

Test-based judgments by judges are probably the most common techniques used in education as well as licensure and certification situations. In the case of educational decision making, panels are most often composed of teachers, sometimes with some administrators and members of the public involved as well. One of the most common methods is the so-called Angoff (1971) method, which was based on a few sentences in a classic chapter written for other purposes. The Angoff method trains judges to imagine a borderline or minimally competent student who would have just barely passed the test. Then each judge works through the test, item by item, and decides what proportion of minimally qualified test takers would get each item correct. After all judges have finished, several steps can be taken. They can receive their responses and those of the other judges, they can discuss different judgments amongst the group, and they sometimes receive and review data that may show them how many candidates would pass with various passing scores set on the test. Then they often have a second and even a third chance to re-make their judgments item by item. Often the final judgments of the raters are averaged to result in an at least preliminary passing score, although in some cases consensus discussions lead to a conclusion.
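The arithmetic behind an Angoff-type passing score is simple and can be sketched briefly. The sketch assumes the common practice in which each judge's item ratings are summed to give that judge's implied passing score and the judges' sums are then averaged; the function name and the data layout are hypothetical, and real applications typically involve several rounds of ratings.

```python
from statistics import mean


def angoff_cut_score(ratings):
    """Preliminary passing score from Angoff-type ratings.

    ratings[j][i] is judge j's estimate of the proportion of borderline
    test takers who would answer item i correctly. Returns a tuple of
    (per-judge implied cut scores, panel average).
    """
    judge_cuts = [sum(judge) for judge in ratings]   # expected raw score per judge
    return judge_cuts, mean(judge_cuts)
```

For a four-item test rated by two judges, for example, `angoff_cut_score([[0.5, 0.75, 0.25, 1.0], [0.75, 0.75, 0.5, 1.0]])` yields the two judges' implied cut scores and their average; the spread among the per-judge values is one piece of information a panel might discuss between rounds.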

A second test-based method mentioned here is the Bookmark technique (Lewis et al., 2012). This technique was developed for tests that have been extensively pre-tested and that are scored using item response theory. Judges are assembled as they are in the Angoff procedure. The judges are again instructed to consider a borderline passing test taker or student. However, in this case they each receive a book or similar document that has the items ordered by difficulty from the easiest item to the hardest. The job of the judge is to read through the book and to place a bookmark (or line stroke) between the last item that they believe a borderline student would have at least a specified probability of answering correctly and the first item that they believe the student would have less than that specified probability of answering correctly. Once again, after all judges have completed the process, they may discuss their differing results, they may receive information on the passing rates at different cut scores, and they may participate in further rounds in which they can change their initial judgments. Again, summary results can be averaged or decided by group consensus.
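How a bookmark placement might be translated into a cut score can be sketched under several explicit assumptions, none of which is specified in the description above: a Rasch model, a response probability criterion of 0.67, a cut located at the last item before each judge's bookmark, and a panel value taken as the median across judges. Operational procedures differ on each of these choices, and all names below are hypothetical.

```python
from math import log
from statistics import median


def rp_location(difficulty, rp=0.67):
    """Ability at which a Rasch item of the given difficulty is answered
    correctly with probability rp (assumed response probability criterion)."""
    return difficulty + log(rp / (1 - rp))


def bookmark_cut(difficulties, bookmarks, rp=0.67):
    """Panel cut score on the ability scale.

    difficulties: Rasch item difficulties (any order; sorted internally
                  from easiest to hardest, as in the ordered booklet).
    bookmarks:    for each judge, the number of items placed before the
                  bookmark (1 .. number of items).
    The judge-level cut is taken at the last item before the bookmark,
    and the panel cut is the median of the judge-level cuts.
    """
    ordered = sorted(difficulties)
    judge_cuts = [rp_location(ordered[k - 1], rp) for k in bookmarks]
    return median(judge_cuts)
```

The resulting value lies on the ability scale and would still need to be converted to a reported score through the test's scoring tables.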

Person or product-based methods

The next two methods are based on students, not judgments regarding test items or tests. In these cases, teachers who know students evaluate the students in certain ways and the scores that those students have earned on the test are used to help set passing scores. Teachers or other judges may also lack long-term knowledge of students; in such cases, they might be given a product or products of a student’s work and from that set make judgments about the student. [As an editorial aside, I have always liked these methods when used with teachers. I believe that teachers know how to evaluate students better than how to estimate how many borderline students would answer a question correctly, for example. These methods are better suited for educational settings and rarely work in licensure and certification because judges are not likely to know the skill levels of test takers for those decisions.]

In the Borderline Group Method, teachers or other judges identify students who they believe fit that category. Such a data collection could be conducted with teachers together in a group or with many more teachers via survey. They need not assess all students in this way; they need only identify those who they believe are borderline students in the knowledge and skills involved on the test. Once such data are collected, a simple average test score for the borderline students serves as the cut score. If a survey is used rather than a group meeting, much attention must be paid to defining what a borderline student is; otherwise, different teachers are likely to hold different opinions, which is a concern.

A second method in which judgments about students are critical is the contrasting group method. In this method, teachers can be asked to identify students who clearly have met the requirements to pass and those who clearly should not pass. The scores of these groups of students are then averaged and the two averages considered. Typically, a point between the two averages is used as the passing or cut score. It could be their average, or it could be a score that maximally differentiates the two groups.
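Both person-based methods reduce to simple computations once the judgments are in hand, as the following sketch suggests. The function names are hypothetical, and the "maximally differentiating" score is operationalized here, as one assumption among several possible, as the observed score that misclassifies the fewest identified students when everyone at or above it passes.

```python
from statistics import mean


def borderline_group_cut(borderline_scores):
    """Cut score as the average test score of students judged borderline."""
    return mean(borderline_scores)


def contrasting_groups_midpoint(pass_scores, fail_scores):
    """Cut score as the average of the two group means."""
    return (mean(pass_scores) + mean(fail_scores)) / 2


def contrasting_groups_best_cut(pass_scores, fail_scores):
    """Observed score that misclassifies the fewest students, where a
    student passes if his or her score is at or above the cut.
    Ties are resolved in favor of the lowest such score."""
    candidates = sorted(set(pass_scores) | set(fail_scores))

    def errors(cut):
        false_fails = sum(score < cut for score in pass_scores)
        false_passes = sum(score >= cut for score in fail_scores)
        return false_fails + false_passes

    return min(candidates, key=errors)
```

The two contrasting-groups functions need not agree; when the groups overlap or differ greatly in spread, the midpoint and the error-minimizing score can diverge, and the panel would need to weigh which kind of misclassification is more costly.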

Concluding Thoughts

For many years, NRTs defined modern testing. Scores were interpreted by knowing where one stood in a distribution of peers. Such testing is quite valuable when considering student acceptances to higher education and other rewards and for employee selection. NRTs have not disappeared, nor will they do so in the future. Such tests have value for many kinds of decisions. CRTs have dramatically changed many kinds of educational testing away from a purely norm-based approach. When the formalization of CRTs appeared, it excited many in testing and led to some changes in testing, and a huge research literature developed quickly. It blended well with changes in instruction and instructional technology. It is clear that different decisions to be made using tests require different kinds of tests, and the problems solved by NRTs can only rarely be solved using CRTs and vice versa. The key to the success of either lies in validation results, whether predictive or content-oriented as appropriate.

Note 1 The Author would like to acknowledge the thorough reading and editorial comments by Anthony Nitko and Karen Alexander, both of whom improved the chapter considerably.

References

Academic Excellence (n.d.). California Achievement Test. CAT Brochure. Academic Excellence. https://portal.academicexcellence.com/local/pdfs/CAT_Flyer_0317.pdf.
Anastasi, A. (1958). Differential psychology (3rd ed.). New York, NY: Macmillan.
Anastasi, A., & Urbina, S. (1997). Psychological testing (7th ed.). Upper Saddle River, NJ: Prentice Hall.
Angoff, W. H. (1971). Scales, norms, and equivalent scores. In R. L. Thorndike (Ed.), Educational measurement (2nd ed., pp. 508–600). Washington, DC: American Council on Education.
Berk, R. A. (1980). Item analysis. In R. A. Berk (Ed.), Criterion-referenced measurement: The state of the art (pp. 49–79). Baltimore, MD: Johns Hopkins University Press.
Berk, R. A. (1984a). Conducting the item analysis. In R. A. Berk (Ed.), A guide to criterion-referenced test construction (pp. 97–143). Baltimore, MD: Johns Hopkins University Press.
Berk, R. A. (1984b). Selecting the index of reliability. In R. A. Berk (Ed.), A guide to criterion-referenced test construction (pp. 231–266). Baltimore, MD: Johns Hopkins University Press.
Bingham, W. V. (1937). Aptitudes and aptitude testing. New York, NY: Harper.

Binet, A., & Henri, V. (1896). La psychologie individuelle. L’Année Psychologique, 2, 411–465.
Binet, A., & Simon, T. (1905). New methods for the diagnosis of the intellectual level of subnormals. L’Année Psychologique, 11, 163–191.
Brennan, R. L., & Kane, M. T. (1977). An index of dependability for mastery tests. Journal of Educational Measurement, 14, 277–289.
California Test Bureau (CTB) (2012). Our Heritage: Celebrating 85 Years. https://web.archive.org/web/20120320121444/http://www.ctb.com/ctb.com/control/ctbLandingPageViewAction?landngPageId=25190.
Cattell, J. McK. (1890). Mental tests and measurements. Mind, 15, 373–381.
Cattell, J. McK., & Farrand, L. (1896). Physical and mental measurements of the students of Columbia University. Psychological Review, 3, 618–658.
Cizek, G. J. (Ed.) (2012). Setting performance standards: Concepts, methods and perspectives (2nd ed.). New York, NY: Routledge.
Cizek, G. J., & Bunch, M. B. (2007). Standard setting: A guide to establishing and evaluating performance standards on tests. Thousand Oaks, CA: Sage.
Cronbach, L. J., & Meehl, P. E. (1955). Construct validity in psychological tests. Psychological Bulletin, 52, 281–302.
Data Recognition Corporation (n.d.). TerraNova (3rd ed.). Overview Brochure. Data Recognition Corporation. http://terranova3.com/PDFs/TerraNova3_Overview_Brochure.pdf.
Davis, F. B. (1952). Item analysis in relation to educational and psychological testing. Psychological Bulletin, 49, 97–121.
Davis, F. B., & Diamond, J. J. (1974). The preparation of criterion-referenced tests. In C. W. Harris, M. C. Alkin, & W. J. Popham (Eds), Problems in criterion-referenced measurement (pp. 116–138). CSE monograph series in evaluation, No. 3. Los Angeles, CA: Center for the Study of Evaluation, University of California.
DuBois, P. H. (1970). A history of psychological testing. Boston, MA: Allyn & Bacon.
Ebel, R. L. (1962). Content standard test scores. Educational and Psychological Measurement, 22(1), 15–25.
Elman, B. A. (1991). Political, social, and cultural reproduction via civil service examinations in late imperial China. The Journal of Asian Studies, 50(1), 7–28.
Flanagan, J. C. (1951). Units, scores and norms. In E. F. Lindquist (Ed.), Educational measurement (pp. 695–763). Washington, DC: American Council on Education.
Geisinger, K. F. (2005). The testing industry, ethnic minorities, and individuals with disabilities. In R. P. Phelps (Ed.), Defending standardized testing (pp. 187–203). Mahwah, NJ: Erlbaum.
Glaser, R. (1963). Instructional technology and the measurement of learning outcomes: Some questions. American Psychologist, 18, 519–521.
Glaser, R., & Nitko, A. J. (1971). Measurement in learning and instruction. In R. L. Thorndike (Ed.), Educational measurement (2nd ed., pp. 625–670). Washington, DC: American Council on Education.
Glass, G. V. (1978). Standards and criteria. Journal of Educational Measurement, 15, 237–261.
Goddard, H. H. (1908). The Binet and Simon tests of intellectual capacity. The Training School, 5, 3–9.
Goddard, H. H. (1910). A measuring scale of intelligence. The Training School, 6, 146–155.
Green, B. F. (1954). A note on item selection for maximum validity. Educational and Psychological Measurement, 14, 161–164.

Gray, W. M. (1978). Comparison of Piagetian theory and criterion-referenced measurement. Review of Educational Research, 48, 223–248.
Guilford, J. P. (1946). New standards for test evaluation. Educational and Psychological Measurement, 6, 427–438.
Haertel, E. (1985). Construct validity and criterion-referenced testing. Review of Educational Research, 55, 23–46.
Hambleton, R. K. (1984a). Determining test length. In R. A. Berk (Ed.), A guide to criterion-referenced test construction (pp. 144–168). Baltimore, MD: Johns Hopkins University Press.
Hambleton, R. K. (1984b). Validating the test scores. In R. A. Berk (Ed.), A guide to criterion-referenced test construction (pp. 199–230). Baltimore, MD: Johns Hopkins University Press.
Hambleton, R. K., & Novick, M. R. (1973). Toward an integration of theory and method for criterion-referenced tests. Journal of Educational Measurement, 10, 159–170.
Hambleton, R. K., Swaminathan, H., Algina, J., & Coulson, D. B. (1978). Criterion-referenced testing and measurement: A review of technical issues and developments. Review of Educational Research, 48, 1–47.
Hammock, J. (1960a). Criterion measures: Instruction vs. selection research. Paper presented at the annual meeting of the American Psychological Association, Chicago, IL, Sept., 1960.
Hammock, J. (1960b). Criterion measures: Instruction vs. selection research. [Abstract]. American Psychologist, 15, 435.
Henrysson, S. (1971). Gathering, analyzing, and using data on test items. In R. L. Thorndike (Ed.), Educational measurement (2nd ed., pp. 130–159). Washington, DC: American Council on Education.
Hiskey, M. S. (1966). Hiskey-Nebraska Test of Learning Aptitude. Lincoln, NE: Union College Press.
Kaftandjieva, F. (2010). Methods for setting cut scores in criterion-referenced achievement tests: A comparative analysis of six recent methods with an application to tests of reading in EFL. Arnhem, Netherlands: Cito/European Association for Language Testing and Assessment.
Kane, M. T. (2006). Validation. In R. L. Brennan (Ed.), Educational measurement (4th ed., pp. 17–64). Westport, CT: American Council on Education/Praeger.
Kane, M. (2010). Validity and fairness. Language Testing, 27(2), 177–182. https://doi.org/10.1177/0265532209349467.
Kelley, T. L. (1927). Interpretation of educational measurement. Yonkers, NY: World Book.
Kelley, T. L., Ruch, G. M., & Terman, L. M. (1922). Stanford Achievement Test. Yonkers, NY: World Book.
Lee, J. M. (1936). A guide to measurement in secondary schools. New York, NY: Appleton-Century.
Lemann, N. (1999). The big test: The secret history of the American meritocracy. New York, NY: Farrar, Straus and Giroux.
Lewis, D. M., Mitzel, H. C., Mercado, R. L., & Schulz, E. M. (2012). The bookmark standard setting procedure. In G. J. Cizek (Ed.), Setting performance standards: Concepts, methods and perspectives (2nd ed., pp. 225–254). New York, NY: Routledge.
Livingston, S. A. (1972). Criterion-referenced applications of classical test theory. Journal of Educational Measurement, 9, 13–26.
Loomis, S. C. (2012). Selecting and training standard setting participants: State of art policies and procedures. In G. J. Cizek (Ed.), Setting performance standards: Concepts, methods and perspectives (2nd ed., pp. 107–134). New York, NY: Routledge.
Magnusson, D. (1967). Test theory. Reading, MA: Addison-Wesley.

Messick, S. (1998). Test validity: A matter of consequence. Social Indicators Research, 45, 35–44.
Millman, J. (1973). Passing scores and test lengths for domain-referenced measures. Review of Educational Research, 43, 205–216.
Millman, J. (1974). Criterion-referenced testing. In W. J. Popham (Ed.), Evaluation in education: Current applications (pp. 309–397). Berkeley, CA: McCutchan.
Millman, J., & Popham, W. J. (1974). The issue of item and test variance for criterion-referenced tests: A clarification. Journal of Educational Measurement, 11, 137–138.
Nitko, A. J. (1980). Distinguishing the many varieties of criterion-referenced tests. Review of Educational Research, 50(3), 461–485. https://www.jstor.org/stable/1170441.
Novick, M. R., & Lewis, C. (1974). Prescribing test length for criterion-referenced measurement. In C. W. Harris, M. C. Alkin, & W. J. Popham (Eds), Problems in criterion-referenced measurement (pp. 139–158). CSE monograph series in evaluation, No. 3. Los Angeles, CA: Center for the Study of Evaluation, University of California.
Popham, W. J. (Ed.) (1971). Criterion-referenced measurement: An introduction. Englewood Cliffs, NJ: Educational Technology Publications.
Popham, W. J. (1974). Selecting objectives and generating test items for objectives-based tests. In C. W. Harris, M. C. Alkin, & W. J. Popham (Eds), Problems in criterion-referenced measurement (pp. 13–25). CSE monograph series in evaluation, No. 3. Los Angeles, CA: Center for the Study of Evaluation, University of California.
Popham, W. J. (Ed.) (1978). Criterion-referenced measurement. Englewood Cliffs, NJ: Prentice-Hall.
Popham, W. J. (1980). Domain specification strategies. In R. A. Berk (Ed.), Criterion-referenced measurement: The state of the art (pp. 15–31). Baltimore, MD: Johns Hopkins University Press.
Popham, W. J. (1984). Specifying the domain of content or behaviors. In R. A. Berk (Ed.), A guide to criterion-referenced test construction (pp. 29–48). Baltimore, MD: Johns Hopkins University Press.
Popham, W. J., & Husek, T. R. (1969). Implications of criterion-referenced measurement. Journal of Educational Measurement, 6, 1–9.
Price, L. R. (2017). Psychometric methods: Theory into practice. New York, NY: The Guilford Press.
Raymond, M. R., & Reid, J. B. (2001). Who made thee a judge? Selecting and training participants for standard setting. In G. J. Cizek (Ed.), Setting performance standards: Concepts, methods and perspectives (pp. 119–157). Mahwah, NJ: Erlbaum.
Rice, J. M. (1897). The futility of the spelling grind. The Forum, 23, 163–172.
Rice, J. M. (1902). Educational research: A test in arithmetic. The Forum, 34, 281–297.
Rice, J. M. (1903). Educational research: The results of a test in language. The Forum, 35, 269–293.
Roid, G. H. (1984). Generating the test items. In R. A. Berk (Ed.), A guide to criterion-referenced test construction (pp. 49–77). Baltimore, MD: Johns Hopkins University Press.
Schwab, D. P., Heneman, H. G., & Decotiis, T. A. (1975). Behaviorally anchored rating scales: A review of the literature. Personnel Psychology, 28, 549–562.
Smith, P. C., & Kendall, L. M. (1963). Retranslation of expectations: An approach to the construction of unambiguous anchors for rating scale. Journal of Applied Psychology, 47, 149–155.
Starch, D. (1916). Educational measurement. New York, NY: Macmillan.
Stern, W. (1911). Die differentielle Psychologie in ihren methodischen Grundlagen. Leipzig: Barth.

Subkoviak, M. J. (1976). Estimating reliability from a single administration of a mastery test. Journal of Educational Measurement, 13, 265–276.
Subkoviak, M. J. (1978). Empirical investigations of procedures for estimating reliability for mastery tests. Journal of Educational Measurement, 15, 111–116.
Swafford, K. L. (2007). The use of standardized test scores: An historical perspective [Unpublished doctoral dissertation]. University of Georgia.
Swaminathan, H., Hambleton, R. K., & Algina, J. (1974). Reliability of criterion-referenced tests: A decision-theoretic formulation. Journal of Educational Measurement, 11, 263–268.
Swaminathan, H., Hambleton, R. K., & Algina, J. (1975). A Bayesian decision-theoretic procedure for use with criterion-referenced tests. Journal of Educational Measurement, 12, 87–98.
Terman, L. M. (1906). Genius and stupidity. Pedagogical Seminary, 13, 307–373.
Terman, L. M. (1916). The measurement of intelligence. Boston, MA: Houghton Mifflin.
Thorndike, E. L. (1904). An introduction to the theory of mental and social measurements. New York, NY: Science Press.
Thorndike, E. L. (1910). Handwriting. Teachers College Record, 11(2), 86–151.
Thorndike, E. L. (1914). The measurement of ability in reading: Preliminary scales and tests. Teachers College Record, 15, 202–277.
Thurstone, L. L. (1932). The reliability and validity of tests. Ann Arbor, MI: Edwards.
Wechsler, D. (1939). The measurement of adult intelligence. Baltimore, MD: Williams & Wilkins.
Wilcox, R. R. (1976). A note on the length and passing score of a mastery test. Journal of Educational Statistics, 1, 359–364.
Whipple, G. M. (1910). Manual of mental and physical tests. Baltimore, MD: Warwick and York.
Wissler, C. (1901). The correlation of mental and physical tests. Psychological Review, 3, 1–62.
Wood, B. D. (1923). Measurement in higher education. New York, NY: Harcourt, Brace and World.
Yoakum, C. S., & Yerkes, R. M. (1920). Army mental tests. New York, NY: H. Holt.
Zieky, M. J. (2012). So much has changed: An historical overview of setting cut scores. In G. J. Cizek (Ed.), Setting performance standards: Concepts, methods and perspectives (2nd ed., pp. 15–32). New York, NY: Routledge.
Zieky, M. J., Perie, M., & Livingston, S. A. (2008). Cutscores: A manual for setting standards of performance on educational and occupational tests. Princeton, NJ: Educational Testing Service.

4 THE ROLE OF THE FEDERAL GOVERNMENT IN SHAPING EDUCATIONAL TESTING POLICY AND PRACTICE

Michael B. Bunch¹

Educational measurement professionals are problem solvers, and the problems we solve are usually defined by others. For the past century, federal policy and initiatives have shaped the choice of problems educational measurement specialists have had to solve and the opportunities they have had to advance theory and practice. From the establishment of the Committee on the Examination of Recruits in 1917 to the passage of the Every Student Succeeds Act in 2015, the federal government has provided resources and direction for advances in educational testing. This chapter traces the role of the government of the United States in shaping testing policy and practice over the past century. The enactment of the Elementary and Secondary Education Act of 1965 (ESEA; PL 89–10) and its various reauthorizations slowly but steadily changed the face of educational assessment in this country. But the federal role in testing began long before 1965, in 1917 to be precise.

Army Alpha and Beta Tests

World War I brought into sharp focus the need to find out very quickly the mental capacities of large numbers of soldiers. For some, this military necessity was also an opportunity to test new theories and engage in social engineering. To these ends, a small but powerful group of American test practitioners secured government sanction to implement their tests on a very large scale. These individuals kept in close touch with one another after their brief involvement in the war effort, and continued for many years to drive innovations in testing.

Robert Yerkes had petitioned the War Department in 1916 to let psychologists test recruits, as Henry Goddard was doing in Canada (Reed, 1987). When America entered the war in 1917, the Surgeon General established the Committee on the Examination of Recruits – headed up by Yerkes and including these seven men: Walter Van Dyke Bingham, Edgar A. Doll, Henry H. Goddard, Thomas H. Haines, Lewis M. Terman, Frederick L. Wells, and Guy M. Whipple. Yerkes was also president of the American Psychological Association (APA) in 1917. It was his position as APA president, along with his connections with the National Research Council and tireless crusading for the employment of psychology in as many human endeavors as possible, that made the Committee possible.

Together, these eight men developed the Army Alpha and Beta tests and oversaw the mental testing of over 1.75 million men during the summer of 1917. They set the stage for generations of ability testing. The Committee modified portions of the Binet intelligence tests (cf. Geisinger, Ch. 3, this volume) to a group-administered format. This format permitted the Committee to test hundreds of men at a time with the Army Alpha test for literate, English-speaking men and the Army Beta test for men who were illiterate or whose first language was other than English.

Testing of School Children post World War I

By 1918, every state required school attendance through high school. Also, between 1880 and 1920, the U.S. population had increased from just over 50 million to more than 105 million (U.S. Department of Commerce, 1975). As a result of these two factors, schools were becoming very crowded. One proposed solution to overcrowding was to create smaller groups of students of homogeneous ability to increase the efficiency of teaching. This differentiation was prodded along by the efficiency movement (Taylor, 1915) that had marked American business in the late 1800s and early 1900s as well as the growing fascination with the ideas of Charles Darwin (1859). Differentiation was at once both a recognition of the natural (and presumed hereditary) differences so clearly explained by Darwin and a solution to educating those who were incapable of achieving at a high level. Mental ability testing was seen as a way of accomplishing the differentiation.

The transition from military to civilian life was quick and easy for the ability tests. Edward L. Thorndike was among the first to see the potential for using slightly modified versions of the Army Alpha with school children:

As one considers the use of intelligence tests in the army, the question at once arises, “if for the sake of war we can measure roughly the intelligence of a third of a million soldiers in a month, and find it profitable to do so, can we not each year measure the intelligence of every child coming ten years of age, and will not that be still more profitable?” (Thorndike, 1919)

Indeed, some psychologists had given leftover Army Alpha tests in their original form to high school and college students immediately after the war. The Committee, with the financial backing and prestige of the federal government, had developed a set of tools to help schools sort and sift students so that all could benefit from public education. While the benefits of testing may have been overstated (cf. Gould, 1981), educational testing on a grand scale was about to begin.

Further Influence of the Alpha and Beta Tests

At the same time, it should be noted that members of the Committee on the Examination of Recruits went on to commercialize intelligence tests. Henry Goddard had already published a version of the Binet test in 1908 and continued to promote it after the war. World Book published Terman’s Group Test of Mental Ability (Terman, 1920) and Arthur Otis’s Group Intelligence Scale (Otis, 1921) as well as the Stanford Achievement Test (Kelley, Rice, & Terman, 1922). Guy Whipple published the National Intelligence Tests under the auspices of the National Research Council (Whipple, 1921). Walter Bingham went on to found the Psychological Corporation with James McKeen Cattell in 1921. The Psychological Corporation would later publish the Wechsler-Bellevue Intelligence Scales (Wechsler, 1939) as well as other educational and psychological tests over the next 75 years.

Bingham and Terman in particular had a tremendous impact on the field over the remaining years of their lives, teaching (Bingham at Chicago and Terman at Stanford) and publishing books and articles on intelligence and ability testing (cf. Bingham, 1937; Terman, 1916, 1925). These tests and the writings of these few men would not only define intelligence for America, they would also establish the role of large-scale testing in America – the sorting and classifying of students through the application of norms. IQ tests would soon become ubiquitous and play an ever-increasing role in decisions about a wide range of human endeavors, from employment to college admission to placement in school curricula.

Federal Legislation

Federal legislation has become an increasingly prominent shaper of educational testing policy and practice since the 1950s. This section traces the federal role from the establishment of the Department of Health, Education, and Welfare in 1953 through the passage of the Every Student Succeeds Act (ESSA) in 2015.

HEW/USOE

When President Dwight D. Eisenhower established the Department of Health, Education, and Welfare (HEW) in 1953, that department included the U.S. Office of Education (USOE). The early years of the USOE were relatively quiet. Highlights included the following:

• In 1954, the Cooperative Research Act authorized the Office to conduct cooperative research with educational institutions.
• In 1955, the White House Conference on Education produced recommendations for improvement of elementary and secondary schools, leading to an expansion of financial aid to schools.
• Not much else.

USOE spent most of the next two years addressing the recommendations of the 1955 White House Conference. HEW was more focused on health and welfare (Office of the Assistant Secretary for Planning and Evaluation, 1972).

Then came Sputnik I. The launch of the first artificial Earth satellite by the Soviet Union created great anxieties among the American public, and was widely viewed as a tangible demonstration of Soviet superiority in science, math, and engineering. American education – specifically the American/Russian education gap – took center stage. In 1958, roughly 30 percent of high school juniors and seniors were studying science, and fewer than 25 percent were studying math (ASPE, 1972). Few were studying foreign languages. Suddenly, the need for more teachers and better educational methods became quite acute. It was clearly time for action. Congress and the president were ready to act.

NDEA: 1958

Congress responded to the challenges of Sputnik I with the National Defense Education Act (NDEA; PL 85–864), a central purpose of which was to identify, as quickly and efficiently as possible, America’s most capable young scientific minds. Congress appropriated $15 million for 1959 – not a huge sum, even for that era – to set up testing programs, among other things (Section 501). This act marked the first time the U.S. government had intervened in the educational testing of schoolchildren. It would not be the last.

Congress and the White House were quick to point out that the federal government would have no control over education (Section 102). The testing programs established by this law focused on science, mathematics, and foreign languages, specifically on ability in these subjects, and if a state could not legally pay for the tests, USOE would pay for them (Section 504 (b)).

The problem to solve in 1958 was not a measurement one. It was one of how to beat the Russians to outer space. More specifically, NDEA provided money to states to administer tests to see if students were capable of doing science and math well enough to go to college and learn how to build rockets. Standardized norm-referenced tests were the instruments of choice.

President Eisenhower completed his second term and was succeeded by John F. Kennedy in 1961. Americans were still trying to get to outer space before the Russians, and President Kennedy had promised that by the end of the 1960s, we would put a man on the moon and return him safely to earth. We still needed to identify our best and brightest, get them into good colleges and produce better scientists and mathematicians than the Russians, so the programs underwritten by NDEA continued.

ESEA and Reauthorizations: 1965–1974

On November 22, 1963, upon the death of President Kennedy, Vice President Lyndon B. Johnson became president. He immediately ushered in the Great Society, starting with a handful of bills designed to fight poverty. The Elementary and Secondary Education Act (ESEA; PL 89–10) of 1965 was part of that original Great Society package. It would be reauthorized and/or amended every few years for the next five decades.

PL 89–10 said nothing about testing/assessment at any level, but it did provide the mechanisms for establishing, implementing, and requiring specific tests in the future. In particular, the establishment of a National Advisory Council (Section 212) and encouragement of research activities (Title IV) cleared a path for tests to come later. ESEA not only helped to identify problems for educational measurement professionals to solve in the future, it also provided a considerable amount of money ($100 million, per Section 403) to identify and solve those problems. Section 503 (a) (1) addressed planning for and evaluation of programs.

The evaluation component of ESEA was largely the result of efforts by New York’s junior senator, Robert F. Kennedy. Senator Kennedy wanted states and districts to be accountable for providing real aid to disadvantaged children. He had seen enough federal tax dollars siphoned off to purposes unrelated to the enabling legislation to make him want to push for accountability in ESEA (McLaughlin, 1974).

In 1966, the same Congress amended ESEA to require aid to all handicapped students (PL 89–750). The following year, the 90th Congress added bilingual education (PL 90–247). Requirements for testing would soon follow. These two subpopulations would play larger and larger roles in subsequent reauthorizations as court cases (cf. Lau v. Nichols) would bring more public attention to them.

Testing handicapped and bilingual students

Rank ordering was used in identifying the best and the brightest but was not an appropriate paradigm for testing handicapped and bilingual students. Section 602 of the 1966 amendments defined handicapped children to include “mentally retarded, hard of hearing, deaf, speech impaired, visually handicapped, seriously emotionally disturbed, crippled, or other health impaired children who by reason thereof require special education and related services.” Section 604 required state plans specifically designed “to give reasonable promise of substantial progress toward meeting those needs.” Simply identifying such children, aged 3 to 21 (per Section 603), would require advances in testing, and has been and continues to be one of the thorniest problems educational testing professionals face.

Title VII of the 1967 amendments provided for educational programs for bilingual children (defined as “children who come from environments where the dominant language is other than English”). While federal reporting requirements focused squarely on guaranteeing opportunities for children of limited English ability, rather than actual improvement in their English proficiency, the groundwork was laid for later requirements for demonstrating effectiveness through testing of these students.

Between 1965 and 1967, federal law established three specific classes of students who would receive supplemental educational services (disadvantaged, handicapped, and bilingual), paid for by federal tax dollars, and for which local and state educational agencies would be held accountable. Accountability, initially couched in terms of fiscal responsibility and equity with regard to identification of and service to identified groups, would gradually encompass evaluation of effectiveness. And that would require the development of new tests and new testing models.

In 1974, ESEA was reauthorized to introduce a top-down federal approach to standardizing and modernizing educational assessment through the establishment of the Title I Evaluation and Reporting System, a set of program evaluation models and ten regional technical assistance centers to assist state and local education agencies in implementing them. The centers could advise up to a point but could not recommend specific tests or programs. Someone had to create that system, and others had to carry it to every state and local education agency in the country. RMC Research Corporation developed the Title I Evaluation and Reporting System (TIERS; Tallmadge & Wood, 1976) consisting of three models. The U.S. Office of Education then awarded contracts for ten Technical Assistance Centers to assist states and districts in implementing them.

With the passage of the 1974 amendments (PL 93–380), the Commissioner of Education was required to develop program evaluation models which “specify objective criteria,” “outline techniques…and methodology,” and produce data which are “comparable on a statewide and nationwide basis” (Section 151). During the 1970s and 1980s the Title I evaluation models were disseminated throughout the country by a group of technical assistance centers (TACs). The TACs, in their time, were much like the team that created, administered, and popularized the Army Alpha tests in 1917. These ten TACs employed hundreds of testing specialists and interacted with education officials in all fifty states, the District of Columbia, and five territories to bring about a standardized approach to solving an educational measurement problem. At their peak (November 1979 to September 1981), the ten TACs provided over 5,000 workshops and on-site consultations for more than 80,000 clients (Stonehill & Anderson, 1982). The centers’ significance lay not so much in the technical changes in educational measurement they engendered (there weren’t many) but in the upgrading and standardization of testing practice within and across states, at least as it applied to Title I evaluation, which was the largest target available at the time.

As was the case at the close of World War I, the psychometricians and other testing professionals associated with the Title I TACs dispersed, carrying with them skills and ideas they had acquired while working on this project and changing the face of educational measurement over the next 35 years. Some would go on to start new testing companies or revitalize those started earlier in the century; others would move on to teach in major universities and train the next generation of psychometricians.

Education for All Handicapped Children Act: 1975

In 1975, Congress passed a companion law, the Education for All Handicapped Children Act of 1975 (PL 94–142), focusing a spotlight on the plight of some eight million handicapped children receiving inadequate or irrelevant educational services. This law mandated “free appropriate public education which emphasizes special education” (Section 3 (c)) and specifically referenced “children with specific learning disabilities” (Section 4 (a)). This section also called for “early identification and assessment of handicapping conditions in children.” As with ESEA reauthorizations, PL 94–142 required state plans describing services to be provided (Section 613) which were far more detailed than those of PL 89–750.

It is important to note that the Education for All Handicapped Children Act effectively made handicapped children a protected class separate from those covered by ESEA. They would be rejoined in subsequent legislation. Testing requirements under PL 94–142 were greatly tightened, relative to the 1966 amendments, to assure that

testing and evaluation materials and procedures utilized for the purposes of evaluation and placement of handicapped children will be selected and administered so as not to be racially or culturally discriminatory. Such materials or procedures shall be provided and administered in the child’s native language or mode of communication, unless it clearly is not feasible to do so, and no single procedure shall be the sole criterion for determining an appropriate educational program for a child (Section 612 (c))

Reauthorization of ESEA: 1978

In 1978, Congress reauthorized and amended ESEA through Public Law 95–561. Section 505 expanded the scope of evaluation technical assistance to include specific responsibilities for state education agencies (SEAs). Title VII reiterated or expanded the emphasis on bilingual education and evaluation thereof. Title IX, Part B spelled out requirements for SEAs to establish proficiency standards (Section 921), with these provisos:

(b) (2) (C) may provide for the administration of examinations to students, at specified intervals or grade levels, to measure their reading, writing, or mathematical proficiency, or their proficiency in other subjects which the applicant considers appropriate for testing; and

(D) shall contain the assurances of the applicant that any student who fails any examination provided for under subparagraph (C) of this paragraph shall be offered supplementary instruction in the subject matter covered by such examination.

To these ends, Section 922 authorized the Commissioner of Education to assist SEAs in developing capacity to conduct large-scale achievement testing programs. However, as in previous legislation, PL 95–561 was clear as to the voluntary nature of the tests developed or selected under its provisions, allowing local education agencies (LEAs) to opt out of any test or test item so developed or selected.

Section 1242 brought the National Assessment of Educational Progress (NAEP) under direct federal control by making it the responsibility of the National Institute of Education (NIE). It also established a National Assessment Policy Committee (which would later become the National Assessment Governing Board – NAGB). Section 922 provided extremely specific direction for what the National Assessment was to accomplish and how it was to accomplish it. The next section of this chapter explores NAEP in greater detail.

ECIA: 1981

Ronald Reagan was elected president in 1980, and the federal government’s role in education quickly changed. Congress passed the Omnibus Budget Reconciliation Act of 1981 (PL 97–35), which included a subsection devoted to reauthorization of ESEA – the Education Consolidation and Improvement Act (ECIA). Title I of ESEA became Chapter 1 of this subsection. In essence, the reauthorization was a rollback in requirements and a return to more local control, aligning with President Reagan’s agenda for educational reform.

ECIA also pushed the testing envelope a bit farther than previous reauthorizations had done. Section 582 (2) contained the following language:

(B) establishment of educational proficiency standards for reading, writing, mathematics, or other subjects, the administration of examinations to measure the proficiency of students, and implementation of programs (coordinated with those under subchapter A of this chapter) designed to assist students in achieving levels of proficiency compatible with established standards;

Evaluation of Title I effectiveness was shifting from norm-referenced interpretations of test scores to criterion-referenced interpretations, with criteria to be established in ways never before imagined. This particular requirement would be repeated with greater specificity and force in subsequent federal legislation.

The act contained one other interesting provision – a focus on sustained effects (cf. Carter, 1983). Prior to ECIA, Title I evaluation had focused on annual achievement status. By 1981, Congress was interested in how children receiving Title I services fared over time, particularly a year or two after exiting Title I programs. As of 1981, schools, districts, and states would be required to establish quantifiable educational outcomes, administer criterion-referenced tests to see if those outcomes were being met, and follow up to make sure those outcomes were sustained. Complying with this federal mandate would keep psychometricians busy for several years.

Hawkins-Stafford: 1988

The Hawkins-Stafford Elementary and Secondary School Improvement Amendments of 1988 (PL 100–297) established the National Assessment Governing Board (NAGB) to oversee the National Assessment of Educational Progress, which had been under federal control since 1978. The 1988 amendments gave NAGB the task of “identifying appropriate achievement goals for each age and grade in each subject area to be tested under the National Assessment;” (Section 3403 (6) (A) ii).

NAEP proficiency levels had been set in terms of standard deviations above or below the scale mean. PL 100–297 would require that those levels have criterion-referenced meaning. The procedures for setting cut scores were fairly limited at the time. Over the next decade, the playbook for standard setting would need to be rewritten.

PL 100–297 also authorized NAGB to conduct trial state assessments which would pave the way for cross-state comparisons (in addition to the regional comparisons then in place). The “trial” designation for the state assessments would be replaced by “voluntary” in the next reauthorization.

IASA/Goals 2000: 1994

In 1994, Congress passed the Improving America’s Schools Act (IASA; PL 103–382) and a companion law, the Goals 2000: Educate America Act (PL 103–227), largely in response to A Nation at Risk (National Commission on Excellence in Education, 1983), a scathing report on the failure of America’s schools. IASA called for assessments aligned with the content standards to be administered “at some time” between grades 3 and 5, again between grades 6 and 9, and again between grades 10 and 12; i.e., a total of three grades in any school year (Section 1111 (b) (3) (D)). The assessments should include “multiple, up-to-date … measures that assess higher-order thinking skills and understanding” (Section 1111 (b) (3) (E)) and “provide individual student interpretive and descriptive reports” (Section 1111 (b) (3) (H)) as well as disaggregated results within states, districts, and schools by gender, race, limited-English-proficient status, migrant status, disability, and economic status. The testing screws were tightened another notch:

Use of low-level tests that are not aligned with schools’ curricula fails to provide adequate information about what children know and can do and encourages curricula and instruction that focus on the low-level skills measured by such tests. (Section 1001 (c) (3))

Adequate yearly progress was also defined (Section 1111(b) (2) (A) (ii)). The new law provided funding for development of new types of tests – tests measuring higher-order thinking skills in ways that had not been attempted before. The law also established some very specific expectations of these new tests, particularly with respect to score reports and opportunity to learn. IASA presented new problems to solve:

• Alignment issues – ultimately ushering in the Common Core State Standards (CCSS)
• Opportunity to learn issues, partly in response to court cases (cf. Debra P. v. Turlington)
• Creation of new item types
• Defining and measuring adequate yearly progress (AYP)

IASA also put pressure on test developers and administrators to come up with meaningful, usable test score reports and help end users make the most of them.

IDEA: 1997

The Individuals with Disabilities Education Act (IDEA; PL 105–17) specifically required that students with handicaps be included in any testing program, using alternate assessments where necessary (Section 612). Such alternate assessments would be based on alternate achievement standards if they existed and would be constructed using the principles of universal design (cf. Thurlow et al., 2016).

NCLB: 2002

No Child Left Behind (NCLB; PL 107–110) contained virtually all the testing language from IASA plus language from IDEA to cover testing of students with disabilities, effectively reuniting the three classes of students covered by the 1966 and 1967 amendments.

NCLB expanded the scope of testing to grades 3–8 plus one grade in high school. Previously (i.e., under IASA), schools and districts could test one grade in elementary school, one grade in middle school, and one grade in high school. This change in language more than doubled the number of students tested in some states.

An ongoing emphasis on education and testing of ethnic and language minorities and students with disabilities had been a central feature of every reauthorization of ESEA since 1966. In NCLB, Congress explicitly called for special assessments for English language learners and provided funds for development of those tests through the Enhanced Assessment Grants (Section 6112). Four consortia of states obtained grants under Section 6112 and developed these tests:

• Comprehensive English Language Learner Assessment (CELLA; lead state Pennsylvania)
• English Language Development Assessment (ELDA; lead state Nevada)
• Mountain West Assessment (MWA; lead state Utah)
• Assessing for Comprehension and Communication in English State to State for English Language Learners (ACCESS-ELLs; lead state Wisconsin)

NCLB also encouraged state consortia to develop tests for the general population. Ultimately, this provision led to the competitive award of grants to the Partnership for Assessment of Readiness for College and Careers (PARCC) and the Smarter Balanced Assessment Consortium (Smarter Balanced) to create and administer tests to students in multiple states. Other consortia would also develop alternate assessments:

• Dynamic Learning Maps (DLM) Alternate Assessment Consortium
• National Center and State Collaborative (NCSC)

Defining “proficient” became a primary focus of measurement professionals during the first decade of this century. The term was already in common usage as a result of 30 years of NAEP reports. Nevertheless, defining “proficient” in a way that would work in every state was a challenge. With the expansion of testing to a contiguous band of grades (3 through 8), defining “proficient” was more than a one-grade challenge. States grappled with this problem for years. In response, Applied Measurement in Education devoted an entire issue to vertical articulation of cut scores (Cizek, 2005). While NCLB did not specifically mention college and career readiness, that term came to dominate assessment design and development for the next decade. The law reiterated the need for rigorous, uniform academic standards and tests to measure mastery of those standards stated in IASA. The next step would be the creation of the Common Core State Standards (National Governors Association Center for Best Practices, Council of Chief State School Officers, 2010a, 2010b) which were quite explicit about college and career readiness.

This was a paradigm shift. Since the first mention of standards and assessment in the 1974 reauthorization (PL 93–380), all testing had been to determine the extent to which students had mastered what they had just been taught. With college and career readiness, the focus shifted to predictive validity – the extent to which mastery of what has just been taught would lead to mastery in whatever comes next. This concept applied not just to college and career readiness of high school students but to readiness for successful participation in instruction in the next grade for elementary and middle school students. To facilitate comparability of test results across states, NCLB not only encouraged consortia as previous reauthorizations had done, it provided funding to make sure comparability was achieved. PARCC and Smarter Balanced together received over $330 million to create tests (Press Office, ED, 2010). There were additional challenges as well: new data systems, online testing, and other computer-related issues. Since the previous reauthorization of ESEA, Congress had passed the Individuals with Disabilities Education Act (IDEA; PL 105–17) of 1997 with specific reference to how special education students should be tested. By referencing a subsection of that law (Section 612 (a) (16)) in NCLB, Congress required that students with disabilities be given alternate assessments based on alternate academic standards if the state had developed them. It would not take long, under the peer review process, for all states to have such standards. Thus, educational measurement professionals had a new challenge: working with special educators to create these new tests based on these new standards.

ESSA (2015)

While much of the language of NCLB and its predecessors remains in the Every Student Succeeds Act (ESSA; PL 114–95), the main effect of ESSA has been to move much of the authority and responsibility for assessment back to the states. In most instances, this means looser requirements. The exception is assessment of English language learners (ELLs). Under ESSA, not only do states have to assess ELLs, they have to show that the students are progressing year to year. While NCLB referenced the “alternate assessments based on alternate academic standards” section of IDEA, ESSA includes that language explicitly. These requirements raise again the growth/AYP problem that educational measurement professionals had been working on for 15 years under NCLB.

Another new wrinkle in ESSA was the requirement to work with higher education. This is largely an extension of the College/Career Ready notion introduced by the Common Core. The new language, relative to NCLB, is that Section 1111 calls for alignment of content standards with credit-bearing postsecondary coursework.

Given the advances in technology since 2002, ESSA has more to say about the use of computer technology. At the time of passage of NCLB in 2002, very few large-scale tests were delivered online. However, by 2015, both PARCC and Smarter Balanced tests had been administered online, and Smarter Balanced tests were being delivered in computer adaptive mode. Section 1111 provides specific guidelines for the administration of online computer adaptive tests, including the use of test items above or below the student’s grade level.

One of the requirements relaxed in ESSA was the provision that states may administer a series of interim assessments and use their aggregated results in the place of a single summative assessment. This option no doubt creates new opportunities for psychometricians to come up with creative solutions.

Summary

Over the course of 60 years, the federal government has steadily increased its involvement in and shaping of testing policy and practice. Prior to 1958, there was no federal involvement in student testing. In its initial forays into the field, Congress was quick to point out that it would have no authority over test content or who was tested. Gradually, Congress and the U.S. Office of Education and later the Department of Education (established in 1980) would exercise greater and greater authority. Starting with support for administering tests to discover America’s best and brightest in the wake of Sputnik I and then providing simple guidelines for evaluation of a single program (ESEA Title I) in 1965, federal involvement in educational testing has expanded to dictate which students should be tested, when and how they should be tested, and to a large extent the content on which they should be tested. Indeed, the federal government provided funding for development of the most widely used assessments of today: NAEP, PARCC, Smarter Balanced, and consortium tests for special education and ELL students.

Beyond those direct inputs, federal mandates have taken measurement problems that were on the minds of a few psychometricians and placed them on center stage. Measurement of change, for example, has been in the literature for some time (cf. Cronbach & Furby, 1970), but without the need to calculate sustained effects (PL 97–35), adequate yearly progress (PL 103–382, PL 107–110), or achievement growth (PL 114–95), how much attention might that problem have received?

The modus operandi of the federal government has been to focus attention on selected problems and provide resources for addressing those problems. As was the case in 1917 when America entered World War I, this approach also activated human resources not only to solve the problem at hand but to reshape the testing landscape over the next 50 years.
The hundreds of testing professionals employed by Title I Technical Assistance Centers in the 1970s and 1980s became the core staff of testing companies, state assessment programs, and even federal education agencies and bureaus. PARCC, Smarter Balanced, and other consortium tests developed under federal grants touch very nearly every state and district and continue to shape public understanding of testing.


The National Assessment of Educational Progress

The preceding section focused primarily on the Elementary and Secondary Education Act of 1965 and its many subsequent amendments and reauthorizations, including brief references to the National Assessment of Educational Progress (NAEP). As NAEP has a rich history of its own innovation and leadership in large-scale assessment, it is the subject of this section. In 1867 Congress established a Department of Education, one of the purposes of which was:

collecting such statistics and facts as shall show the condition and progress of education in the several States and Territories, and of diffusing such information respecting the organization and management of schools and school systems, and methods of teaching, as shall aid the people of the United States in the establishment and maintenance of efficient school systems, and otherwise promote the cause of education throughout the country (PL 39–73, Sec. 1)

In essence, the primary job of the first U.S. Department of Education was to produce the nation’s report card. The Department was soon downgraded to an office and ultimately folded into the Department of the Interior. It would be a century before a report card would be produced. The National Assessment of Educational Progress was conceived in 1964 and initially funded by the Carnegie Corporation. In 1969, the fledgling national assessment was taken over by the Education Commission of the States (ECS) with continued funding by Carnegie. Over the next several years, the U.S. Office of Education paid an increasingly large portion of its budget, but ECS continued to direct NAEP with mixed Carnegie/federal funding. Congress took over all funding for NAEP in 1972 and officially moved NAEP to the National Institute of Education in 1978, with oversight by an Assessment Policy Committee. The larger history of NAEP is well documented elsewhere (e.g., Vinovskis, 1998; Stedman, 2009).
Mazzeo, Lazer, & Zieky (2006) provide an excellent technical review of the evolution of NAEP in comparison with other large-scale assessments of similar scope and purpose. This section focuses primarily on the innovations associated with NAEP that became standard practice for the rest of the assessment community and examines how the organization and operations of NAEP have served as a model for other testing programs. As noted in the development of the Army Alpha and Beta tests, the Title I Technical Assistance Centers, consortium tests, and other endeavors, when the federal government gets involved in a project, it does so in a big way, bringing to the table financial and human resources unavailable to smaller government agencies or private companies. Thus, when the U.S. Office of Education returned to the task of “collecting such statistics and facts as shall show the condition and


progress of education in the several States and Territories,” it did so on a large scale. In 1963, Education Commissioner Francis Keppel contacted Dr. Ralph Tyler, one of the most respected educators and test experts in the country, to help him get the process of a national assessment started. Over time, USOE (and later the Department of Education) would contract with the Research Triangle Institute, Westat, National Computer Systems (NCS), Educational Testing Service (ETS), American College Testing (ACT), Human Resources Research Organization (HumRRO), American Institutes for Research (AIR), and a host of prominent psychometricians and policy analysts to conduct its work and/or evaluate that work. Commissioner Keppel and his successors would continue to seek out the most accomplished and trusted test experts and testing organizations in the country to carry out the work of NAEP, and they had the resources to do so. Tyler’s first suggestion was to test a sample of students rather than all the students in the country. While this approach may seem obvious now, the legislative mandate was ambiguous on this point. Sampling solved several technical, practical, and political problems. It automatically ruled out reporting results on an individual basis (which was strictly forbidden by law) or even school-by-school or district-by-district. Moreover, the sampling plan ultimately agreed upon intentionally ruled out state-by-state comparisons. Testing several thousand students at a few grades would also be much less expensive than testing millions of students at all grades and would permit the application of sophisticated statistical techniques that would produce a reliable and valid report on the state of the nation’s educational attainment, which was the original goal of the legislation. Over time, the NAEP sampling plan became more sophisticated, with balanced incomplete block (BIB) design appearing in 1983 (Beaton & Gonzalez, 1995). 
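As a rough sketch of the idea behind a balanced incomplete block arrangement, the Python below builds seven hypothetical booklets from seven item blocks, three blocks per booklet, so that every pair of blocks shares exactly one booklet, and then hands the booklets out in rotation. The block count, the cyclic offsets (0, 1, 3), the function names, and the student labels are all illustrative assumptions; they are not taken from NAEP design documents.

```python
from collections import Counter
from itertools import combinations


def build_bib_booklets(n_blocks=7, offsets=(0, 1, 3)):
    """Build booklets of block indices with a cyclic construction.

    With 7 blocks and offsets (0, 1, 3), every pair of blocks lands
    together in exactly one booklet (an assumed, textbook-style design).
    """
    return [tuple((start + off) % n_blocks for off in offsets)
            for start in range(n_blocks)]


def pair_counts(booklets):
    """Count how many booklets contain each unordered pair of blocks."""
    counts = Counter()
    for booklet in booklets:
        for pair in combinations(sorted(booklet), 2):
            counts[pair] += 1
    return counts


def spiral(booklets, student_ids):
    """Hand out booklets in rotation so adjacent students get different forms."""
    return {sid: booklets[i % len(booklets)]
            for i, sid in enumerate(student_ids)}


booklets = build_bib_booklets()
counts = pair_counts(booklets)
students = [f"student_{k:02d}" for k in range(1, 15)]  # hypothetical roster
assignment = spiral(booklets, students)

for sid in students[:3]:
    print(sid, "-> blocks", assignment[sid])
```

Because each block is paired with every other block somewhere in the booklet set, item statistics can be linked across booklets even though no single student sees the whole pool.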
The BIB design, sometimes referred to as “spiraling,” divides all the items in a pool of items to be administered into multiple test booklets (usually with overlapping items) and then distributes those booklets over thousands of students. Within a given classroom, each student may take a different form of the test, enhancing the security of the administration and better ensuring that the item statistics are not affected by regional or classroom differences. Results can then be aggregated to produce an overall result for that grade and subject. Because this approach is so common now, it may seem trivial that it was used in 1983, but it was not common then except for field testing of new items. Indeed, the overall design of NAEP took a giant leap forward in 1983. The U.S. Department of Education (ED) issued a request for proposals for design and conduct of NAEP, and ETS was the winning bidder. The ETS design for NAEP (Messick, Beaton, & Lord, 1983) did not just introduce a more sophisticated sampling design; it basically changed every aspect of the testing program and the way in which test developers would interact with test users. One of the most striking innovations was the application of item response theory (IRT) to the construction of the tests and linking of tests across forms


within a given year and across years. IRT had been around for some time (cf. Lord & Novick, 1968), but most commercially available tests were still being constructed and equated using classical item statistics and equipercentile equating of whole test forms. Using IRT on such a large and high-profile project effectively brought it into the mainstream (Linn, 1989). A sophisticated sampling plan, combined with IRT, made it possible to compute plausible values for each student tested (cf. Mislevy & Sheehan, 1987). Although NAEP is legally forbidden to produce individual student results, estimation of student scores is critical to the estimation of population standard error. The ETS approach was to calculate multiple estimates of each student’s ability and the standard error of that estimate based on blocks of items within a larger test. From its inception to the early 1990s, NAEP reported results in terms of average scores on a scale of 0 to 500. Reports also included exemplar items showing what students scoring at benchmarked scale score points knew and could do. However, in 1988, PL 100–297 directed the National Assessment Governing Board to identify “appropriate achievement goals for each age and grade in each subject area to be tested under the National Assessment.” This mandate required that NAGB define proficiency levels and establish cut scores on each of the NAEP tests. Early attempts to set cut scores relied on available standard-setting methodologies; principally, the modified Angoff procedure (Angoff, 1971). Although those early attempts received considerable criticism (cf. U.S. General Accounting Office, 1993), some of the techniques NAEP staff and consultants used inspired many of today’s commonly used standard-setting practices:     

• Extended Angoff procedure (Hambleton & Plake, 1995)
• Bookmark procedure (Lewis, Mitzel, Mercado, & Schulz, 2012)
• Item descriptor match procedure (Ferrara & Lewis, 2012)
• Modified body of work procedure (Wyse, Bunch, Deville, & Viger, 2012)
• Detailed achievement level descriptors (Bourque, 2000)

The early NAEP standard setting activities generated a great deal of heat and eventually a good bit of light. NAGB and NCES cosponsored a conference on standard setting in October 1994 (National Academies of Sciences, Engineering and Medicine, 2017) that produced 19 commissioned papers and led to a flurry of activity in the 1990s and early 2000s that ultimately produced improvements in development of achievement level descriptors, selection of panelists, panelist training and feedback, and overall management of standard setting, in addition to a bumper crop of new standard setting techniques. Throughout its history, NAEP has issued reports, not only on the “condition and progress of education in the several States and Territories,” but on technical innovations and procedures. It has pioneered score reporting formats and continues to push boundaries in that domain. Currently, NAEP provides datasets, technical guidance, and a wide range of technical and nontechnical publications.
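The plausible-values idea mentioned earlier in this section can be illustrated with a deliberately simplified sketch. It assumes a Rasch model, a standard normal prior on ability, and made-up item difficulties for a single block of five items; operational NAEP procedures are considerably more elaborate, and none of the function names or numbers below come from NAEP documentation. The point is only that several values are drawn from each student's ability distribution rather than a single score being reported.

```python
import math
import random

# Ability grid from -4.0 to 4.0 in steps of 0.1 (an assumed resolution).
GRID = [step / 10.0 for step in range(-40, 41)]


def rasch_probability(theta, difficulty):
    """Probability of a correct response under the Rasch model."""
    return 1.0 / (1.0 + math.exp(-(theta - difficulty)))


def posterior_on_grid(responses, difficulties, grid=GRID):
    """Normalised posterior over ability, using an assumed N(0, 1) prior."""
    weights = []
    for theta in grid:
        weight = math.exp(-0.5 * theta * theta)  # unnormalised prior density
        for answer, difficulty in zip(responses, difficulties):
            p = rasch_probability(theta, difficulty)
            weight *= p if answer == 1 else 1.0 - p
        weights.append(weight)
    total = sum(weights)
    return [w / total for w in weights]


def posterior_mean(responses, difficulties, grid=GRID):
    """Posterior mean ability on the grid."""
    posterior = posterior_on_grid(responses, difficulties, grid)
    return sum(t * w for t, w in zip(grid, posterior))


def draw_plausible_values(responses, difficulties, n_draws=5, seed=0):
    """Draw several ability values from the posterior, not one point estimate."""
    posterior = posterior_on_grid(responses, difficulties)
    rng = random.Random(seed)
    return rng.choices(GRID, weights=posterior, k=n_draws)


# Hypothetical block of five items and two hypothetical response patterns.
block_difficulties = [-1.0, -0.5, 0.0, 0.5, 1.0]
strong_pattern = [1, 1, 1, 1, 0]
weak_pattern = [1, 0, 0, 0, 0]

strong_draws = draw_plausible_values(strong_pattern, block_difficulties, seed=1)
weak_draws = draw_plausible_values(weak_pattern, block_difficulties, seed=2)
print("strong pattern draws:", strong_draws)
print("weak pattern draws:", weak_draws)
```

In a group-level analysis, a statistic would be computed once per set of draws and the results combined, so the spread across draws carries the measurement uncertainty into the population estimate.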


By law, the National Assessment Governing Board has a limited number of professional staff. However, nationwide administration of the various NAEP tests keeps a small army of professionals busy. For the past seven cycles (2003–2015) all fifty states have participated in the reading and mathematics assessments. To participate, each state must have a NAEP coordinator, and every participating school must have a school coordinator. NAEP representatives, state coordinators, and school coordinators constitute a considerable workforce. As much as any technical advances made by NAEP, it is this close interaction with educators and test specialists in every state that has made NAEP the nation’s report card. NAEP has continued to set the pace for large-scale assessment. Yen & Fitzpatrick (2006), for example, refer to NAEP as “the most famous example of a matrix-sampled test…NAEP is also notable for the use of IRT in combination with advances in missing data technology and hierarchical analyses to estimate population characteristics without estimating individual examinee scores” (p. 145). NAEP is cited in five separate chapters of Educational Measurement 3rd Edition (Linn, 1989) and in eight separate chapters in Educational Measurement 4th Edition (Brennan, 2006). In both editions, the editors cite the significant contributions of NAEP to the fundamentals of modern psychometrics and test program administration. One distinct advantage enjoyed by NAEP is the stability of its management. While positions on the National Assessment Governing Board rotate on a legislatively prescribed basis, the overall direction of NAEP has remained remarkably stable for the past 35 years. Few state assessment programs or other large-scale testing programs can make such a claim. While NAEP has continued to be at the forefront of technological and psychometric advances, its basic focus has remained unchanged.
Given this stability, NAEP can take the time necessary to create and deliver defensible tests, as shown in a typical timetable:

• Develop the assessment framework (2 years)
• Create the assessment (2–5 years)
• Select the participants (13 months)
• Prepare, package, and deliver materials for assessment day (8 months)
• Administer the assessment (3 months)
• Score the assessment (4–6 months)
• Analyze and report the results (2–6 months)

Obviously, with no individual student results to report, NAEP can take its time in analyzing and reporting results. Testing programs with student, school, and district reporting requirements do not have that luxury. It is with regard to the front-end tasks, however, where the oversight stability of NAEP shines. Change in leadership (with concomitant change in test direction or philosophy) is so frequent in most large-scale testing programs that they do not realistically have two years to develop a test framework or five years to develop a single test. But NAEP has shown that it is possible.


Summary and Conclusions

Over the past century, the federal government has steadily increased its role in the shaping of educational testing policy and practice. Sometimes that role has been subtle, as in the funding of a short-term project in 1917 to create group ability tests. Those tests became the driving force of a national intelligence testing movement that would last for generations. In an attempt to close the education gap with the Russians in 1958, the federal government once again looked to testing and left leadership to testing experts outside the government as it had done in 1917. Another crisis in 1965, this one internal, led to a series of federal laws that would become increasingly specific and directive over the next fifty years. The Elementary and Secondary Education Act of 1965 – and all its subsequent reauthorizations, amendments, and companion laws – ushered in an era of increasing federal oversight and direct influence on educational testing, culminating in specific direction as to what would be tested, when and how it would be tested, and who would take the tests. As its authority over testing was increasing, the federal government was also providing considerable technical and technological leadership in the field. The Title I Technical Assistance Centers of the 1970s and ’80s helped to standardize test selection, administration, and use throughout the nation. The National Assessment of Educational Progress – The Nation’s Report Card – has set standards for test design, administration, analysis, and reporting for decades. Who could have predicted a century ago that a relatively small project directed by APA president Robert Yerkes and funded by the Surgeon General of the United States would have led to the current state of educational testing in America? As Harold Hand (1965) pointed out after the first administration of the National Assessment of Educational Progress, the camel had gotten its nose under the tent.
In the years since, the camel has gradually gotten its neck, body, and tail into the tent and is helping to run the circus.

Note

1 The author gratefully acknowledges the helpful review by Dr. Joe McClintock of an earlier draft of this chapter.

References

Allen, N. L., Carlson, J. E., and Zelenak, C. A. (1999). The NAEP 1996 technical report, NCES 1999–452. Washington, DC: National Center for Education Statistics.
Anderson, J. K., Johnson, R. T., Fishbein, R. L., Stonehill, R. M., and Burns, J. C. (1978). The U. S. Office of Education models to evaluate E. S. E. A. Title I: Experiences after one year of use. Washington, DC: Office of Planning, Budgeting, and Evaluation, U. S. Office of Education.


Angoff, W. H. (1971). Scales, norms, and equivalent scores. In R. L. Thorndike (Ed.), Educational measurement (2nd Ed.). Washington, DC: American Council on Education.
Beaton, A. E., and Gonzalez, E. (1995). NAEP primer. Chestnut Hill, MA: Boston College. ERIC Document Reproduction Service ED 404 374.
Bingham, W. V. (1937). Aptitudes and aptitude testing. New York: Harper & Brothers.
Bourque, M. L. (2000). Setting performance standards: The role of achievement level descriptors in the standard setting process. Paper presented at the annual meeting of the National Council on Measurement in Education, New Orleans, LA.
Brennan, R. L. (2006). Perspectives on the evolution and future of educational measurement. In R. L. Brennan (Ed.), Educational measurement (4th Ed.). Westport, CT: Praeger.
Carter, L. F. (1983). A study of compensatory and elementary education: The sustaining effects study. Final report. Washington, DC: Office of Program Evaluation. ERIC Document Reproduction Service ED 246 991.
Cizek, G. J. (2005). Special issue: Vertically moderated standard setting. Applied Measurement in Education, 18 (1).
Cronbach, L. J. (1971). Test validation. In R. L. Thorndike (Ed.), Educational measurement (2nd Ed.). Washington, DC: American Council on Education.
Cronbach, L. J. and Furby, L. (1970). How should we measure change, or should we? Psychological Bulletin, 74, (1), 68–80.
Darwin, C. (1859). On the origin of species. London: John Murray; Electronic Classic Series Publication, J. Manis (Ed.). Hazleton, PA: The Pennsylvania State University.
Debra P. v. Turlington 644 F.2d 397 (5th Cir. 1981).
DuBois, P. H. (1970). A history of psychological testing. Boston, MA: Allyn and Bacon.
Ferrara, S., and Lewis, D. (2012). The item descriptor (ID) match method. In G. J. Cizek (Ed.), Setting performance standards: Foundations, methods, and innovations (2nd Ed.). New York: Routledge.
Goddard, H. H. (1908). The Binet and Simon tests of intellectual capacity. Vineland, NJ: The Training School.
Gould, S. J. (1981). The Mismeasure of Man. New York: Norton.
Hambleton, R., and Plake, B. (1995). Using an extended Angoff procedure to set standards on complex performance assessments. Applied Measurement in Education, 8, 41–56.
Hand, H. (1965). National assessment viewed as the camel’s nose. Kappan, 1, 8–12.
Kelly, F. J. (1915). The Kansas silent reading test. Topeka, KS: Kansas State Printing Plant.
Kelley, T. L., Rice, G. M., and Terman, L. M. (1922). Stanford achievement test. Yonkers on Hudson, NY: World Book.
Lau v. Nichols, 414 U.S. 563 (1974).
Lewis, D. M., Mitzel, H. C., Mercado, R. L. and Schulz, E. M. (2012). The bookmark standard setting procedure. In G. J. Cizek (Ed.), Setting performance standards: Foundations, methods, and innovations (2nd Ed.). New York: Routledge.
Linn, R. L. (1989). Current perspectives and future directions. In R. L. Linn (Ed.), Educational measurement (3rd Ed.). New York: Macmillan.
Lord, F. M., and Novick, M. R. (1968). Statistical theories of mental test scores. Reading, MA: Addison-Wesley.
Mazzeo, J., Lazer, S., and Zieky, M. J. (2006). Monitoring educational progress with group-score assessments. In R. L. Brennan (Ed.), Educational measurement (4th Ed.). Westport, CT: Praeger.
McLaughlin, M. W. (1974). Evaluation and reform: The Elementary and Secondary Education Act of 1965, Title I. Santa Monica, CA: Rand.


Mislevy, R. J. & Sheehan, K. M. (1987). Marginal estimation procedures. In A. E. Beaton (Ed.), The NAEP 1983–1984 Technical Report (pp. 293–360). Report No. 15-TR-20. Princeton, NJ: Educational Testing Service.
National Academies of Sciences, Engineering, and Medicine (2017). Evaluation of the achievement levels for mathematics and reading on the National Assessment of Educational Progress. Washington, DC: The National Academies Press. doi:10.17226/23409.
National Governors Association Center for Best Practices, Council of Chief State School Officers (2010a). Common Core State Standards: English Language Arts. Washington DC: Author.
National Governors Association Center for Best Practices, Council of Chief State School Officers (2010b). Common Core State Standards: Mathematics. Washington DC: Author.
Office of the Assistant Secretary for Planning and Evaluation (1972). A common thread of service. A history of the Department of Health, Education, and Welfare. Retrieved 04/25/19 from https://aspe.hhs.gov/report/common-thread-service/history-department-health-education-and-welfare.
Otis, A. S. (1921). Otis group intelligence scale manual of directions for primary and advanced examinations: 1921 revision. Yonkers on Hudson, NY: World Book. Retrieved 4/29/19 from https://babel.hathitrust.org/cgi/imgsrv/image?id=chi.086522226.
Press Office, U.S. Department of Education (2010). U.S. Secretary of Education Duncan Announces Winners of Competition to Improve Student Assessments. Press Release September 10, 2010.
Reed, J. (1987). Robert M. Yerkes and the mental testing movement. In M. M. Sokal (Ed.), Psychological testing and American society: 1890–1930. Rutgers, NJ: Rutgers University Press.
Samelson, F. (1987). Was early mental testing a) racist inspired, b) objective science, c) a technology for democracy, d) the origin of multiple-choice exams, e) none of the above? In M. M. Sokal (Ed.), Psychological testing and American society: 1890–1930. Rutgers, NJ: Rutgers University Press.
Stedman, L. C. (2009). The NAEP long-term trend assessment: A review of its transformation, use, and findings. (Paper Commissioned for the 20th Anniversary of the National Assessment Governing Board 1988–2008) ERIC Document Reproduction Service ED 509 383.
Stonehill, R. M., and Anderson, J. I. (1982). An Evaluation of ESEA Title I–Program Operations and Educational Effects. A Report to Congress. Washington, DC. Office of Planning, Budget, and Evaluation, U.S. Department of Education.
Tallmadge, G. K., and Wood, C. T. (1976). User’s guide: ESEA Title I evaluation and reporting system. Mountain View, CA: RMC Research Corporation.
Taylor, F. W. (1915). The principles of scientific management. New York: Harper.
Terman, L. M. (1916). The measurement of intelligence: An explanation of and a complete guide for the use of the Stanford revision and extension of the Binet-Simon intelligence scale. Cambridge, MA: Riverside.
Terman, L. M. (1920). Terman group test of mental abilities for grades 7 to 12. Yonkers on Hudson, NY: World Book.
Terman, L. M. (1925). Genetic studies of genius. Stanford, CA: Stanford University Press.
Thorndike, E. L. (1919). Scientific personnel work in the army. Science, 49, (1255), 53–61.
Thurlow, M. L., Lazarus, S. S., Christensen, L. L., & Shyyan, V. (2016). Principles and characteristics of inclusive assessment systems in a changing assessment landscape (NCEO Report 400). Minneapolis, MN: University of Minnesota, National Center on Educational Outcomes.


United States (1867). An act to establish a Department of Education: 39th Cong., 2nd sess., Public law 39–73.
United States (1958). National defense education act of 1958: H. R. 13247, 85th Cong., 2nd sess., Public law 85–864. Reports, bills, debate and act. [Washington]: [U.S. Govt. Print. Off.].
United States (1965). Elementary and secondary education act of 1965: H. R. 2362, 89th Cong., 1st sess., Public law 89–10. Reports, bills, debate and act. [Washington]: [U.S. Govt. Print. Off.].
United States (1966). Elementary and secondary education amendments of 1966: H. R. 13161, 89th Cong., 2nd sess., Public law 89–750. Reports, bills, debate and act. [Washington]: [U.S. Govt. Print. Off.].
United States (1967). Elementary and secondary education amendments of 1967: H. R. 7819, 90th Cong., 1st sess., Public law 90–247. Reports, bills, debate and act. [Washington]: [U.S. Govt. Print. Off.].
United States (1974). Education amendments of 1974: H. R. 69, 93rd Cong., 1st sess., Public law 93–380. Reports, bills, debate and act. [Washington]: [U.S. Govt. Print. Off.].
United States (1975). Education of all handicapped children act of 1975: S. 6, 94th Cong., 1st sess., Public law 94–142. Reports, bills, debate and act. [Washington]: [U.S. Govt. Print. Off.].
United States (1978). Education amendments of 1978: HR 15, 95th Cong., 2nd sess., Public law 95–561. Reports, bills, debate and act. [Washington]: [U.S. Govt. Print. Off.].
United States (1981). Omnibus budget reconciliation act of 1981: H. R. 3982, 97th Cong., 1st sess., Public law 97–35. Reports, bills, debate and act. [Washington]: [U.S. Govt. Print. Off.].
United States (1988). Augustus F. Hawkins-Robert T. Stafford Elementary and Secondary School Improvement Amendments of 1988: H.R. 5, 100th Cong., 2nd sess., Public law 100–297. Reports, bills, debate and act. [Washington]: [U.S. Govt. Print. Off.].
United States (1994). Goals 2000 Educate America Act: HR 1804, 103rd Cong. 2nd sess. Public Law 103–227. Reports, bills, debate and act. [Washington]: [U.S. Govt. Print. Off.].
United States (1994). Improving America’s Schools Act: HR 6, 103rd Cong., 2nd sess. Public Law 103–382. Reports, bills, debate and act. [Washington]: [U.S. Govt. Print. Off.].
United States (1997). Individuals with disabilities education act of 1997: H. R. 1350, 108th Cong., 1st sess., Public law 108–446. Reports, bills, debate and act. [Washington]: [U.S. Govt. Print. Off.].
United States (2002). No child left behind act of 2002: H. R. 1, 107th Cong., 1st sess., Public law 107–110. Reports, bills, debate and act. [Washington]: [U.S. Govt. Print. Off.].
United States (2015). Every student succeeds act of 2015: S. 1177, 114th Cong., 1st sess., Public law 114–95. Reports, bills, debate and act. [Washington]: [U.S. Govt. Print. Off.].
United States Department of Commerce (1975). Historical statistics of the United States: Colonial times to 1970. Washington, DC: Bureau of the Census.
United States General Accounting Office (1993). Educational achievement standards: NAGB’s approach yields misleading interpretations. Washington, DC: Author.


United States National Commission on Excellence in Education (1983). A nation at risk: The imperative for educational reform: a report to the nation and the Secretary of Education, United States Department of Education. Washington, DC: The Commission: [Supt. of Docs., U.S. G.P.O. distributor].
Unsigned (1919). The National Research Council: Organization of the National Research Council. Science New Series 49, (1272), 458–462. Retrieved 11/21/16 from http://www.jstor.org/stable/1642812.
Vinovskis, M. A. (1998). Overseeing the nation’s report card: The creation and evolution of the National Assessment Governing Board. Paper prepared for the National Assessment Governing Board. Ann Arbor, MI: University of Michigan. Retrieved 5/7/19 from https://www.nagb.gov/assets/documents/publications/95222.pdf.
Wechsler, D. (1939). Wechsler-Bellevue intelligence scale. New York: Psychological Corporation.
Whipple, G. M. (1910). Manual of mental and physical tests: A book of directions compiled with special reference to the experimental study of school children in the laboratory or classroom. Baltimore, MD: Warwick & York. Retrieved 4/29/19 from https://babel.hathitrust.org/cgi/pt?id=loc.ark:/13960/t0000rq1s.
Whipple, G. M. (1921). The national intelligence tests. The Journal of Educational Research, 4, (1), 16–31. Retrieved 4/29/19 from https://www.jstor.org/stable/27524498.
Wyse, A. E., Bunch, M. B., Deville, C., and Viger, S. (2012). A modified body of work standard-setting method with construct maps. Paper presented at the 2012 Meeting of the National Council on Measurement in Education in Vancouver, British Columbia.
Yen, W., and Fitzpatrick, A. R. (2006). Item response theory. In R. L. Brennan (Ed.), Educational measurement (4th Ed.). Westport, CT: Praeger.
Zenderland, L. A. (1987). The debate over diagnosis: Henry Goddard and the medical acceptance of intelligence. In M. M. Sokal (Ed.), Psychological testing and American society: 1890–1930. Rutgers, NJ: Rutgers University Press.

5
HISTORICAL MILESTONES IN THE ASSESSMENT OF ENGLISH LEARNERS
Jamal Abedi and Cecilia Sanchez¹

Introduction

The academic assessment of students, including English learners (ELs), in the United States dates back more than a century. The practice of large-scale assessment began with intelligence testing in the 1910s (cf. Mclean, 1995). Emphasis on intelligence testing quickly shifted to academic achievement testing by the 1930s (U.S. Congress, 1992, pp. 120–124). In previous decades, ELs simply took whatever tests other students were taking. Not until the passage of the Bilingual Education Act in 1967 and a series of federal regulations were the needs of ELs truly starting to be addressed across the nation. By the 1990s, achievement tests closely related to content taught in school were developed, which increased the validity of assessments. Currently, accommodation research is underway to determine the most effective accommodations for ELs. Today, there are approximately five million ELs in grades K–12 in the country, representing approximately 10 percent of all K–12 students (NCES, 2020a). ELs remain the fastest-growing subgroup of students in the nation. Much EL student growth has been concentrated in specific states. In 2016, for example, California’s EL population was 20.2 percent, while West Virginia’s was less than 1 percent (McFarland et al., 2019). Although ELs’ native languages are diverse, the largest group is Spanish speaking. As of 2017, 80 percent or more of ELs in 11 states, including California, Texas, and Arizona, had a Spanish language background (U.S. Department of Education OELA, 2019). Regardless of demographics, ELs’ academic achievement has continually lagged behind that of non-EL students. For example, on the 2019 National Assessment of Educational Progress (NAEP) fourth grade reading test, 69 percent of ELs versus 29 percent of non-ELs scored


below basic (NCES, 2020b). To better understand this dilemma and the accountability efforts to address it, this chapter summarizes EL assessment history and identifies major milestones in EL assessment. In the chapter we discuss the following key historical periods and topics as major influences on ELs’ assessment:

• Intelligence Testing
• Transition into Achievement Testing
• Federal Acts and Legislations
• Content-based Assessments
• Accommodation Influence
  a Accommodation Assignment
  b Accommodation Techniques
• EL Students with Learning Disabilities

Intelligence Testing

During the 1910s intelligence tests were first used for Army job placement and later transitioned into intelligence measurement tools for children in school (cf. Bunch, Ch. 4, this volume). Through a series of revisions to the Binet-Simon intelligence scale by Lewis Terman in 1912, the test was renamed the Stanford-Binet scale and was administered to thousands of students in the late 1910s (U.S. Congress, 1992, p. 118). Educational assessment experts quickly adopted what appeared to be an objective measure of student ability and used the scores to make decisions about student academic progress and classroom placement regardless of students’ level of English language comprehension.

Theory and Concept of Intelligence

In the early 20th century, intelligence was primarily viewed as the result of heredity, predetermined by genes rather than based on nurture and environment (i.e., influences from parenting styles, quality of education, support from teachers and educational leaders, poverty levels, etc.). Herrnstein and Murray (1994) highlighted racial differences in IQ scores and argued that variation in IQ scores among races was due to genetics. This idea was supported by intelligence test scores, which often placed minority students (including ELs) at a lower intellectual level than other students (Wade, 1980). Kamin (1974) and others opposed such views and suggested that heredity might account for little if any contribution to intelligence. Nevertheless, driven by policymakers’ desire to measure learning and the tests’ apparent validity, the use of intelligence tests to measure academic achievement persisted well into the 20th century.


Issues with Intelligence Testing

Numerous definitions of intelligence developed, but with little common agreement and no agreed-upon standards. Consequently, early intelligence tests often measured different components of intelligence. Formats, although primarily written, also varied and could be pictorial, verbal, or mixed. Performance on one intelligence test did not mean similar performance on a second test, nor could a single intelligence test measure a full spectrum of intelligence (Zajda, 2019).

Currently it is widely known that early versions of intelligence tests contained various forms of bias that caused minority groups (including ELs) to score lower. In particular, intelligence tests were challenging for ELs due to unfamiliarity with American culture and limited English proficiency. Familiarity with American culture greatly influenced test scores in early forms of intelligence testing. As explained by Vernon (1969), in one society students taking the Raven’s intelligence test had a mean score that yielded an average IQ of 75. When the society was further analyzed, it was found that students in the society did not work with geometric shapes; thus, the students’ scores were greatly hindered by their inexperience with the shapes. In another study, also included in Vernon’s book, signs of cultural bias are more evident. Researchers found that as students got older the performance gaps between Black and White Americans widened, in that Black students performed two months behind White students in second grade and 14 months behind in sixth grade. Vernon (1969) specifies that the IQ of Black students was measured to be 86 in five-year-old students and 65 in 13-year-old students.

By the mid-1970s the federal appeals court found several indications of cultural bias in intelligence tests used to place students into special education classrooms (Wade, 1980). The ruling in the Larry P. v. Riles case banned the use of selected intelligence tests in public education (Wade, 1980). Although the case focused on Black students, it also brought attention to the disproportionate number of minority students (including ELs) in special education and the need to create assessments that are not culturally biased.

A further issue with intelligence tests was their use for classification purposes. A low score on an intelligence test was often interpreted as a lack of educational capability and could result in a student being labeled as cognitively disabled (Mclean, 1995; Tyack, 1974). In one Arizona school district, for example, the superintendent placed Mexican American students with low intelligence scores into a special vocational program, thereby segregating them from all other students and limiting their opportunities to learn (Cremin, 1961; Tyack, 1974). For ELs, the use of the test, i.e., placement into a cognitively disabled category, was worse than the test itself.

Braden (2000) warned that intelligence test scores that claim to assess specific parts of intelligence should not be used to make larger implications about overall intelligence. This largely applies to ELs, since due to the language factors students

90 Jamal Abedi and Cecilia Sanchez

may only have access to non-verbal formats, thus educators may have drawn incomplete conclusions on overall intelligence from non-verbal formats. A further issue was that most intelligence tests attempted to measure aptitude or students’ ability to learn, whereas schools and policymakers were often focused on what students had learned in school (Linn & Gronlund, 1995). To illustrate the importance of test format, purpose, and use, the lead author and colleagues gathered state assessment scores in mathematics and English Language Arts (ELA) from fourth grade students (391 EL and 485 non-EL). The same students also took the Raven’s Coloured Progressive Matrices (CPM) test as a measure of intellectual capacity (Raven, 2003). Raven’s CPM is a geometric pictorial test considered as a language- and culture-free measure of intelligence. Raven’s was selected in place of other intelligence tests because of its lower language demand, making it easier to distinguish between lack of English proficiency and cognitive factors. Table 5.1 demonstrates the use of intelligence test scores in determining student academic skills. The results show that the two content assessment scores (mathematics and ELA) are strongly correlated (r = .778), suggesting that academic content measures share a substantial amount of common variance. However, the correlations between the two content measures with Raven’s were much weaker. For example, the correlation between the state mathematics score and Raven’s was .383, explaining about 14 percent of the mathematics assessment score variance. Similarly, the correlation between the state ELA score and Raven’s is low (r = .302). These data illustrate the major differences between an intelligence test claiming to measure ability versus content tests that measure school-taught lessons. They also suggest the misuse of intelligence tests during the early 1900s as measures of student learning. 
ELs who were attempting to learn a new language in addition to new content were undoubtedly further disadvantaged.

TABLE 5.1 Correlation between state assessment scores and Raven test score

Assessments                                   Correlation   Sig. (2-tailed)   N
State Mathematics Score & State ELA Score     .778**        .000              849
Raven Test Score & State Mathematics Score    .383**        .000              852
Raven Test Score & State ELA Score            .302**        .000              851

** Correlation is significant at the 0.01 level (2-tailed).
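The shared-variance reading of Table 5.1 follows from squaring each correlation coefficient. The short Python sketch below reproduces that arithmetic; the dictionary labels and the helper name `shared_variance` are illustrative assumptions, and only the three coefficients are taken from the table.

```python
# Correlations reported in Table 5.1 (values copied from the table).
# The labels and the helper name are illustrative, not from the original study.
TABLE_5_1 = {
    "State Mathematics & State ELA": 0.778,
    "Raven & State Mathematics": 0.383,
    "Raven & State ELA": 0.302,
}


def shared_variance(r: float) -> float:
    """Return the proportion of variance two measures share (r squared)."""
    return r ** 2


if __name__ == "__main__":
    for pair, r in TABLE_5_1.items():
        # Format r-squared as a percentage with one decimal place.
        print(f"{pair}: r = {r:.3f}, shared variance = {shared_variance(r):.1%}")
```

Squaring .383 gives roughly 0.147, in line with the figure of about 14 percent quoted for the mathematics score, whereas the two content measures share roughly 61 percent of their variance.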


According to Lohman et al. (2008), ELs continue to be outperformed by non-EL students, despite the availability of non-verbal intelligence tests. While many factors may contribute to the performance gap in intelligence testing between ELs and non-ELs, the researchers in that study controlled for students’ age, socioeconomic status (SES), and other demographic factors. Braden (2000) states that although non-verbal intelligence tests greatly reduce language demand, they still may cause students to utilize “linguistic or knowledge-mediated strategies to comprehend, process, and respond to test items.” Thus, there may be underlying language demands in non-verbal intelligence tests that explain the performance gap between the two groups. However, when we compared the scores of EL students to those of non-ELs, we found that both groups scored nearly the same on Raven’s CPM (see Table 5.2). There are many potential reasons for the differences between the data we collected and the study by Lohman et al. (2008), including study design and demographic features.
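To give a sense of how small the difference in Table 5.2 is, the sketch below computes a standardized mean difference (Cohen's d with a pooled standard deviation) from the reported summary statistics. This effect-size calculation is not part of the original analysis; it is added here only as an illustration, and the function and variable names are assumptions.

```python
import math

# Summary statistics copied from Table 5.2 (Raven's CPM scores).
NON_EL = {"mean": 29.18, "n": 485, "sd": 5.697}
EL = {"mean": 29.23, "n": 391, "sd": 5.139}


def pooled_sd(a: dict, b: dict) -> float:
    """Pooled standard deviation of two independent groups."""
    num = (a["n"] - 1) * a["sd"] ** 2 + (b["n"] - 1) * b["sd"] ** 2
    return math.sqrt(num / (a["n"] + b["n"] - 2))


def cohens_d(a: dict, b: dict) -> float:
    """Standardized mean difference (a minus b) in pooled-SD units."""
    return (a["mean"] - b["mean"]) / pooled_sd(a, b)


def combined_mean(a: dict, b: dict) -> float:
    """Sample-size-weighted mean across both groups."""
    return (a["mean"] * a["n"] + b["mean"] * b["n"]) / (a["n"] + b["n"])


if __name__ == "__main__":
    print(f"Combined mean: {combined_mean(NON_EL, EL):.2f}")
    print(f"Cohen's d (EL minus non-EL): {cohens_d(EL, NON_EL):.3f}")
```

Under these assumptions the standardized difference is close to zero, and the weighted mean agrees with the total reported in the table, which is consistent with the statement that both groups scored nearly the same.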

Transition to Achievement Testing

United States achievement testing dates back to the 19th century, when a surge of immigrants led to a transition from student oral exams to written tests with the goal of reducing time spent on test administration (U.S. Congress, 1992, p. 103). However, statewide assessments, such as the Iowa Test of Basic Skills (ITBS), began several decades later, in 1929 (Hutt & Schneider, 2017). ITBS greatly influenced educators’ preference for achievement tests rather than intelligence tests by measuring students’ learning in school-taught subjects. The administration and scoring of ITBS were standardized and norm-referenced, allowing educators and policymakers to compare students and schools on a broad basis. ITBS and similar achievement tests could be used to hold schools accountable for student learning. By the 1940s, achievement testing became even more widely accepted because it proved to assess various dimensions of academic ability (U.S. Congress, 1992, p. 127). As with intelligence tests, early achievement tests were not modified for ELs, nor were accommodations provided.

Issues with Achievement Testing

The achievement tests still led to student misclassification (Hutt & Schneider, 2017). The misclassification could be attributed to faulty norming processes, since they did not necessarily include ELs (Abedi & Gándara, 2006). In fact, ELs were generally not recognized as a subgroup until 1994 (LaCelle-Peterson & Rivera, 1994). Thus, many ELs were once more incorrectly placed into disability classrooms. A further problem was a lack of uniformity between content taught in schools and content tested on exams (U.S. Congress, 1992, p. 165), especially as new achievement tests were developed by different test publishers. Achievement tests also were used for unintended purposes such as the ranking of schools (Hutt & Schneider, 2017). The U.S. Congress (1992, p. 165) stated, “tests that are going to be used for selection should be designed and validated for that purpose.” Since the goal of early achievement tests was solely to measure what students learned in school, using the same test to make other inferences or make other educational decisions, such as disability classroom placement, required more test validation.

TABLE 5.2 Raven CPM mean scores for EL and non-ELs

EL Group   Mean    N     Std. Deviation
Non-EL     29.18   485   5.697
EL         29.23   391   5.139
Total      29.20   876   5.452

Federal Acts and Legislation

Beginning with Brown v. Board of Education in 1954, which ruled out classroom segregation based on race, the federal government began to play a key role in emphasizing equality in education for students with various backgrounds (Cramer et al., 2018). Attention eventually focused on the instruction, progress monitoring, and assessment of EL students. Since then, federal legislation has established criteria for assessing students’ academic performance and increasing schools’ accountability. We discuss the most relevant federal laws here.

Elementary and Secondary Education Act (ESEA), 1965

Prior to 1965, federal educational language policies regarding EL students were neither clear nor objective (Wright, 2005). The assessment of bilingual students gained attention in the 1960s with the Elementary and Secondary Education Act (Zascavage, 2010). Within the 1967 reauthorization of ESEA, Title VII (the Bilingual Education Act) focused on meeting the needs of EL students (Stewner-Manzanares, 1988). The main goal was, and continues to be, the inclusion of EL students in English-proficient classrooms and assessments (U.S. Department of Education OELA, 2017). Importantly, while the Bilingual Education Act contained the word “bilingual,” it by no means enforced bilingualism in education (García et al., 2008). Instead, ESEA allowed state education leaders to decide whether or not to include bilingual programs in their states. Wright (2005, p. 2) wrote:

    In most cases, schools ignored the needs of language minority students and simply placed them in English immersion or “sink-or-swim” programs. In the wake of the Civil Rights Movement culminating in the passage of the 1964 Civil Rights Act (Title VI), and the War on Poverty, educators and policy makers became more sensitive to the needs of their rapidly growing language minority student population.

Nevertheless, ESEA with its Bilingual Education Act, which included important funding for ELs in public schools, was a milestone because it brought attention to both the education and assessment of EL students.

Equal Education Opportunity Act, 1974

According to Sutton et al. (2012), the Equal Education Opportunity Act of 1974 demanded equal education for all students, including ELs. The act provided a clearer set of operational definitions and educational principles, which increased federal funding for English as a Second Language (ESL) programs to a broader group of EL students (Stewner-Manzanares, 1988). This act focused on teaching students English as quickly as possible rather than improving both the students’ first language and English language skills (Wright, 2005).

A monumental legal case was Lau v. Nichols in 1974. A school district in San Francisco, California placed Chinese ELs into classrooms without helping them attain the necessary English skills to render learning meaningful (Stewner-Manzanares, 1988). A U.S. Supreme Court ruling against the school district resulted in regulations requiring schools to provide EL students with ESL assistance. The “Lau Remedies” (i.e., the outcomes of the Lau case) provided a clear description of EL identification, curriculum, and level of proficiency needed in order for ELs to participate in English-only classes, as well as professional standards for EL teachers (Wright, 2005). The Lau Remedies also distinguished between bilingual education for students in elementary school and ESL classes for those in junior high or high school (García et al., 2008). While never officially published as regulations (Stewner-Manzanares, 1988), the Lau Remedies were an important step toward equal education opportunity for ELs. Although the remedies did not specifically address EL assessment, their support of the Equal Opportunity Act and ESL programs would soon promote “continuous assessment to determine if students’ English language deficits are being addressed” (Sutton et al., 2012).

Reauthorizations: 1978–1994

A series of ESEA reauthorizations from the late 1970s to the 1990s targeted specific issues facing ELs in education. In the 1978 Education Amendments, limited English proficiency was defined as difficulties in reading, writing, and speaking at a proficient level of English (Stewner-Manzanares, 1988). The 1984 Bilingual Education Act required the federal government to provide equal educational opportunities for ELs and provide services to enable ELs to participate in English-only instruction and assessments (Stewner-Manzanares, 1988).


Simultaneously, the 1984 Act promoted the use of English-only programs to increase EL English proficiency levels. The 1988 Bilingual Education Act reauthorization increased federal funding for classes to support ELs but limited the number of years that ELs could participate in ESL programs (Cubillos, 1988). The 1994 Bilingual Education Act reauthorization expanded the definition of ELs to include those from migrant backgrounds; thus, more students were included in ESL programs (Wright, 2005). Notably, the 1994 Act represented a major policy shift, encouraging EL students to maintain their native language while developing English skills (Wiese & Garcia, 2001). Amendments to the Bilingual Act required that schools administer an annual English proficiency exam to monitor ELs’ progress and determine EL status (Wiese & Garcia, 2001). ELs could not exit the program until proficiency in speaking, listening, reading, and writing on an ELP assessment was reached (U.S. Department of Education, 2017). Overall, ESEA revisions during the late 20th century increased the inclusivity and assessment of ELs in ESL programs with a goal of improving EL language proficiency.

At the time, EL inclusion in state achievement tests was limited to ELs who had been in the United States for at least three years (Wiese & Garcia, 2001). The purpose of including ELs in statewide assessments was so that schools would be held accountable for ELs’ progress in content areas like mathematics and ELA, not just ELP. However, Wiese and Garcia (2001) also point out that ELs were not given accommodations to make these tests more accessible; consequently, an issue with including ELs in the assessments is that statewide assessments in the United States measure not only the topic of the test but also the English skills needed to take it. This is an issue that became apparent in intelligence testing and continues to resurface.

No Child Left Behind (NCLB), 2002

No Child Left Behind (2002) was a major ESEA reauthorization mandating annual state mathematics and ELA testing of nearly all students, including ELs (De Cohen & Clewell, 2007). ELs in the United States for less than one year could be exempted from the state assessments, thereby tightening the previous three-year exemption (Willner et al., 2009). A key purpose of the act was to increase state accountability for the progress of ELs during their academic careers (Calderon, 2015). Title III of NCLB required that schools help ELs meet content standards covered in the state assessments and kept schools accountable by tracking ELs’ progress (Abedi, 2004; Millard, 2015). In particular, NCLB included three annual measurable achievement objectives (AMAOs): AMAO 1, which measured changes in students’ English language proficiency assessment scores; AMAO 2, which compared English language proficiency levels to the state’s required level in order to consider a student proficient in English; and AMAO 3, which compared actual EL progress to the state’s overall yearly progress (Anderson & Dufford-Meléndez, 2011). However, Anderson and Dufford-Meléndez (2011) found substantial variation in assessments and calculations used to determine whether or not AMAOs were met across six states, thus making comparisons across the nation challenging. The AMAOs also put a considerable amount of strain on schools to prove that their students were making sufficient progress, whereby four consecutive years of failed attempts to meet the three AMAOs resulted in rigorous interventions with the schools (Tanenbaum & Anderson, 2010). As Tanenbaum and Anderson (2010) explain, the interventions included action plan development, instruction modifications, and staff replacements, and could even result in a loss of funding.

Overall, the implementation of NCLB had mixed effects on the growing EL population and schools. For example, NCLB increased attention on EL students’ progress, set higher standards of achievement, and “increased the alignment of curriculum, instruction, professional development, and testing” (De Cohen & Clewell, 2007). However, Gándara and Baca (2008) report that NCLB resulted in problematic consequences for schools with high numbers of ELs, particularly because these schools tended to report lower levels of progress. Christensen (2010) found that ELs often performed “below proficient” on state assessments mandated by NCLB. Researchers posited that state assessments did not take into account ELs’ limited English proficiency, thereby causing performance gaps between ELs and non-ELs (Christensen, 2010; Mojica, 2013). Christensen’s (2010) study found that peer reviewers in the federal evaluation of states’ standards and assessments often reported that state assessments were not fully accessible to ELs.

Prior to the implementation of NCLB, several states had passed propositions which mandated that education in the United States only be in English (Wright, 2005). Specifically, in 1997 California voters passed “English for the Children Initiative” which limited classrooms to English-only instruction (Wiese & Garcia, 2001). NCLB updated the title of the “Bilingual Act” to “Language Instruction for Limited English Proficient and Immigrant Students,” which illustrates the government’s move away from the use of bilingualism in public education (García et al., 2008). This resulted in less support for bilingual education and dual-immersion classrooms, and in a lack of variability in the language of instruction and assessments. García et al. (2008) stated that by removing bilingual education from schools NCLB denied access to tools that could have decreased the performance gap between EL and non-EL students.

Every Student Succeeds Act, 2015

Following the NCLB era, the Every Student Succeeds Act (ESSA) of 2015 mandated that all schools implement standards for ELs to achieve English-language proficiency. Previously, Title III of NCLB only mandated that schools receiving government funding for their EL programs be held accountable (Achieve & UnidosUS, 2018). With ESSA, states had more power to determine how to improve the performance of schools with low proficiency scores and help more ELs reach English proficiency (Rentner et al., 2019). Several state consortia developed ELP assessments. For example, as of 2018, 36 states were using the ELP assessment developed by the WIDA Consortium, seven different states were using ELPA21, two states were using LAS Links, and the remaining states developed and used their own ELP assessments (Achieve & UnidosUS, 2018). As with NCLB, states continue to utilize yearly ELP assessments to track student progress and hold schools accountable for EL student achievement. Additionally, ELP assessment scores continue to determine EL status, which regulates whether or not the student receives EL services.

ESSA further increased the number of students required to participate in statewide testing by including “important protections to ensure that all students are tested, offered appropriate accommodations when needed, and held to the same high standards” (U.S. Department of Education, 2017). As was the case with NCLB, recently arrived ELs were not required to take state tests during their first year in the country. However, schools now have the option to allow recently arrived ELs to participate in state testing and not report the scores to the state, and instead use them to track student progress (UnidosUS, 2018). According to the Center on Standards and Assessments Implementation (CSAI, 2019), ESSA allows states to decide when to provide assessments in languages other than English; however, these assessments must meet federal guidelines to ensure quality. The decision to include state assessments in non-English languages is determined by the percentage of students that speak the same native language (UnidosUS, 2018). For example, California decided to include Spanish assessments because over 15 percent of their K–12 students speak Spanish (UnidosUS, 2018). In essence, ESSA is meant to provide more support for ELs in state assessments.

Content-based Assessments

In 1989, the National Council of Teachers of Mathematics published the first national mathematics standards (Romberg, 1993). These were soon followed by standards in nearly all content areas. The content standards emphasized that all students, including ELs, must be able to master the standards. In 1989, President Bush and all 50 state governors established a set of national education goals called America 2000, later implemented by the Clinton administration as Goals 2000 (Klein, 2014). Naturally, the standards required assessments to measure state progress. Thus began a rapid increase of content-based testing that continued throughout the development of the Common Core State Standards (CCSS) in 2009 (Friedberg et al., 2018). What became increasingly clear, however, was that neither new standards nor new assessments diminished the performance gap between ELs and non-ELs. For example, the 2019 NAEP report states that eighth grade EL student mathematics performance was 42 points lower than that of non-EL students (Nation’s Report Card, 2020).

Issues with content-based assessments

Researchers have noted that ELs systematically underperform, compared to their non-EL peers, on nearly all types of assessments (Abedi & Gándara, 2006; Mojica, 2013). A potential explanation for the performance gap is cultural bias. The lingering effects of cultural bias in assessments have been apparent since intelligence testing in the early 20th century, as mentioned earlier. However, despite the efforts taken to reduce cultural bias, it may not be possible to eliminate all forms of bias from assessments (Kruse, 2016). As summarized by Kim and Zabelina (2015), standardized tests are “norm based on the knowledge and values of the majority groups, which can create bias against minority groups,” which leads to an inadequate representation of ELs in the norming group and calls into question the validity and reliability of the tests. Specifically, language background affects test interpretation, word meaning, and sentence structure (for specific examples, see Kim & Zabelina, 2015). Language factors are further influenced by culture including “values, beliefs, experiences, communication patterns, teaching and learning styles, and epistemologies of their cultures and societies” (Kim & Zabelina, 2015). Kruse (2016) mentioned that an unfair effect of bias arises when assessments determine which opportunities become available for students; for example, when a biased assessment leads to the placement of more ELs than non-ELs in special education. Therefore, test results should be interpreted while considering students’ culture.

A notable trend observed in performance gaps also involves students with low SES. Hanushek et al. (2019) report that the relationship between SES and academic achievement has been observed in research since the 1930s and continues to impact students today. Allee-Herndon and Roberts (2017) explain that students living in poverty not only enter school with lower academic skills than their peers, but the performance gap widens as students reach high school. Low SES impacts students in many ways, such as parent education level, accessibility to quality schools, and familial stress (Hanushek et al., 2019). Low SES specifically affects ELs because Hispanics are among the most disadvantaged communities in the United States (Moshayedi, 2018) and, as mentioned earlier, Spanish-speaking ELs make up the biggest EL group. Therefore, low SES may explain the lasting performance gap between EL and non-EL students despite all the changes made to assessments.

Another explanation for the performance gap between ELs and their peers may be the English language demand of the tests (Abedi & Gándara, 2006; Menken, 2010). EL students may have the content knowledge, but their low English language comprehension may lead to incorrect answers (Abedi, 2004; Shin, 2018). Afitska and Heaton (2019) studied science-based test scores for students with varying levels of English proficiency (ELs and English-native speakers). Among the 485 students sampled, the researchers found that ELs “were particularly disadvantaged when responses required active language production and/or when assessed on specific scientific vocabulary.” Unnecessary linguistic complexity also affects the language demand on assessments. Research shows that reducing test language complexity improves EL performance and reduces the performance gap between ELs and non-ELs (Abedi & Lord, 2001). Thus, recent research reiterates the importance of reducing construct-irrelevant factors such as linguistic complexities in testing in order to accurately assess ELs.

Accommodation Influence

The introduction of accommodations was another major milestone in the history of EL assessment practices. Accommodations were first introduced in the field of special education in 1975 under the Individuals with Disabilities Education Act (Willner et al., 2009). Many students with disabilities (SWDs) needed specific forms of assistance in classroom settings and assessments in order to “level the playing field.” Early test accommodations for SWDs included changes to time, response format, setting, additional materials or equipment, and alternative forms of displaying the tests (Lane & Leventhal, 2015). For example, deaf and hard of hearing students needed hearing aids to offset the effect of their inability to hear at the standard level. Similarly, blind or visually impaired students were accommodated with Braille versions of tests. These accommodations were used to increase equity in classroom instruction and assessments.

With a growing desire to include ELs in large-scale assessments, accommodations for ELs gained more attention in the mid-1990s (Willner et al., 2008). Li and Suen (2012) indicated that accommodations used for SWDs were often used for ELs without ensuring that the accommodations were effective and valid for this group of students; therefore, some of these accommodations may not have been relevant for EL students. For example, small group settings, when adapted for EL students, did not show significant EL improvement (Abedi & Ewers, 2013). In fact, Willner et al. (2008) explained that any accommodations related to timing, scheduling, and test setting (with the exception of extra time, explained later) do not directly affect the linguistic needs of ELs and thus do not have a significant impact on EL test performance. Early EL accommodations included separate test settings, additional time, breaks between testing sections, and glossaries (Gándara & Baca, 2008). Given that three out of the four EL accommodations were related to timing and setting, it is clear that most EL accommodations were simply borrowed from SWDs. Consequently, researchers have focused on accommodations that prove to be effective, valid, relevant, and feasible, specifically developed for ELs (Abedi & Ewers, 2013; Abedi, 2016). The National Center for Education Statistics (NCES) has conducted a broad range of studies on the use of EL accommodations and their effects on the NAEP (Stancavage et al., 1996; Goldstein, 1997). Abedi (2016) indicates that accommodations should make the assessment more accessible to ELs (effectiveness) and not alter the focal construct (validity). In a meta-analysis, Li and Suen (2012) found that some accommodations could increase the scores of EL students without affecting the scores of non-ELs (although the ELs’ score improvements were minor).

Accommodation assignment

As mentioned earlier, No Child Left Behind (2002) not only significantly expanded state testing but also mandated that ELs be included as a subgroup. As a result, accommodations for assessments grew in importance. Accommodations permitted under NCLB included “extra time, small group administration, flexible scheduling, simplified instructions, audio-taped instructions in the native language or English, or providing additional clarifying information” (Wright, 2005). Schools were allowed to decide the use and type of accommodations, defined generally as “reasonable.” With such broadly defined guidance, states and schools did not necessarily provide ELs with accommodations (Gándara & Baca, 2008). ESSA has similar guidance, and some states have correspondingly lagged in compliance, especially in providing assessments in students’ native language (Mitchell, 2018).

An unresolved but critical issue is which accommodation or combination of accommodations is most appropriate for each EL. In a study of 35 fifth grade students conducted by De Backer et al. (2019), the researchers found that students believed that accommodations would lead to more inclusion and greater learning. However, some pupils believed that students should receive accommodations if their test scores classified them as an EL, while others believed that such decisions should involve both test scores and the teacher’s input (De Backer et al., 2019). Ultimately the study’s researchers agreed that accommodations should be a collaborative decision involving the classroom teacher, ESL teacher, and others who play an important role in the education of the student (Willner et al., 2009). Similar questions about appropriate accommodation assignment exist across states and there is growing pressure to implement guidelines on how to best make assessments accessible for ELs (Thurlow & Kopriva, 2015).

Accommodation techniques

EL accommodation strategies have evolved since the early 1990s, but generally include linguistic modification, bilingual assessment, read-aloud, extended or modified testing time, glossaries, and, most recently, computer-based assessment. We discuss these accommodations in turn.


Linguistic modification

Linguistic modification, or lowering linguistic complexity, has been one of the most promising accommodations for ELs (Abedi & Ewers, 2013; Thurlow & Kopriva, 2015; Willner et al., 2009). Unnecessary linguistic complexity can jeopardize an assessment’s validity because it hinders the ability of EL students to accurately respond to the content of the questions (Abedi, 2006). Abedi and Lord’s study (2001) found that ELs scored significantly higher on linguistically modified mathematics assessments. Similarly, a study that assessed 3,000 students found that students “struggling with English” (i.e., ELs) performed significantly better on a linguistically modified mathematics test than their EL peers who took the nonmodified version of the test (U.S. Department of Education et al., 2012).

Examples of linguistic modification

Abedi (2015) lists the 14 linguistic features that should be modified to reduce linguistic demands on EL assessments: (1) word frequency/familiarity, (2) word length, (3) sentence length, (4) passive voice constructs, (5) long noun phrases, (6) long question phrases, (7) comparative structures, (8) prepositional phrases, (9) sentence and discourse structure, (10) subordinate clauses, (11) conditional clauses, (12) relative clauses, (13) concrete versus abstract or impersonal presentations, and (14) negation. Below are some linguistic modification examples, as provided by Abedi (2015):

(1) Word frequency/familiarity: Potentially unfamiliar, low-frequency lexical items were replaced with more familiar, higher frequency lexical items.
    Original: A certain reference file contains approximately six billion facts.
    Revision: Mack’s company sold six billion pencils.

(11) Conditional clauses: Some conditional “if” clauses were replaced with separate sentences. In some instances, the order of the “if” clause and the main clause was reversed.
    Original: If x represents the number of newspapers that Lee delivers each day…
    Revision: Lee delivers x newspapers each day.

(12) Relative clauses: Some relative clauses were removed or recast.
    Original: A report that contains 64 sheets of paper…
    Revision: He needs 64 sheets of paper for each report.

Although linguistic modification seems promising, research on its effectiveness has produced mixed results. A meta-analysis conducted by Li and Suen (2012) found no statistically significant effect of linguistic modification. Abedi and Ewers (2013) also analyzed various EL accommodation techniques and found mixed support for linguistic modification. In a recent randomized study of approximately 3,000 students, Abedi et al. (2020) found that linguistic modification improved EL performance on a computerized mathematics assessment; however, the improvement was not statistically significant. The researchers believe that the strong emphasis on language modification in recent years has produced assessments, including the one they developed for their study, that are already linguistically simplified to the point that further modification will have limited effects.

Bilingual/translated assessments

Research has indicated the benefits of bilingual programs, and consequently translated assessments have been tested. For example, in nine out of 13 studies reviewed, Slavin and Cheung (2005) found that bilingual education proved to be more effective than English-only programs. Additionally, as mentioned by García et al. (2008), the bilingual students not only developed their English skills to a higher level of proficiency than the English-only students, but also increased their language skills in their first language. Given such findings, translated and bilingual versions of assessments have recently been implemented as forms of accommodation. Researchers have inferred that translated versions of assessments may benefit ELs who have been taught the content material in their first language and who demonstrate greater proficiency in their first language than in English (Abedi & Ewers, 2013; Li & Suen, 2012). However, since ELs in the United States are mainly taught content-based material in English, Spanish-translated versions are not as effective (Li & Suen, 2012). Another issue with translation as an accommodation is that it is difficult to ensure that the original test and the translated version measure the same content (Li & Suen, 2012). Turkan and Oliveri (2014) note that translated assessments may have variations in dialect and cultural knowledge, and that ensuring assessments have been properly translated is time-consuming and costly.

Glossaries

Glossary accommodations provide EL test takers with dictionary-based definitions of carefully selected word(s) for each assessment item. Definitions are not content-related; thus, the validity of the test is not compromised. However, without extra time as a second accommodation, glossary accommodations may result in lower scores (Li & Suen, 2012). One reason for this finding could be that the additional information from the glossary slows students down to the point that they cannot complete the test in time. Also, according to Willner et al. (2009), glossaries and other language-based accommodations have been shown to affect EL students differently depending on their level of English proficiency, such that glossaries may be best suited for ELs at the middle to high levels of English proficiency. Finally, several research studies indicate little or no positive effect from glossary accommodations (Abedi et al., 2020; Wolf et al., 2012).

Read-aloud

Read-aloud accommodations, also known as “text-to-speech,” provide students with an audio read-aloud of the assessment. Geva and Zadeh (2006) state that a majority of ELs have oral skills that exceed their literacy skills in English. This difference is apparent in ELP assessments, since ELs tend to score higher in the oral (speaking and listening) sections than in the written (reading and writing) sections (Hakuta et al., 2000). Considering that EL students have a higher level of proficiency in receptive-oral skills than in literacy skills, read-aloud accommodations appear promising. Wolf et al. (2009) investigated the effectiveness of the read-aloud accommodation on a mathematics test for EL students from two different states; ultimately the results of the study were mixed. ELs from one state improved significantly while ELs from the other state did not. The researchers believe that the differences between states might be explained by exposure to accommodations, because students from one state had more experience and interaction with read-aloud accommodations than students from the other before participating in the study. Although additional research is necessary, this finding suggests that accommodations may be more effective when students have sufficient experience in using the specific accommodation.

Extended Time

As mentioned earlier, the extended time accommodation for ELs was directly borrowed from accommodations for SWDs (Li & Suen, 2012). Researchers have found that extra time provides statistically significant results for ELs in some cases (Abedi & Ewers, 2013). A meta-analysis conducted by Li and Suen (2012) found that the extended time accommodation appeared more promising than linguistic modifications and glossaries, although the results were not statistically significant and extended time produced higher scores for both ELs and non-ELs. Thus, an issue with the extended time accommodation is that it can make assessments more accessible to all students, not only to ELs.

Computer-based accommodations

The transition to computerized assessments allows for different ways to provide accommodations to students. In a recent study conducted by Abedi et al. (2020), different types of accommodations were assigned to students using a computerized system. The accommodation assignment was based on a student background questionnaire, which inquired about the students’ language, and a short language proficiency test. Based on the responses, the system determined whether the student would receive an accommodation in English or in Spanish. Students were randomly assigned to one of the accommodations (as determined by the language variables) or to the original version of the test. Such a system could easily be applied to large-scale assessments and factor in broad variables such as state or local assessment scores, teacher recommendations, ELP scores, and other variables that are likely to determine the accommodation(s) best suited for each student. Computer-based assessments can also allow students to select their own accommodation(s). Researchers can then analyze a broad range of factors, including which accommodation(s) students selected and the amount of time spent on each test item. In a study of a computer-based assessment with two self-select accommodations, Roohr and Sireci (2017) found that ELs used accommodations more than non-ELs and that the use of the accommodations decreased as the assessment proceeded in both student groups. The researchers also found that many students did not use any accommodations. We believe that computer-based assessments offer important opportunities for accommodation delivery and preference.
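The assignment logic described for the computerized system can be illustrated with a short sketch. This is not the system used by Abedi et al. (2020); it only assumes the two steps reported in the text (a language decision based on the questionnaire and the proficiency test, followed by random assignment to an accommodation or the original test). The accommodation names, the decision rule, and the 0.5 cutoff are hypothetical and chosen purely for illustration.

```python
import random

# Hypothetical accommodation pools; the names are illustrative only.
ENGLISH_ACCOMMODATIONS = ["english_glossary", "linguistic_modification", "read_aloud"]
SPANISH_ACCOMMODATIONS = ["spanish_glossary", "spanish_translation"]
ORIGINAL = "original_version"


def accommodation_language(home_language, english_score, cutoff=0.5):
    """Choose the accommodation language from the two language variables.

    Assumed rule (not from the source): Spanish-speaking students who score
    below the cutoff on the short English proficiency test receive
    Spanish-language accommodations; everyone else receives English ones.
    """
    if home_language.lower() == "spanish" and english_score < cutoff:
        return "spanish"
    return "english"


def assign_condition(home_language, english_score, rng):
    """Randomly assign a student to one eligible accommodation or the original test."""
    language = accommodation_language(home_language, english_score)
    pool = SPANISH_ACCOMMODATIONS if language == "spanish" else ENGLISH_ACCOMMODATIONS
    # The unaccommodated original version serves as the comparison condition.
    return rng.choice(pool + [ORIGINAL])


if __name__ == "__main__":
    rng = random.Random(2020)  # fixed seed so the illustration is reproducible
    students = [("Spanish", 0.3), ("Spanish", 0.8), ("Vietnamese", 0.4)]
    for home_language, score in students:
        print(home_language, score, assign_condition(home_language, score, rng))
```

A large-scale version of such a rule could replace the single cutoff with the broader variables mentioned above (ELP scores, teacher recommendations, prior assessment scores) without changing the random-assignment step.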

EL Students with Learning Disabilities (LDs)

The dual classification of ELs with LDs has caused much debate since the start of wide-scale testing. As mentioned earlier, intelligence testing in the early 1900s often placed ELs into disability classrooms. In 1970, Diana v. California State Board of Education showed that EL students scoring low on English-based assessments were still being categorized with disabilities (Mclean, 1995). Recent studies and reports have found both over- and under-classification of ELs into disability categories across different states (Data Accountability Center, 2015; U.S. Department of Education & NCES, 2015a, 2015b). For example, states with the largest number of ELs have a much higher percentage of EL students labeled as EL/LD when compared to non-EL/LD (Carnock & Silva, 2019). California and New Mexico, for example, tend to show over-identification, while states like New Jersey appear to under-identify ELs in disability categories (Data Accountability Center, 2015; U.S. Department of Education & NCES, 2015a, 2015b). Research shows that most ELs with LDs take the same statewide and ELP assessments as their peers, while alternative forms of the tests are reserved for students with the most significant cognitive disabilities (Albus et al., 2019; Gholson & Guzman-Orth, 2019). Since a majority of LDs are not classified as significant disabilities, there is no alternative version of statewide assessments or ELP assessments for EL students with LDs. Although inclusion in the tests allows schools to be held accountable and to track the progress of ELs with disabilities in school, ELs with LDs score much lower than other subgroups (Abedi & Gándara, 2006). In fact, dual-identified students (EL/disability including EL/LRD) perform lower than their non-disabled and non-EL peers (Liu et al., 2015). Park and Thomas (2012) mention that ELs with disabilities may not benefit from statewide testing because it “often includes unreliable and invalid measures of academic performance for students with special needs.” To address these issues, ELs with LDs should have access to appropriate accommodations, and educators should not judge the performance of ELs with disabilities against norms for non-ELs without disabilities (Park & Thomas, 2012). As of 2015, ESSA requires schools to report on ELP and academic progress on statewide tests for students with a combined EL and disability classification (Albus et al., 2019; Liu et al., 2018). Previously, these data were provided separately for ELs and students with LDs (Abedi & Gándara, 2006). This change increases schools’ accountability for improving the ELP development and academic achievement of ELs with LDs. However, the change makes it crucial to accurately distinguish ELs with LDs from ELs with low English-language skills. Improper identification could hinder ELs who are given the disability label or prevent students from receiving the services or assessment accommodations they need for learning.

Summary

The history of large-scale student assessment began with intelligence tests (Mclean, 1995). Despite non-verbal formats, ELs performed lower than non-ELs on intelligence tests, often leading to classroom segregation based on ethnicity (Cremin, 1961; Tyack, 1974). By the 1930s, educational experts moved to achievement testing because intelligence tests often did not measure what students were taught in schools (U.S. Congress, 1992, p. 122). Even with this major assessment change, ELs continued to be overrepresented in disability classrooms. Furthermore, there were questions regarding content representation and interpretation of the test results (U.S. Congress, 1992, p. 165). To make assessments fair, a series of federal laws was initiated in the 1960s to improve EL learning and assessment. By the 1990s, educational leaders had created content standards and methodologies for establishing the alignment of assessment items with content standards. While these efforts improved the quality and objectivity of content-based assessments, the gap between ELs and their native English-speaking peers remained (Abedi & Gándara, 2006). Among probable factors for low EL performance, unnecessary linguistic complexity seemed most important. Multiple studies suggested that ELs’ low level of English proficiency might have prevented them from understanding instructional materials and assessment questions (Abedi, 2004; Shin, 2018). Subsequently, accommodations were adapted from SWDs to EL students. However, NAEP and other large-scale assessments show that accommodations have yet to narrow the EL and non-EL achievement gaps (Li & Suen, 2012).

The persistent EL achievement gap suggests that we look outside our educational systems for potential solutions. For example, what role might formative assessment play in EL achievement? Are there lingering effects of cultural bias that continue to contribute to the performance gap between ELs and non-ELs? What accommodation strategies make assessments most accessible to ELs and ELs with LDs? How can technology be used to create better assessments and accommodations? Do SES factors have a far greater influence on ELs’ learning than language does? Answering these and other questions pertaining to EL assessment can lead to a historic crossroads in the process of achieving an equitable educational system.

Notes

1 The authors wish to thank Ron Dietel for his thoughtful review of an earlier draft of this chapter.
2 In this chapter, we use the term “English learners (EL)” instead of the term “bilingual” because many bilingual children are as proficient in English as native-English children.

References

Abedi, J. (2004). The No Child Left Behind Act and English language learners: Assessment and accountability issues. Educational Researcher, 33(1), 4–14.
Abedi, J. (2006). Psychometric issues in the ELL assessment and special education eligibility. Teachers College Record, 108(11), 2282–2303.
Abedi, J. (2015). Language issues in item-development. In S. Lane, M. S. Raymond, & T. M. Haladyna (Eds.), Handbook of test development (2nd ed.). Routledge.
Abedi, J. (2016). Utilizing accommodations in assessment. In E. Shohamy, L. Or, & S. May (Eds.), Language testing and assessment: Encyclopedia of language and education (3rd ed.). Springer.
Abedi, J. & Ewers, N. (2013). Accommodations for English language learners and students with disabilities: A research-based decision algorithm. Smarter Balanced Assessment Consortium. Retrieved from https://portal.smarterbalanced.org/library/en/accommodations-forenglish-language-learners-and-students-with-disabilities-a-research-based-decision-algorithm.pdf.
Abedi, J. & Gándara, P. (2006). Performance of English language learners as a subgroup in large-scale assessment: Interaction and policy. Educational Measurement: Issues and Practice, 25(4), 36–46.
Abedi, J. & Lord, C. (2001). The language factor in mathematics tests. Applied Measurement in Education, 14(3), 219–234.
Abedi, J., Zhang, Y., Rowe, S. E., & Lee, H. (2020). Examining effectiveness and validity of accommodations for English language learners in mathematics: An evidence-based computer accommodation decision system. Educational Measurement: Issues and Practice, 39(4), 41–52. https://doi.org/10.1111/emip.12328.
Achieve & UnidosUS. (2018). How are states including English language proficiency in ESSA plans? Retrieved from https://www.achieve.org/files/Achieve_UnidosUS_ESSA%20ELP%20Indicator_1.pdf.
Afitska, O. & Heaton, T. J. (2019). Mitigating the effect of language in the assessment of science: A study of English-language learners in primary classrooms in the United Kingdom. Science Education, 103(6), 1396–1422.

Albus, D. A., Liu, K. K., Thurlow, M. L., & Lazarus, S. S. (2019). 2016–17 publicly reported assessment results for students with disabilities and ELs with disabilities (NCEO Report 411). University of Minnesota, National Center on Educational Outcomes.
Allee-Herndon, K. A. & Roberts, S. K. (2017). Poverty, self-regulation and executive function, and learning in K-2 classrooms: A systematic literature review of current empirical research. Journal of Research in Childhood Education, 33(3), 345–362.
Anderson, K. S. & Dufford-Meléndez, K. (2011). Title III accountability policies and outcomes for K–12: annual measurable achievement objectives for English language learner students in Southeast Region states. (Issues & Answers Report, REL2011–No. 105). Washington, DC: U.S. Department of Education, Institute of Education Sciences, National Center for Education Evaluation and Regional Assistance, Regional Educational Laboratory Southeast. Retrieved from http://ies.ed.gov/ncee/edlabs.
Braden, J. P. (2000). Editor’s introduction: Perspectives on the nonverbal assessment of intelligence. Journal of Psychoeducational Assessment, 18, 204–210.
Calderon, B. (2015). English language learners in the Elementary and Secondary Education Act. National Council of La Raza. Retrieved from http://publications.unidosus.org/bitstream/handle/123456789/1418/ell_esea.pdf.
Carnock, J. T. & Silva, E. (2019). English learners with disabilities: Shining a light on dual-identified students. New America, Educational Policy. Retrieved from https://d1y8sb8igg2f8e.cloudfront.net/documents/English_Learners_with_Disabilities_Shining_a_Light_on_Dual-Identified_Students_TaYvpjn.pdf.
Center on Standards and Assessments Implementation (CSAI) (2019). Assessing English learners under the Elementary and Secondary Education Act, as amended by the Every Student Succeeds Act: CSAI update. WestEd and Center for Research on Evaluation Standards and Student Testing (CRESST).
Christensen, L. L. (2010). Addressing the inclusion of English language learners in the educational accountability system: lessons learned from peer review. [Doctoral dissertation, University of Minnesota]. ProQuest Dissertations Publishing.
Cramer, E., Little, M. E., & McHatton, P. A. (2018). Equity, equality, and standardization: Expanding the conversations. Education and Urban Society, 50(5), 483–501.
Cremin, L. (1961). The transformation of the school: Progressivism in American education, 1876–1957. Vintage Books.
Cubillos, E. M. (1988). The Bilingual Education Act: 1988 legislation. National Clearinghouse for Bilingual Education, 7, 1–26.
Data Accountability Center (2015). Individuals with Disabilities Education Act (IDEA) [data tables for OSEP state reported data]. Retrieved from: http://www2.ed.gov/programs/osepidea/618-data/state-level-data-files/index.html.
De Backer, F., Slembrouck, S., & Van Avermaet, P. (2019). Assessment accommodations for multilingual learners: pupils’ perceptions of fairness. Journal of Multilingual and Multicultural Development, 40(9), 833–846.
De Cohen, C. C. & Clewell, B. C. (2007). Putting English language learners on the educational map: The No Child Left Behind Act implemented. Education in Focus: Urban Institute Policy Brief.
Friedberg, S., Barone, D., Belding, J., Chen, A., Dixon, L., Fennell, F., Fisher, D., Frey, N., Howe, R., & Shanahan, T. (2018). The state of state standards post-common core. Thomas B. Fordham Institute.
Gándara, P. & Baca, G. (2008). NCLB and California’s English language learners: The perfect storm. Language Policy, 7, 201–216.

García, O., Kleifgen, J. A., & Falchi, L. (2008). From English language learners to emergent bilinguals. Equity Matters, Teachers College, Columbia University.
Geva, E., & Zadeh, X. Y. (2006). Reading efficiency in native English speaking and English-as-a-second-language children: The role of oral proficiency and underlying cognitive-linguistic processes. Scientific Studies of Reading, 10(1), 31–57.
Gholson, M. L., & Guzman-Orth, D. (2019). Developing an alternate English language proficiency assessment system: A theory of action (Research Report No. RR-19-25). Educational Testing Service. https://doi.org/10.1002/ets2.12262.
Goldstein, A. A. (1997). Design for increasing the participation of students with disabilities and limited English proficient students in the National Assessment of Educational Progress (NAEP). Paper presented at the annual meeting of the American Educational Research Association, Chicago.
Hakuta, K., Butler, Y. G., & Witt, D. (2000). How long does it take English learners to attain proficiency? Policy Report 2000–2001. The University of California Linguistic Minority Research Institute. Stanford University.
Hanushek, E. A., Peterson, P. E., Talpey, L. M., & Woessmann, L. (2019). The unwavering SES achievement gap: Trends in U.S. student performance. National Bureau of Economic Research.
Herrnstein, R. J. & Murray, C. (1994). The bell curve: Intelligence and class structure in American life. Free Press.
Hutt, E. & Schneider, J. (2017). A history of achievement testing in the United States or: Explaining the persistence of inadequacy. Teachers College Record.
Kamin, L. J. (1974). The science and politics of IQ. Social Research, 41(3), 387–425.
Kim, K. H. & Zabelina, D. (2015). Cultural bias in assessment: Can creativity assessment help? International Journal of Critical Pedagogy, 6(2), 129–148.
Klein, A. (2014). Historic summit fueled push for K–12 standards. Education Week, 34(5), 18–20.
Kruse, A. J. (2016). Cultural bias in testing: A review of literature and implications for music education. National Association for Music Education, 35(1), 23–31.
LaCelle-Peterson, M. & Rivera, C. (1994). Is it real for all kids? A framework for equitable assessment policies for English language learners. Harvard Educational Review, 64(1), 55–76.
Lane, S. & Leventhal, B. (2015). Psychometric challenges in assessing English language learners and students with disabilities. Review of Research in Education, 39(1), 165–214.
Li, H. & Suen, H. K. (2012). The effects of test accommodations for English language learners: A meta-analysis. Applied Measurement in Education, 25(4), 327–346.
Linn, R. L. & Gronlund, N. E. (1995). Measurement and assessment in teaching. Prentice-Hall International.
Liu, K. K., Thurlow, M. L., Press, A. M., & Lickteig, O. (2018). A review of the literature on measuring English language proficiency progress of English learners with disabilities and English learners. NCEO Report 408. National Center on Educational Outcomes. Retrieved from https://files.eric.ed.gov/fulltext/ED591957.pdf.
Liu, K. K., Ward, J. M., Thurlow, M. L., & Christensen, L. L. (2015). Large-scale assessment and English language learners with disabilities. Educational Policy, 31(5), 551–583.
Lohman, D. F., Korb, K. A., & Lakin, J. M. (2008). Identifying academically gifted English language learners using nonverbal tests: A comparison of the Raven’s, NNAT, and CogAT. Gifted Child Quarterly, 52(4), 275–296.
McFarland, J., Hussar, B., Zhang, J., Wang, X., Wang, K., Hein, S., Diliberti, M., Forrest Cataldi, E., Bullock Mann, F., & Barmer, A. (2019). The condition of education 2019 (NCES 2019–2144). U.S. Department of Education. Washington, DC: National Center for Education Statistics. Retrieved from https://nces.ed.gov/pubsearch/pubsinfo.asp?pubid=2019144.
Mclean, Z. Y. (1995). History of bilingual assessment and its impact on best practices used today. New York State Association for Bilingual Education Journal, 10, 6–12.
Menken, K. (2010). No Child Left Behind (NCLB) and English Language Learners: Challenges and Consequences. Theory into Practice, 49(2), 121–128.
Millard, M. (2015). State funding mechanisms for English language learners. Education Commission of the States.
Mitchell, C. (2018). English-learners and ESSA: Many states are lowering academic goals, advocates charge. Education Week.
Mojica, T. C. (2013). An examination of English language proficiency and achievement test outcomes. [Doctoral dissertation, Temple University]. ProQuest Dissertations Publishing.
Moshayedi, S. (2018). Elementary principal leadership and learning outcomes for low socioeconomic status Hispanic English learners. [Doctoral dissertation, University of Southern California]. ProQuest Dissertations Publishing.
National Center for Education Statistics (NCES) (2020a). English language learners in public schools. Institute of Educational Science, NCES. Retrieved from https://nces.ed.gov/programs/coe/indicator_cgf.asp.
National Center for Education Statistics (NCES) (2020b). 2019 Reading grades 4 and 8 assessment report cards: Summary data tables for national and state average scores and NAEP achievement level results. Institute of Educational Science, NCES. Retrieved from https://www.nationsreportcard.gov/reading/supportive_files/2019_Results_Appendix_Reading_State.pdf.
Nation’s Report Card (2020). National Assessment of Educational Progress (NAEP) report card: Mathematics, National student group scores and score gaps. Retrieved from https://www.nationsreportcard.gov/mathematics/nation/groups/?grade=8.
Park, Y. & Thomas, R. (2012). Educating English-language learners with special needs: Beyond cultural and linguistic considerations. Journal of Education and Practice, 3(9), 52–58.
Raven, J. C. (2003). Raven’s Coloured Progressive Matrices (CPM). Pearson.
Rentner, D. S., Kober, N., & Braun, M. (2019). State leader interviews: How states are responding to ESSA’s evidence requirements for school improvement. Center on Education Policy, George Washington University.
Romberg, T. (1993). National Council of Teachers of Mathematics (NCTM) standards: A rallying flag for mathematics teachers. Educational Leadership, 50(5), 36–41.
Roohr, K. C. & Sireci, S. G. (2017). Evaluating computer-based test accommodations for English learners. Educational Assessment, 22(1), 35–53.
Shin, N. (2018). The effects of the initial English language learner classification on students’ later academic outcomes. Educational Evaluation and Policy Analysis, 40(2), 175–195.
Slavin, R., & Cheung, A. (2005). A synthesis of research on reading instruction for English language learners. Review of Educational Research, 75(7), 247–284.
Stancavage, F., McLaughlin, D., Vergun, R., Godlewski, C., & Allen, J. (1996). Study of exclusion and assessability of students with limited English proficiency in the 1994 trial state assessment of the National Assessment of Educational Progress. In National Academy of Education (Eds.). Quality and Utility: The 1994 Trial State Assessment in Reading (pp. 172–175). National Academy of Education, Stanford.
Stewner-Manzanares, G. (1988). The Bilingual Education Act: Twenty years later. National Clearinghouse for Bilingual Education, 6, 1–10.

Sutton, L. C., Cornelius, L., & McDonald-Gordon, R. (2012). English language learners and judicial oversight: Progeny of “Castaneda.” Educational Considerations, 39(2), 30–37.
Tanenbaum, C. & Anderson, L. (2010). Title III Accountability and District Improvement Efforts: A Closer Look. ESEA Evaluation Brief: The English Language Acquisition, Language Enhancement, and Academic Achievement Act. U.S. Department of Education.
Thurlow, M. L. & Kopriva, R. J. (2015). Advancing accessibility and accommodations in content assessments for students with disabilities and English learners. Review of Research in Education, 39(1), 331–369.
Turkan, S. & Oliveri, M. E. (2014). Considerations for providing test translation accommodations to English language learners on common core standards-based assessments. ETS RR-14-05. Educational Testing Service Research Report Series.
Tyack, D. B. (1974). The one best system: A history of American urban education. Harvard University Press.
UnidosUS. (2018). English learners and the Every Student Succeeds Act: A tool for advocates in California. Retrieved from http://publications.unidosus.org/bitstream/handle/123456789/1876/CA_EL_EquityReportFINAL.pdf?sequence=1&isAllowed=y.
U.S. Congress, Office of Technology Assessment (1992). Testing in American schools: Asking the right questions, OTA-SET-519. Washington, DC: U.S.
U.S. Department of Education (2017). Every Student Succeeds Act Assessments under Title I, Part A & Title I, Part B: Summary of Final Regulations. Retrieved from https://www2.ed.gov/policy/elsec/leg/essa/essaassessmentfactsheet1207.pdf.
U.S. Department of Education, Institute of Education Sciences (IES), & What Works Clearinghouse (WWC) (2012). WWC review of the report: Accommodations for English language learner students: The effect of linguistic modification of math test item sets.
U.S. Department of Education & National Center for Education Statistics (NCES) (2015a). Common Core of Data (CCD), State Nonfiscal Survey of Public Elementary/Secondary Education, 1990–1991 through 2012–2013; and State Public Elementary and Secondary Enrollment Projection Model, 1980 through 2024. Retrieved from: http://nces.ed.gov/programs/digest/d1.
U.S. Department of Education & National Center for Education Statistics (NCES) (2015b). Common Core of Data (CCD), Local Education Agency Universe Survey, 2002–2003 through 2012–2013. Retrieved from: https://nces.ed.gov/programs/digest/d14/tables/dt14_204.20.asp.
U.S. Department of Education, Office of English Language Acquisition (OELA) (2017). English Learner Tool Kit (Rev. ed.). Washington, DC: U.S.
U.S. Department of Education, Office of English Language Acquisition (OELA) (2019). Languages spoken by English learners (ELs): Fast facts. Retrieved from https://ncela.ed.gov/files/fast_facts/olea-top-languages-fact-sheet-20191021-508.pdf.
Vernon, P. E. (1969). Intelligence and cultural environment. Routledge.
Wade, D. L. (1980). Racial discrimination in IQ testing: Larry P. v. Riles. DePaul Law Review, 29(4), 1–25.
Wiese, A. & Garcia, E. E. (2001). The Bilingual Education Act: Language minority students and US federal educational policy. International Journal of Bilingual Education and Bilingualism, 4(4), 229–248.
Willner, L. S., Rivera, C., & Acosta, B. D. (2008). Descriptive Study of State Assessment Policies for Accommodating English Language Learners. Center for Equity and Excellence in Education, George Washington University.

Willner, L. S., Rivera, C., & Acosta, B. D. (2009). Ensuring accommodations used in content assessments are responsive to English-language learners. The Reading Teacher, 62(8), 696–698.
Wright, W. E. (2005). Evolution of federal policy and implications of No Child Left Behind for language minority students. Policy brief. University of Texas at San Antonio.
Wolf, M. K., Kim, J., & Kao, J. (2012). The effects of glossary and read-aloud accommodations on English language learners’ performance on a mathematics assessment. Applied Measurement in Education, 25(4), 347–374.
Wolf, M. K., Kim, J., Kao, J. C., & Rivera, N. M. (2009). Examining the effectiveness and validity of glossary and read-aloud accommodations for English language learners in a math assessment (CRESST Report 766). University of California, National Center for Research on Evaluation, Standards, and Student Testing (CRESST).
Zajda, J. (2019). Current research of theories and models of intelligence. Curriculum and Teaching, 34(1), 87–108.
Zascavage, V. (2010). Elementary and Secondary Education Act. In T. C. Hunt, J. C. Carper, T. J. Lasley II, & D. Raisch (Eds.), Encyclopedia of Educational Reform and Dissent (pp. 338–340). http://dx.doi.org/10.4135/9781412957403.n149.

6 EVOLVING NOTIONS OF FAIRNESS IN TESTING IN THE UNITED STATES

Stephen G. Sireci and Jennifer Randall¹

A colleague from Spain spent a sabbatical with us at the University of Massachusetts a few years ago. When asked about what his son learned while going to school in the United States, his response was, “He now says, ‘It’s not Fair!’” Although fairness is by no means a uniquely American concept, it is engrained in our Constitution and is the key criterion on which almost every social and judicial issue is debated. Simply put, fairness is the lens through which we view almost every action that influences our lives. With respect to educational and psychological testing, claims have been made that tests promote fairness. Tests can provide a level playing field for the pursuit of many desirable goals such as promotion, admission, and scholarship. Although such leveling of the playing field is true in many cases, and tests have certainly promoted fairness in many realms; a review of the history of testing in the U.S. includes many examples of tests being used to create, promote, and maintain intentionally unfair systems of oppression. Thus, the history of fairness in testing is incomplete without also discussing this history of explicit unfairness. In this chapter, we review the history of fairness in testing by first discussing non-technical and technical notions of fairness, and then use historical examples to illustrate both the incorporation of principles of fairness into the testing process and the intentional illusion of fairness in testing to achieve nefarious objectives. We discuss historical notions of fairness presented both in defense of and in opposition to test use and testing practices. We juxtapose these testing practices and their intended fairness/unfairness against the historical climate and larger educational and social policies. 
After reviewing current definitions of fairness in the general public and in the psychometric literature, we review the evolution of fairness in testing as articulated in the evolution of the Standards for Educational and Psychological Testing

(American Educational Research Association [AERA], American Psychological Association [APA], & National Council on Measurement in Education [NCME], 2014). Next, we move to a brief overview of the initiation and proliferation of standardized tests as a gateway to a more transparent and fair system of assessment. We trace this movement and the critiques of it as it relates to fairness. Within this context, we also touch upon the measurement field’s approach to both ensuring and evaluating test fairness. Throughout the chapter, we pay special attention to issues of fairness/unfairness in testing practices with respect to historically marginalized populations and discuss the impact of these (historical) testing practices on current perceptions of test fairness in these populations.

Definitions of Fairness

Non-Technical Definitions

Before describing how fairness has been defined in the psychometric literature, we first illustrate how fairness is defined in general. In the first edition of the American Dictionary of the English Language (Webster, 1828), Noah Webster defined fairness as:

1. Clearness; freedom from spots or blemishes; whiteness; as the fairness of skin or complexion.
2. Clearness; purity; as the fairness of water.
3. Freedom from stain or blemish; as the fairness of character or reputation.
4. Beauty; elegance; as the fairness of form.
5. Frankness; candor; hence, honesty; ingenuousness; as fairness in trade.
6. Openness; candor; freedom from disguise, insidiousness or prevarication; as the fairness of an argument.
7. Equality of terms; equity; as the fairness of a contract.
8. Distinctness; freedom from blots or obscurity; as the fairness of hand-writing; the fairness of a copy.

Five of the definitions refer to appearance, while three refer to fairness of a process or action. Almost one hundred years later, Webster’s Collegiate Dictionary (Webster, 1923) defined fairness simply as, “State or quality of being fair” (p. 362), and provided 13 definitions for “fair,” ten of which refer to appearance or weather conditions. The two definitions relevant to fairness in conduct are:

1. Characterized by frankness, honesty or impartiality; open; just
2. Open to legitimate pursuit; chiefly in fair game

At first blush it appears the “equality” connotation of fairness present in 1828 was absent by 1923; however, the synonyms given for “fair” in the 1923

dictionary (just, equitable, unprejudiced, impartial, unbiased, disinterested) emphasized justice and equity. In fact, the commentary on these synonyms reads as follows:

Fair, impartial, unbiased, disinterested imply freedom from undue influence. Fair implies, negatively, absence of injustice or fraud; positively, the putting of all things on an equitable footing. Impartial implies absence of a favor for one party more than the other; Unbiased expresses even more strongly lack of prejudice or prepossession. Disinterested denotes that freedom from bias due to the absence of selfish interest. (p. 362, emphases in original)

These descriptions seem consistent with general notions of fairness in any assessment or competition. Today, the online version of Webster’s Dictionary has the same definition for fairness as in 1923 (i.e., state or quality of being fair), and the first definition provided for the adjective “fair” is “marked by impartiality and honesty: free from self-interest, prejudice, or favoritism” (Merriam-Webster, 2020). The second definition provided is, “conforming with the established rules.” The synonyms provided in 2020 are similar to those found in 1923 (i.e., just, equitable, impartial, unbiased, dispassionate, objective), but it is interesting to note that the meanings of fairness associated with appearance became secondary definitions over time, while those pertaining to equity and justice became primary; and “objective” was added to the list of synonyms.

To summarize non-technical perceptions of fairness, the general meaning of the word refers to impartial treatment that is free from biases such as favoritism, nepotism, or racism. It is with these biases in mind that conceptualizations of fairness in educational testing emerged, which we address next.

Psychometric Definitions of Fairness

Samuel Messick, one of the most prolific and respected validity theorists of all time, claimed that problems in fair and valid assessment arise from either construct underrepresentation or construct-irrelevant variance. As he put it, “Tests are imperfect measures of constructs because they either leave out something that should be included … or else include something that should be left out, or both” (Messick, 1989, p. 34). Construct underrepresentation refers to a situation in which a test measures only a portion of the intended construct and leaves important knowledge, skills, and abilities untested. Construct-irrelevant variance refers to a situation in which the test measures other constructs that are irrelevant to what is intended to be measured. Concerns regarding fairness in educational testing can often be grouped into one of those two categories. Critics of tests who claim unfairness due to the content of the test often claim a test is measuring unimportant skills, rather than

the critical, relevant skills (i.e., construct underrepresentation); or skills that should not be measured at all (i.e., construct-irrelevant variance, for example culture-specific knowledge). However, many other complaints of unfairness in testing refer to the use of a test rather than its content. These concerns have been outlined by AERA, APA, and NCME in most, but not all, editions of the Standards for Educational and Psychological Testing.

Evolution of the Concept of Fairness in Testing

There have been six versions of professional standards for educational and psychological testing that were developed by a joint committee of AERA, APA, and NCME. The first two versions (APA, 1954; 1966) did not mention fairness at all. It was not until the 1974 version of the Standards (APA, AERA, & NCME, 1974) that concerns for fairness became explicit. The prelude to the 1974 edition essentially described the reason for developing the new edition as addressing concerns regarding test fairness. For example, on the first page of the introduction, it stated:

Part of the stimulus for revision is an awakened concern about problems like an invasion of privacy or discrimination against members of groups such as minorities or women. Serious misuses of tests include, for example, labeling Spanish-speaking children as mentally retarded on the basis of scores on tests standardized on “a representative sample of American children,” or using a test with a major loading on verbal comprehension without appropriate validation in an attempt to screen out large numbers of blacks from manipulative jobs requiring minimal verbal communication. (APA, 1974, p. 1)

This explicit mentioning of concerns for the effects of testing on minority groups was a revolutionary change to these Standards and signaled a major change in the field. Thus, the use of the term “awakened” was an appropriate description. The next version of the Standards (AERA, APA, & NCME, 1985) described the need for the new edition because of “Technical advances in testing and related fields, new and emerging uses of tests, and growing social concerns over the role of testing in achieving social goals” (p. v).
To meet that need, this 1985 edition included two chapters that addressed concerns for specific groups of test takers—“linguistic minorities” and “people who have handicapping conditions.” Although the guidance in this version was consistent with promoting fairness for these populations, the concept of fairness in testing was not mentioned as an explicit goal. That changed with the next edition (AERA, APA, & NCME, 1999), which included a chapter on “Fairness in Testing and Test Use.” This 1999 edition acknowledged the complexity of fairness at the outset by stating,

Concern for fairness in testing is pervasive, and the treatment accorded the topic here cannot do justice to the complex issues involved. A full consideration of fairness would explore the many functions of testing in relation to its many goals, including the broad goal of achieving equality of opportunity in our society. … The Standards cannot hope to deal adequately with all these broad issues, some of which have occasioned sharp disagreement among specialists and other thoughtful observers. Rather, the focus of the Standards is in those aspects of tests, testing, and test use that are the customary responsibilities of those who make, use, and interpret tests, and that are characterized by some measure of professional and technical consensus. (p. 73)

These 1999 Standards claimed that the term “fairness…has no single technical meaning” (p. 74), and instead presented four views of fairness; specifically, fairness (a) as a lack of bias, (b) as equitable treatment in the testing process, (c) as equality in outcomes of testing, and (d) as opportunity to learn. This version of the Standards also pointed out that “Absolute fairness for every examinee is impossible to attain, if for no other reasons than the facts that tests have imperfect reliability and that validity in any particular context is a matter of degree” (p. 73). They also suggested that “fairness in testing in any given context must be judged relative to that of feasible test and non-test alternatives” (p. 73). With respect to the view of fairness as a lack of bias, the 1999 Standards made the point that “consideration of bias is critical to sound testing practice” (p. 74). Similarly, with respect to the view that fairness relates to equitable treatment in the testing process, these Standards stated, “just treatment throughout the testing process is a necessary condition for test fairness” (p. 74).
Those two views remain uncontroversial; however, fairness as equality of testing outcomes is controversial, and fairness as opportunity to learn brings up additional issues. With respect to equality of outcomes, the 1999 Standards stated, Many testing professionals would agree that if a test is free of bias and examinees have received fair treatment in the testing process, then the conditions of fairness have been met. That is, given evidence of the validity of intended test uses and interpretations, including evidence of lack of bias and attention to issues of fair treatment, fairness has been established regardless of group-level outcomes. This view need not imply that unequal testing outcomes should be ignored altogether. They may be important in generating new hypotheses about bias and fair treatment. But in this view, unequal outcomes at the group level have no direct bearing on questions of test fairness. (p. 76) The logic underlying this view is that observed mean test score differences across groups defined by racial or other demographic variables may reflect true

differences across groups, rather than imperfections of the measurement properties of the test. However, the idea of true differences across groups defined by characteristics such as race or culture is seen by some as reflecting construct-irrelevant variance (Helms, 2006), and so remains a fairness issue. The view of fairness as opportunity to learn refers primarily to achievement testing where students may be tested on material on which they were not instructed; in many cases the majority of students have had the opportunity to learn all content tested, but some students, typically those from minority racial/cultural backgrounds, have not. This issue arose in the case of Debra P. v. Turlington (1981), in which it was established that students must be taught the knowledge and skills measured on a test before they take it, if the test has consequences for students such as receiving a high school diploma (Sireci & Parker, 2006). However, the issue of opportunity to learn also brushes up against differences in values regarding what should be taught and tested. Thus, the opportunity to learn issue overlaps with the long history of privileging certain kinds of knowledge and ignoring or deemphasizing other kinds. These value judgements almost always intersect racial/ethnic lines, which makes fairness issues in opportunity to learn more paramount. The 1999 version of the Standards laid out the broad concerns of fairness in testing and paved the way for broader fairness discussions. The current version of the Standards (AERA et al., 2014) retained the “fairness in testing” chapter, and modified the four views of fairness to fairness (a) in treatment during the testing process, (b) as lack of measurement bias, (c) in access to the construct(s) measured, and (d) as validity of individual test score interpretations for the intended uses. 
The first two views are consistent with the 1999 version, but fairness as equality in outcomes of testing and opportunity to learn were changed to “fairness in interpretation and uses of test scores,” and the idea of fairness as access to the construct measured was added. The 2014 version described fairness in testing as: A test that is fair within the meaning of the Standards reflects the same construct(s) for all test takers, and scores from it have the same meaning for all individuals in the intended population; a fair test does not advantage or disadvantage some individuals because of characteristics irrelevant to the intended construct. (AERA et al., 2014, p. 50) With respect to fairness in treatment during the testing process, the 2014 Standards stated: Although standardization has been a fundamental principle for assuring that all examinees have the same opportunity to demonstrate their standing on the construct that a test is intended to measure, sometimes flexibility is needed to provide essentially equivalent opportunities for some test takers. (p. 51)

This flexibility primarily referred to providing accommodations to standard testing conditions for students with disabilities (Randall & Garcia, 2016; Sireci, 2005; Sireci & O’Riordan, 2020). The 2014 Standards essentially supported the notion of modifying standardized testing conditions in pursuit of more valid assessments, as long as the accommodation was not thought to alter the construct measured. As they put it, “greater comparability of scores may be attained if standardized procedures are changed to address the needs of specific groups or individuals without any adverse effects on the validity or reliability of the results obtained” (p. 51). The specific groups mentioned were students with disabilities and linguistic minorities (e.g., English learners). Fairness with respect to access to the construct measured is also relevant to assessing students with disabilities and linguistic minorities. The AERA et al. (2014) Standards described access to the construct(s) measured as, “the notion that all test takers should have an unobstructed opportunity to demonstrate their standing on the construct(s) being measured” (p. 49). They further stated, “access to the construct the test is measuring can be impeded by characteristics and/or skills that are unrelated to the intended construct and thereby can limit the validity of score interpretations” (p. 50). With respect to fairness as lack of measurement bias, the 2014 Standards stated, “Characteristics of the test itself that are not related to the construct being measured, or the manner in which the test is used, may sometimes result in different meanings for scores earned by members of different identifiable subgroups” (p. 51). 
To evaluate fairness in this context (i.e., comparability of test score meaning across subgroups of examinees), the Standards mention specific statistical analyses such as differential item functioning (Clauser & Mazor, 1998; Sireci & Rios, 2013), differential test functioning (Stark, Chernyshenko, & Drasgow, 2004), predictive bias (e.g., Linn, 1984; Sireci & Talento-Miller, 2006), and construct equivalence (van de Vijver & Leung, 2011). Differential item functioning (DIF) refers to a situation in which test takers who are considered to be of equal proficiency on the construct measured, but who come from different groups, have a different probability of earning a particular score on a test item. DIF is a statistical observation that involves matching test takers from different groups on the characteristic measured and then looking for performance differences on an item. Item bias is present when an item has been statistically flagged for DIF and the reason for the DIF is traced to a factor irrelevant to the construct the test is intended to measure. Therefore, for item bias to exist, a characteristic of the item that is unfair to one or more groups must be identified. Interpretation of the causes of DIF is usually part of the process of “sensitivity review” of test material that can happen both before and after a test is administered (Ramsey, 1993; Sireci & Mullane, 1994). Differential test functioning essentially aggregates the results of DIF analyses to the test level. Potential unfairness due to predictive bias is typically evaluated statistically via analysis of differential predictive validity. Predictive validity is the degree to which

test scores accurately predict scores on a criterion measure. Differential predictive validity investigates whether the relationship between test and criterion scores is consistent across examinees from different groups. To conduct an analysis of differential predictive validity, data on a relevant external criterion must be available for two or more groups (e.g., minority and non-minority test takers). Most studies of differential predictive validity have been conducted on admissions tests, with grade-point-average as the validation criterion (e.g., Koenig et al., 1998; Sireci & Talento-Miller, 2006). A lack of differential predictive validity is often taken as evidence of a lack of bias (and hence, fairness) in testing. However, that conclusion is controversial because of the limitations of statistical analysis of test bias (Linn, 1984; Helms, 2006). A lack of construct equivalence as a source of unfairness in assessment is usually evaluated by statistical analysis of the invariance of the dimensionality (internal structure) of test data across subgroups of examinees (Berman, Haertel, & Pellegrino, 2020; Winter, 2010). Differential test structure may suggest construct nonequivalence across groups. That is, the test may be measuring something different in one group relative to another. DIF procedures also investigate a lack of equivalence across subgroups, but at the item level, rather than at the structural (dimensionality) level. The 2014 Standards’ final view of fairness as “validity of individual test score interpretations for the intended uses” calls attention to the fact that testing situations may interact with many personal characteristics of examinees that could affect interpretation of their test scores.
The Standards point out, “it is particularly important, when drawing inferences about an examinee’s skills or abilities, to take into account the individual characteristics of the test taker and how these characteristics may interact with the contextual features of the testing situation” (p. 53). This general caution in interpreting test scores reminds us the Standards are not able to identify all fairness issues that may apply to some individuals. However, in that same section, the Standards deviate from interpretation of individual scores to the interpretation of scores for people from identifiable subgroups. For example, in describing observed score differences across groups, such as those resulting in adverse impact, they state: the Standards’ measurement perspective explicitly excludes one common view of fairness in public discourse: fairness as the equality of testing outcomes for relevant test-taker subgroups. Certainly, most testing professionals agree that group differences in testing outcomes should trigger heightened scrutiny for possible sources of test bias. … However, group differences in outcomes do not in themselves indicate that a testing application is biased or unfair. (p. 54)

Thus, as was the case with the 1999 version, the idea of fairness as equality in outcomes was not endorsed by the 2014 Standards. Essentially, the logic is that if no sources of bias can be found (that is, if there is no evidence of unfairness), group differences may simply reflect true group differences. The obvious counterargument to that logic is that bias exists but its source has not yet been found. That issue brings us to an intersection of test score validity and fairness, which we discuss next.
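The matching logic behind DIF described earlier in this section can be made concrete with a small sketch. The Python code below is illustrative only: it uses the Mantel-Haenszel common odds ratio, one widely used DIF statistic that this chapter does not single out, and the function names and toy counts are assumptions invented for the example. Test takers are grouped into strata by their matching score, and within each stratum the odds of answering the item correctly are compared for a reference group and a focal group.

```python
import math
from collections import defaultdict


def mantel_haenszel_odds_ratio(records):
    """Return the Mantel-Haenszel common odds ratio for one item.

    records: iterable of (group, matching_score, correct) tuples, where
    group is "reference" or "focal" and correct is 1 (right) or 0 (wrong).
    """
    # One 2x2 table per matching-score stratum:
    # [reference right, reference wrong, focal right, focal wrong]
    tables = defaultdict(lambda: [0, 0, 0, 0])
    for group, score, correct in records:
        column = (0 if group == "reference" else 2) + (0 if correct else 1)
        tables[score][column] += 1

    numerator = denominator = 0.0
    for ref_right, ref_wrong, foc_right, foc_wrong in tables.values():
        n = ref_right + ref_wrong + foc_right + foc_wrong
        numerator += ref_right * foc_wrong / n
        denominator += ref_wrong * foc_right / n
    if denominator == 0:
        raise ValueError("odds ratio is undefined for these counts")
    return numerator / denominator


def mh_d_dif(odds_ratio):
    """Re-express the odds ratio on the ETS delta scale.

    Negative values mean the item is harder for matched focal-group members.
    """
    return -2.35 * math.log(odds_ratio)


def expand(counts):
    """Turn (score, ref_right, ref_wrong, foc_right, foc_wrong) rows into records."""
    records = []
    for score, rr, rw, fr, fw in counts:
        records += [("reference", score, 1)] * rr + [("reference", score, 0)] * rw
        records += [("focal", score, 1)] * fr + [("focal", score, 0)] * fw
    return records


# Hypothetical counts, invented for illustration only.
no_dif_counts = [(1, 20, 20, 10, 10), (2, 30, 10, 15, 5)]
dif_counts = [(1, 20, 20, 5, 15), (2, 30, 10, 10, 10)]

for label, counts in [("matched groups perform alike", no_dif_counts),
                      ("focal group does worse when matched", dif_counts)]:
    alpha = mantel_haenszel_odds_ratio(expand(counts))
    print(f"{label}: odds ratio = {alpha:.2f}, MH D-DIF = {mh_d_dif(alpha):.2f}")
```

An odds ratio near 1 means matched test takers from both groups perform alike on the item; a ratio well away from 1 is a statistical flag only. As the text notes, calling the item biased still requires tracing the flagged difference to a factor irrelevant to the construct.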

Validity and Fairness

The 2014 Standards claim that “Fairness is a fundamental validity issue and requires attention throughout all stages of test development and use” (p. 49). These Standards define validity as “the degree to which evidence and theory support the interpretations of test scores for proposed uses of tests” (p. 11). From these descriptions it is clear the Standards consider fairness to be interlinked with validity. However, Helms (2006) pointed out that “Validity evidence is necessary but not sufficient for proving fairness” (p. 848). To illustrate that point, she developed the “individual differences fairness model” (Helms, 2004, 2006), in which she defines fairness in testing as:

the removal from test scores of systematic variance, attributable to the test takers’ psychological characteristics, developed in response to socialization practices or environmental conditions, that are irrelevant to measurement of the construct of interest to the test user or assessor. … The term psychological characteristics refers to any of the aspects of people that are considered to be central to the work of psychologists, including attitudes, behaviors, emotions, and psychological processes. (p. 847)

Helms’ individual differences fairness model focuses primarily on racial/ethnic differences in test scores, such as adverse impact (large group differences in passing rates or selection rates across minority and nonminority groups). She points out that when large test score differences across racial/ethnic groups are observed, the Standards call for research to be conducted, but such research is very hard to find. As she put it,

It is clear that the Standards for Educational and Psychological Testing treats subgroups’ mean scores as potential sites for unfairness, but it does not provide pragmatic strategies for identifying or removing unfairness from individual test takers’ scores if construct-irrelevant variance is discovered. (Helms, 2006, p. 846)

Helms further argued that the traditional models of test fairness that focus on differential predictive validity miss the mark, because

the focus of these models is on the adverse consequences of using potentially unfair test scores rather than the consequences of such scores. … This model differs from the [individual differences] model of fairness in that they erroneously treat racial groups as meaningful constructs. (p. 848) Instead of studies of differential predictive validity and measurement invariance, she calls for “Replacement of racial and ethnic categories with cultural constructs derived from conceptual frameworks [as] a necessary condition for fair assessment” (p. 848). Helms (2006) summarized traditional and individual difference models of fairness by stating: Fairness, validity, and test bias are often used interchangeably in the measurement literature where racial or cultural constructs are concerned. One problem with such usage is that psychologists erroneously believe that when they have studied or have evidence of validity, they also have evidence of fairness and lack of test bias. Yet although fairness and validity are related, they are not interchangeable. Similarly, fairness and lack of test bias are related, but they need not be synonymous. (p. 848)

Summarizing psychometric perspectives on fairness

To summarize the psychometric conceptualizations of fairness, concerns over testing fairness have evolved to the point where the professional standards in the measurement field require concerns of fairness be considered from the earliest stages of test construction (i.e., defining testing purposes, defining the construct to be measured, developing and evaluating items; Bond, 1987) and throughout the test administration, scoring, and interpretation processes. Although we focused on fairness as described in the Standards for Educational and Psychological Testing, other professional guidelines (e.g., International Test Commission, 2013) essentially parallel the guidance provided in the AERA, APA, and NCME Standards. The evolution of fairness in the AERA et al. Standards has made concerns for fairness in testing linguistic minorities and individuals with disabilities paramount, and has explicitly discarded the notion of equality in outcomes as a requirement for test fairness. To address issues of fairness, the 2014 Standards (and the International Test Commission’s 2013 Guidelines on Test Use) emphasize and encourage comprehensive statistical evaluations of specific aspects of fairness such as differential item functioning, differential prediction, and a lack of construct equivalence. Thus, tools are available for evaluating many aspects of test fairness. However, in a review of fairness practices, Johnson, Trantham, and Usher-Tate (2019) found

that the guidance and statistical analyses recommended by these standards and guidelines are not typically followed in practice.

Thus far, we have discussed the history of fairness in testing with an eye toward guidelines and methodological advancements. We have highlighted how the field of educational measurement has conceived of “fairness” over the last century and its efforts to dictate standards for best practices necessary to ensure this fairness. The remainder of this chapter focuses on the history of fairness in testing in practice, both before and after the establishment of testing standards. We provide the reader with specific examples that represent the constantly evolving notions of fairness, highlighting how these notions can vary (or have historically varied) across communities, often depending upon one’s socio-cultural identity. In other words, we discuss how the very act of testing has been used to promote both fair and unfair purposes.
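Differential prediction, one of the statistical evaluations named in this summary, can be illustrated with a minimal sketch. The Python code below assumes a simple regression comparison, which is one common way such analyses are framed rather than a procedure specified in this chapter; the group labels, scores, and criterion values are hypothetical and chosen only to make the arithmetic easy to follow. It fits a separate prediction line for each group and then checks how a single common line over- or under-predicts each group's criterion scores.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y on x; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def mean_residual(xs, ys, slope, intercept):
    """Average of (observed - predicted); positive means under-prediction."""
    return sum(y - (intercept + slope * x) for x, y in zip(xs, ys)) / len(xs)


# Hypothetical test scores (x) and criterion scores such as GPA (y).
scores = [2.0, 4.0, 6.0, 8.0]
groups = {
    "group_a": (scores, [2.0, 3.0, 4.0, 5.0]),
    "group_b": (scores, [1.6, 2.6, 3.6, 4.6]),
}

# A single prediction line fitted to everyone, ignoring group membership.
all_x = [x for xs, _ in groups.values() for x in xs]
all_y = [y for _, ys in groups.values() for y in ys]
common_slope, common_intercept = fit_line(all_x, all_y)

results = {}
for name, (xs, ys) in groups.items():
    slope, intercept = fit_line(xs, ys)
    results[name] = {
        "slope": slope,
        "intercept": intercept,
        "mean_residual": mean_residual(xs, ys, common_slope, common_intercept),
    }
    print(name, {key: round(value, 3) for key, value in results[name].items()})
```

In this toy data the two groups share a slope but differ in intercept, so the common line under-predicts the criterion for one group and over-predicts it for the other. A real study would also test whether such differences exceed sampling error, and, as the chapter stresses, the absence of differential prediction would not by itself establish fairness.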

The De-Evolution of Intelligence Testing

A discussion of the history of fairness in testing must include an historical account of the development and use of what would eventually be called IQ tests. In 1904 Alfred Binet, Director of Experimental Psychology at the University of Paris, was charged by the minister of public education in Paris to develop an assessment protocol that would identify students in need of special education. His goal was to separate intelligence from instruction to avoid privileging students who had access to private schools, tutors, and educated parents. Thus, his goal was consistent with a pursuit of fairness in testing that would provide an objective and impartial system to match students to the instruction that was best suited to their needs. It is important to note that Binet was completely opposed to the use of the Binet-Simon scale for measuring intelligence. Gould (1996) outlined Binet’s three cardinal principles for using his tests:

1. The scores are a practical device; they do not buttress any theory of intellect. They do not define anything innate or permanent. We may not designate what they measure as “intelligence” or any other reified entity.
2. The scale is a rough, empirical guide for identifying mildly retarded and learning-disabled children who need special help. It is not a device for ranking normal children.
3. Whatever the cause of difficulty in children identified for help, emphasis shall be placed upon improvement through special training. Low scores shall not be used to mark children as innately incapable. (p. 185)

Despite these principles, the Binet scales were used and modified to measure what was eventually conveniently titled “intelligence.” The American version of these scales developed by Lewis Terman (see Terman, 1924; Terman & Childs,

1912) was embraced by the eugenics movement, which used them to classify groups of people as mentally superior or inferior (Gould, 1996). Essentially, Binet’s scales were used and modified to prevent and disrupt the immigration of undesirable populations, to establish and maintain a social class system, and, further, to deceptively perpetuate racial/ethnic biases and stereotypes. Fairness concerns over the degree to which these tests were measuring dominant American culture and acquired knowledge were not seriously considered until decades later. H. H. Goddard, Director of Research at the Vineland Training School for Feeble-minded Girls and Boys in New Jersey, was the first researcher to popularize the use of the Binet scale in the U.S. and, unfortunately, the first to misuse the scale. Instead of simply using Binet’s scale to rank-order students and identify those just below the normal range to support instruction, Goddard regarded individual scores as a measure of some single and innate entity that could be used to segregate and prevent inter-group propagation. In 1913, he deployed two women to Ellis Island for several weeks with the task of identifying the feeble-minded by sight alone. During that time over 150 immigrants from four ethnicities (Jews, Hungarians, Italians, and Russians) were tested, most of whom spoke no English, were poor, and had never attended school. They were taken immediately upon arriving at Ellis Island and asked to perform a series of tasks with little understanding of the purpose or use of the information to be gleaned from these tasks. Contrary to contemporary concerns over test fairness, individuals who had never held a pen or pencil were asked to draw from memory a picture they saw briefly or to identify the date after weeks on a grueling voyage. As a result, the vast majority (over 75% of each group) were deemed feeble-minded (below mental age 12).
Goddard used his “findings” to argue for more restrictive admission practices, although he did maintain the need to secure a work force of what he referred to as morons (i.e., high-grade defectives who could be trained to function in society). In 1928 Goddard reversed his position to better align it with that of Binet, noting that moronism was, indeed, curable and that morons did not need to be segregated from society (i.e., institutionalized). Around the same time, Terman, a Stanford University professor, popularized Binet’s scale in America, extending it in 1916 from the mid-teenage years to “superior adults.” This revised scale would come to be known as the Stanford-Binet and serve as the prototype for most IQ tests to follow. It was Terman who scaled the test so that average children would score 100 and the standard deviation would be 15 or 16 points. Although Terman agreed with Binet (to an extent) that the scale was best used to identify what he referred to as “high-grade defectives,” he did not want to do so for instructional purposes. Instead, he proposed identifying these individuals for the sole purpose of curtailing their reproductive freedom in an effort to eliminate crime, poverty, and inefficiency. Colonel Robert M. Yerkes, working with Terman, Goddard, and other hereditarians in 1917, developed army mental tests (cf. Bunch, Ch. 4, this volume).

Evolving Notions of Fairness in Testing 123

Recruits who were literate were administered the Army Alpha Exam, and recruits who could not read or who failed the Alpha Exam were administered the Army Beta Exam. Those who failed the Army Beta Exam were examined individually using a modified version of the Binet scales. Both the Alpha (8 parts) and Beta (7 parts) exams took less than an hour to complete. The Alpha exam included items such as completing a number sequence (18 14 17 13 16 12), unscrambling sentences (happy is man sick always a), and analogies (eye — head. window — key, floor, room, door). The Beta exam included tasks requiring examinees to complete X and O patterns, complete pictures, combine shapes (to form a square), and complete mazes. Army psychologists rated each examinee on a scale of A to E, with ratings of D and E representing recruits judged to lack the requisite intellect to engage in activities that required “resourcefulness” or “sustained alertness.” Although the test results were not widely used or accepted by most army personnel (with the exception of identifying a cut score for army officer training), Yerkes did succeed in establishing a need for, and acceptance of, large-scale psychological testing (Yerkes, 1921). Indeed, he was able to administer 1.75 million tests to army recruits during World War I, representing the first mass-produced tests of intelligence. His work represents the first example of large-scale test use and misuse in the U.S. The rapid devolution of intelligence testing from an objective individualized assessment system meant to inform instructional practices to a large-scale, pseudo-scientific process used to reify racist and xenophobic stereotypes and fears succinctly characterizes the struggle for fairness in the history of testing.

The History of Tests as Tools of Oppression

Although the inappropriateness of the testing practices of Goddard, Terman, and Yerkes has long been established (e.g., Gould, 1996), the history of unfair and unethical testing practices may have begun with these three men, but it did not end with them. In this section, we describe a more recent history in which tests were used to promote and maintain systems of unfairness. The examples we describe pertain to the use of tests to (a) restrict reproductive rights; (b) disenfranchise all Black and some immigrant citizens; (c) restrict the educational and vocational access of minority persons; and (d) promote racialized agendas and reify stereotypes.

Tests Used to Restrict Reproductive Rights

We begin our discussion of unfair test use with a particularly oppressive example related to the use of tests in the eugenics movement: the use of intelligence tests to justify the sterilization of individuals who were judged to be “morons, imbeciles, and idiots.” In fact, in 1927 the U.S. Supreme Court ruled that a state could use compulsory sterilization of the intellectually disabled for the “protection and health of the state” (Buck v. Bell, 1927). This practice, which

124 Stephen G. Sireci and Jennifer Randall

continued in some states through 2012, led to the sterilization of an estimated 60,000 individuals who scored in the mentally retarded range on IQ tests.

Tests Used to Disenfranchise

The history of unfairness is perhaps most conspicuous in the legacy of creating and using tests as barriers to opportunity for historically marginalized populations. Perhaps the best example is the literacy tests used to disenfranchise voters from Black and immigrant communities. Literacy tests, which ostensibly required all would-be voters to read or write a predetermined text to the satisfaction of voter registrars, were purported to be race-neutral (i.e., fair), but in practice these tests allowed each registrar to use whatever subjective measures he chose and to determine how (or even whether) to enforce the tests. From the 1850s until amendments to the Voting Rights Act banned certain discriminatory tests nationwide in 1970, states used literacy tests to disenfranchise otherwise eligible voters, including Asian and Latino immigrants in the western U.S., Southern and Eastern European immigrants in the Northeast, and Black people in the South. In the South in particular, literacy tests were of little to no consequence to White residents but served primarily to bar Black citizens from registering and, consequently, voting. Literacy tests as a requirement for voter registration were originally conceived as a way to combat voter fraud and political insurgents. In the northern U.S., claims of fraud, though rarely substantiated, were often associated with urban Democratic political machines, which were primarily supported by newly arrived European immigrants. By 1900, 25% of voting-age males had been born abroad, two-thirds of them in non-English-speaking countries. Literacy tests, then, became an incredibly efficient way to disenfranchise those immigrants. Ross (2014) noted that in the 1850s, for instance, Irish voters were the targets of the Connecticut and Massachusetts literacy tests.
In fact, Ross pointed out that the nine other non-Southern states with literacy tests in 1960 (including Arizona, California, and New York) all had high proportions of Italian, Jewish, and other ethnic minority citizens. In the southern U.S., literacy tests were instituted primarily to disenfranchise Black Americans. During Reconstruction, more Black men than White men were registered to vote, and these newly freed slaves exercised their right to vote. After Reconstruction, when former Confederates regained power, southern states began to use literacy tests, which were fair on their face, to revoke the voting rights of Black men. This tactic was, indeed, successful given that 75% of Black Americans (who were formerly slaves and had been legally prohibited from learning to read) were illiterate. Still, this blatant attempt to keep Black men away from the polls had the unintended consequence of barring the 20% of White men who were also illiterate. To address this, states enacted grandfather clauses, which allowed a man to vote if his father or grandfather had voted before 1867. When the U.S. Supreme Court found grandfather clauses to be unconstitutional in 1915, many states (those that heeded the court’s judgment at all) revised


their literacy tests, requiring instead understanding tests (interpreting a passage from a state or federal constitution to the satisfaction of the registrar), good moral character tests (verifying one’s good moral character to the satisfaction of the registrar), and voucher tests (receiving a voucher of character from a currently registered voter). On the surface, these tests were intended to appear race-neutral, but they relied almost entirely on the subjective judgments of registrars, all of whom were White and few of whom were committed to equity. Clearly, contemporary ideas of standardization and fairness in scoring were deliberately absent. In addition to the inherent inequity resulting from registrars’ use of highly subjective and biased evaluation criteria, the very definition of the construct of literacy also varied across states (another violation of contemporary notions of fairness). Indeed, in some cases the construct being measured would be better described as constitutional law, geometry, or trivial nonsense than as basic literacy. For example, a 1965 Alabama literacy test asked 68 questions, including:

A U.S. senator elected at the general election in November takes office the following year on what date?
Appropriation of money for the armed services can be only for a period limited to _____ years.
The only laws which can be passed to apply to an area in a federal arsenal are those passed by ____ provided consent for the purchase of the land is given by the ___.
Name two of the purposes of the U.S. Constitution (Ferris, n.d.).

A 1965 literacy test for the state of Louisiana was similarly far-reaching, with questions such as:

Draw a triangle with a blackened circle that only overlaps its left corner.
Spell backwards, forwards.
Divide a vertical line in two equal parts by bisecting it with a curved horizontal line that is only straight at its spot bisection of the vertical (see Ferris, n.d.).

In sum, the so-called literacy tests barely tested a scientifically defined construct (literacy), were subject to questionable interpretation (i.e., lacked objectivity in scoring), and were intended to disenfranchise Black Americans and some immigrant groups. These exams stood for decades and remain, for many marginalized groups, an example of the oppressive intent of government-sponsored testing programs, that is, a striking and enduring example of unfairness in American society.

Tests Used to Restrict Educational and Vocational Access

In this section, we describe how tests, specifically intelligence tests, have been used to restrict, or prevent, both educational (as in the case of Black and Hispanic students) and vocational (as in the case of minority workers) access. The Stanford-Binet and Wechsler scales are currently (and have long been) the most widely used IQ tests in American schools (Stinnet, Harvey, & Oehler-Stinnet, 1994). The majority of these tests are administered for the purpose of determining whether a child should be considered for special education (Flynn, 1985, 2000). The use of standardized intelligence tests to inform educational placement decisions is rooted in the notion that differences in scores on these tests represent a true and fundamental difference in intellect (current and potential), and that any educational intervention should be designed to address these differences. Historically, however, the use of these tests increased the likelihood that low-income minority students would be evaluated as “low ability” and placed in remedial or special education courses, while high-income White students were more likely to be evaluated as “gifted” and placed into enrichment programs (Slavin, 1987; Darling-Hammond, 1986). Undoubtedly, these initial placements affected the courses students were allowed to take and the content to which they were exposed, thereby affecting their achievement (Lee & Bryk, 1988; Oakes, 1990). In fact, in 1969 the Association of Black Psychologists called for a moratorium on the administration of ability tests to all Black students. It charged that the tests:

1. Label Black children as uneducable
2. Place Black children in Special classes
3. Potentiate inferior education
4. Assign Black children to lower education tracks than whites
5. Deny Black children higher educational opportunities
6. Destroy positive intellectual growth and development of Black children (p. 67)

Williams (1971) described the use of intelligence tests to track Black students as the “intellectual genocide of Black children.” The use of these tests to assign minority students inappropriately and disproportionately to special education classes and/or remedial tracks has also been well documented through the U.S. court system. In 1967, the Board of Education for the District of Columbia was sued for using the results of a standardized aptitude test administered in early elementary school to place Black and low-income students in the lowest academic tracks, which were intended to steer students toward lower-paying blue-collar careers (Hobson v. Hansen, 1967). The Court’s ruling disallowed that practice. In 1970, the California State Board of Education was sued for testing Mexican-American students, whose native and primary language was Spanish, in English. Nine Mexican-American students had been classified as mentally retarded using the Wechsler Intelligence Scale for Children or the Stanford-Binet (Diana v. State Board of Education, 1970). Upon retesting by a bilingual test administrator, their scores increased by one standard deviation. The Court ruled that students must be tested in their native language.


In 1979, the state of California was sued again for using IQ tests to label young Black students disproportionately as “educable mentally retarded,” consequently placing them into classrooms that focused on personal hygiene and basic home and community living skills as opposed to academics (Larry P. v. Riles, 1979). As in the other cases, the courts found these intelligence tests to be discriminatory (i.e., unfair) in that they were designed for and normed primarily on White individuals. In addition, the Larry P. case made intelligence testing of African American students illegal in California, a ruling that stands to this day. The use of tests to restrict or deny vocational opportunities has also been well documented, primarily through the U.S. court system. In 1971, Black workers at the Duke Power Company sued the company for requiring employees who were transferring between departments to obtain a minimum score on two separate intelligence tests (Griggs v. Duke Power Co., 1971). The Supreme Court ruled that the tests were artificial and unnecessary and, most importantly, had a disparate impact on Black employees. The Court also noted that the subtle intent of the policy was simply to provide White employees with the best job opportunities. This case established the precedent for disparate impact in lawsuits involving racial discrimination. In 1976, five minority candidates for a position at the Golden Rule Insurance Company sued the Illinois Department of Insurance and the test developer (ETS), claiming that the insurance licensing examination both discriminated against Black test takers and assessed content unrelated to the job. The candidates noted that 77% of White applicants passed the exam, whereas only 52% of Black candidates did so.
The case was eventually settled out of court in 1984, with the test developer agreeing to change the exam. The resulting agreement, known as the Golden Rule Settlement, required test developers to take group differences in performance into account in item analysis and selection in an operational setting (Bond, 1987).
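To make the kind of group comparison at issue in the Golden Rule case concrete, the following minimal Python sketch computes (a) the ratio of the two pass rates cited by the candidates and (b) per-item differences in proportion correct between a reference group and a focal group. The function names and the tiny response matrices are hypothetical and are used only for illustration; this is not the procedure specified in the settlement, nor an operational item-analysis method.

```python
from typing import List


def pass_rate_ratio(focal_rate: float, reference_rate: float) -> float:
    """Ratio of the focal group's pass rate to the reference group's pass rate."""
    if not 0 < reference_rate <= 1 or not 0 <= focal_rate <= 1:
        raise ValueError("rates must be proportions, with a nonzero reference rate")
    return focal_rate / reference_rate


def proportion_correct(responses: List[List[int]]) -> List[float]:
    """Proportion of examinees answering each item correctly.

    `responses` holds one row per examinee and one 0/1 entry per item.
    """
    n_examinees = len(responses)
    n_items = len(responses[0])
    return [
        sum(row[item] for row in responses) / n_examinees
        for item in range(n_items)
    ]


def item_gaps(reference: List[List[int]], focal: List[List[int]]) -> List[float]:
    """Per-item difference in proportion correct (reference minus focal)."""
    ref_p = proportion_correct(reference)
    foc_p = proportion_correct(focal)
    return [r - f for r, f in zip(ref_p, foc_p)]


# Pass rates reported by the candidates: 77% (White) and 52% (Black).
ratio = pass_rate_ratio(focal_rate=0.52, reference_rate=0.77)
print(f"pass-rate ratio: {ratio:.3f}")  # prints "pass-rate ratio: 0.675"

# Hypothetical 0/1 item responses: four examinees per group, three items.
reference_group = [[1, 1, 0], [1, 0, 1], [1, 1, 1], [1, 1, 0]]
focal_group = [[1, 0, 0], [0, 1, 1], [1, 0, 0], [1, 0, 1]]
print(item_gaps(reference_group, focal_group))  # prints [0.25, 0.5, 0.0]
```

In this toy example the second item shows the largest gap and would be the first candidate for closer review. Raw gaps of this kind do not account for overall differences in proficiency between groups, so they are a starting point for review and not evidence of item bias in themselves.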

Tests Used to Promote Racist Agendas and Reify Stereotypes

Binet’s initial IQ test was developed with the needs of White students in mind, and so it reflected White Eurocentric values and ways of knowing. Critics have long questioned the use of these scales beyond White middle-class populations, if at all (see Banks, 1976; Baratz & Baratz, 1970; Gay, 1975; Gordon & Terrell, 1981; Stone, 1975 for examples), as they historically and repeatedly failed to account for the ways of knowing and thinking of minority groups. Gould (1996) provided the following example from Terman’s assessment: “An Indian who had come to town for the first time in his life saw a white man riding along the street. As the white man rode by, the Indian said, ‘The white man is lazy; he walks sitting down.’ What was the white man riding on that caused the Indian to say, ‘He walks sitting down’?” One could imagine any number of appropriate responses to this prompt, including a horse (wrong), a wheelchair (wrong), or a bicycle (the only allowable correct answer).


Few would argue that the above example from the turn of the 20th century represents a fair assessment of one’s ability. In fact, since the widespread use of Terman’s and Yerkes’ tests, there has been considerable cross-cultural research confirming that the construct of intelligence can differ fundamentally across cultures (Cole, Gay, Glick, & Sharp, 1971; Benson, 2003; Williams, 1971), with contemporary researchers pointing to examples of these fundamental differences in African (see Serpell, 2011, for example) and Eastern (see Nisbett, 2003) communities. Unfortunately, as both intelligence and achievement tests evolve, in part as a response to calls for a fairer system, they remain unable to account for differences in performance attributable to the socio-cultural context in which the examinee resides. Instead, both large-scale and classroom assessments continue to privilege the language and ways of expression of the dominant ruling class, which serves to reify or perpetuate stereotypes that are often racialized. For example, White mainstream English (WME) is the lingua franca of the vast majority of U.S. public schools and their tests. This artificial linguistic hierarchy, in which WME is treated as, or promoted to, standard English for the purposes of reading and writing assessment across the nation (see the Massachusetts Curriculum Framework for English Language Arts, 2017 or the Georgia Department of Education’s Standards of Excellence, 2015 for examples), unfairly privileges speakers of WME. Consequently, the highly developed, complex nature of a student’s Black English, for example, is considered improper or deficient, thereby reifying the stereotype of the illiterate, uneducable Black student. In 1994, Herrnstein and Murray famously argued, among many things, that the U.S. should do more to encourage high-IQ women to have babies, just as it did to encourage low-IQ women to have babies through its welfare policies.
Such a recommendation, of course, assumes that individuals living in poverty are poor because they are less able, not because of social and structural inequities. More importantly, however, it reifies the stereotype of the lazy welfare mom (a characteristic often associated with Black or Hispanic status) who seeks to drain the public’s resources through irresponsible behavior (i.e., having children).

Using Knowledge of the Past to Move Forward

In this chapter, we traced the history of conceptualizations of fairness in both common language and educational testing. We illustrated how concerns for testing fairness evolved over time, often at the behest of the American court system, and how tests have historically been used to perpetuate a façade of fairness while simultaneously masking their decidedly unfair intent (e.g., literacy tests). Despite what could be described as a concerted and ongoing effort to improve test fairness, one could still argue in 2021 that current assessments, both small and large scale, continue, however unintentionally, to reify stereotypes and preconceived notions about the achievement level and potential of minority


populations. Undoubtedly, the questions that are asked and the types of responses that are deemed appropriate often reflect the values and the language of the majority. Current advocates for both culturally responsive/sustaining pedagogy and assessment often point to other examples of insensitive assessment practices. For example, a 2019 10th-grade state examination asked students to write a journal entry from the perspective of a White woman who used derogatory language against a runaway slave who sought to hide in her home, referring to her as stupid and telling her she smells (Dwyer, 2019). This culturally insensitive approach to assessment extends beyond U.S. borders, as testing companies and the U.S. government (e.g., USAID) have invested large sums of money in developing literacy and numeracy assessments for students enrolled in schools all over the developing world with little regard for, or attention to, the socio-cultural and historical context in which these students live. Gould (1996), in referring to Yerkes’ Army Alpha and Beta exams, asked, “What was it like to be an illiterate black or foreign recruit, anxious and befuddled at the novel experience of taking an examination, never told why, or what would be made of the results: expulsion, the front lines?” (p. 234). A similar question can be asked today when young students in the developing world are required to take written exams or respond to exam questions on paper when their schools do not have, and have likely never had, access to paper, as is often the case. Moreover, given the nature of development work (i.e., Western non-profit organizations and governments providing funding and demanding objective evidence of a project’s success), students in these countries are often subjected to a series of these standardized paper-based assessments in an effort to measure their progress.
These unfamiliar, atypical assessments are, by their nature, administered by enumerators who are nearly always “outsiders” and whose purposes are unexplained or unclear to the students subjected to the assessments. Thus, just as Yerkes’ tests perpetuated and reinforced racial and national stereotypes with their biased and unsubstantiated results, so do assessments like the Early Grade Reading Assessment (EGRA) or Early Grade Mathematics Assessment (EGMA) (RTI, 2009a, 2009b), which are used nearly exclusively within the context of international development. For example, Randall and Garcia (2016) reported that use of the EGMA/EGRA assessments in the Democratic Republic of the Congo (DRC) to evaluate student progress was inappropriate in that context because students tended to be taught in, and to learn as, groups; choral response is used to assess student knowledge in schools. Rarely, if ever, are students expected to take individual assessments or respond one-on-one to an instructor. The authors wrote: “To use an assessment that requires students to sit individually with an unknown examiner is counter to their previous experience and is likely to result in less valid score interpretations” (p. 97). Indeed, the current administration practices of the EGMA/EGRA are eerily similar to those of Yerkes’ Alpha and Beta Exams.


Still, a common argument with a long history in the field of educational assessment is that differences in student-group performance can often be attributed to varying opportunities to learn or differential environments (e.g., Linn & Drasgow, 1987), such as lack of access to paper, computers, etc. Indeed, assessment proponents assert that because White students, on average, live in households with higher incomes and with parents who have more education, it makes sense that these students would earn higher test scores regardless of the assessment. More recently, Kempf (2016) wrote that standardized tests “fail to account for economic, cultural, other differences between students. The tests equate advantage with intelligence even when such advantage was gained outside of the classroom (e.g. a child who spends the weekend at museums has an advantage over a child who cannot afford educational extracurriculars).” We argue that such arguments oversimplify the core issues and, indeed, represent a deficit-model approach to thinking about students of color and other marginalized populations and their learning. That is, these arguments extend a legacy of unfairness in testing. Instead of making such arguments, we encourage test developers and researchers to examine their approaches to test development and to consider an approach that does not advantage certain (i.e., White-centered) ways of knowing and instead values and calls on the ways of knowing and meaning-making that exist in these communities. Williams (1971) wrote:

the differences noted by psychologists in intelligence testing, family, and social organizations and the studies of the Black community are not the result of pathology, faulty learning, or genetic inferiority. These differences are manifestations of a viable and well-delineated culture of the Black American. (p. 65)

One way to reduce this stream of unfairness in testing may be by incorporating the suggestions of Hood (1998a, 1998b), Kirova and Hennig (2013), Klenowski (2009), and others who focus on culturally responsive assessment. Montenegro and Jankowski (2017) defined culturally responsive assessment as assessment that is “mindful of the student populations the institution serves, using language that is appropriate for all students when developing learning outcomes, acknowledging students’ differences in the planning phases of an assessment effort, developing and/or using assessment tools that are appropriate for different students, and being intentional in using assessment results to improve learning for all students” (p. 10). Clearly, the idea behind culturally responsive assessment is to promote fairness in assessment. Although the 2014 Standards specifically speak to the consideration of linguistic minorities, they fail to require test developers to explicitly attend to the sociocultural identities of students throughout the assessment development process. We recommend that, when addressing issues of fairness, future versions of the Standards consider a culturally sustaining framework as the starting point. Although


this field is in its infancy, it offers hope for addressing what many see as the most troubling instances of unfairness in contemporary educational assessment. We hope this chapter has not only informed readers of historical views of fairness in testing but also helped acknowledge the historical majority culture-centric perspectives on assessment development and use that persist to this day. The proliferation of majority culture-centric perspectives directly restricts fairness. The history of fairness in testing illustrates gradual improvements over time in the understanding of fairness issues, and in avoiding unfair testing practices and the use of tests for nefarious purposes. We hope this chapter contributes to improvements in these areas by informing the testing community of current and historical instances of unfairness in testing and its potential detriment to society. We also hope the chapter inspires and encourages assessment professionals to reflect on this history in order to develop tests and testing practices designed to promote fairness for all examinees.

Note

1 The authors wish to thank Corey Palermo for his thoughtful review of an earlier draft of this chapter and express their gratitude to Asa Hilliard, who was an inspiring pioneer in pursuing fairness in testing.

References American Educational Research Association, American Psychological Association, & National Council on Measurement in Education (1985). Standards for educational and psychological testing. Washington, DC: American Psychological Association. American Educational Research Association, American Psychological Association, & National Council on Measurement in Education (1999). Standards for educational and psychological testing. Washington, DC: American Educational Research Association. American Educational Research Association, American Psychological Association, & National Council on Measurement in Education (2014). Standards for educational and psychological testing. Washington, DC: American Educational Research Association. American Psychological Association (1954). Technical recommendations for psychological tests and diagnostic techniques. Psychological Bulletin, 51, (2, supplement). American Psychological Association (1966). Standards for educational and psychological tests and manuals. Washington, DC: Author. American Psychological Association, American Educational Research Association, & National Council on Measurement in Education (1974). Standards for educational and psychological tests. Washington, DC: American Psychological Association. Banks, W. C. (1976). White preference in Blacks: A paradigm in search of a phenomenon. Psychological Bulletin, 83, 1179–1186. Baratz, S., & Baratz, J. (1970). Early childhood intervention: The social science base of institutional racism. Harvard Educational Review, 40, 29–60. Benson, E. (2003). Intelligence across cultures: Research in Africa, Asia, and Latin America is showing how culture and intelligence interact. Monitor on Psychology, 34(2), p.56.

132 Stephen G. Sireci and Jennifer Randall

Berman, A., Haertel, E., & Pellegrino, J. (2020). Comparability of Large-scale educational assessments: Issues and recommendations. Washington, DC: National Academy of Education Press. Binet, A., & Simon, T. (1905). Methodes nouvelles pour le diagnostic du niveau intellectuel desanormaux. L’Annee psychologique, 11, 245–336. Bond, L. (1987). The Golden Rule Settlement: A Minority Perspective. Educational Measurement: Issues and Practice, 6(2) 18–20. Buck v. Bell, 274 U.S. 200 (1927). Cole, M., Gay, J., Glick, J. A., & Sharp, D. W. (1971). The cultural context of learning and thinking. New York: Basic Books. Darling-Hammond, L. (1986). Equality and excellence: The educational status of Black Americans. New York: College Entrance Examination Board. Clauser, B. E., & Mazor, K. M. (1998). Using statistical procedures to identify differentially functioning test items. Educational Measurement: Issues and Practice, 17(1), 31–44. https:// doi.org/10.1111/j.1745-3992.1998.tb00619.x Debra P. v. Turlington, 474 F. Supp. 244 (M.D. Fla. 1979), aff’d in part, rev’d in part, 644 F.2nd 397 (5th Cir. 1981); on remand, 564 F.Supp. 177 (M.D. Fla. 1983), aff’d, 730 F.2d 1405 (11th Cir. 1984). Diana v. State Board of Education (1970) No. C-70, RFT, (N.D. Cal 1970). California Dwyer, D. (2019, April 4). State Pulls ‘Racially Troubling’ Question from 10th Grade MCAS Exam Following Complaints. Retrieved from https://www.boston.com/news/ education/2019/04/04/mcas-question-underground-railroad on June 14, 2020. Ferris State University (n.d.). Copy of the 1965 Alabama Literacy Test. https://www.ferris. edu/HTMLS/news/jimcrow/pdfs-docs/origins/al_literacy.pdf. Ferris State University (n.d.) Copy of the Louisiana Literacy Test. https://www.ferris.edu/ HTMLS/news/jimcrow/question/2012/pdfs-docs/literacytest.pdf.. Flynn, J. R. (1985). Weschler intelligence tests: Do we really have a criterion of mental retardation? American Journal of Mental Deficiency, 90, 236–244. Flynn, J. R. (2000). 
The hidden history of IQ and special education: Can the problems be solved? Psychology, Public Policy, and Law, 6, 191–198. Gay, G. (1975). Cultural differences important in education of Black children. Momentum, 2, 30–33. Georgia Department of Education (2015). Georgia Standards of Excellence, English Language Arts. Retrieved from https://www.georgiastandards.org/Georgia-Standards/Docum ents/ELA-Standards-Grades-9-12.pdf on June 14, 2020. Gordon, E. W., & Terrell, M. (1981). The changed social context of testing. American Psychologist, 86, 1167–1171. Gould, S. J. (1996). The mismeasure of man. New York: W.W. Norton & Co. Griggs v. Duke Power Company (1971). 401 US 424, Docket # 124. Helms, J. (2004). Fair and valid use of educational testing in grades K-12. In J. Wall & G. R. Walz (Eds.), Measuring up: Assessment issues for teachers, counselors, and administrators (pp. 81–88). Greensboro, NC: CAPS Press. Helms, J. (2006). Fairness is not validity or cultural bias in racial-group assessment: A quantitative perspective. American Psychologist, 61(8), 845–859. Hernstein, R. & Murray, C. (1994). The bell curve: Intelligence and class structure in American life. New York: Free Press. Hobson v. Hansen (1967). 260 F Supp. 402 (D. C. C. 1967) Hood, S. (1998a). Culturally responsive performance-based assessment: conceptual ad psychometric considerations. Journal of Negro Education, 67(3), 187–196.

Evolving Notions of Fairness in Testing 133






7 A CENTURY OF TESTING CONTROVERSIES¹

Rebecca Zwick

“Standardized testing is much in the news. New testing programs, test results, and criticisms of standardized testing all are regular fare in the popular media nowadays.” A timely statement, it seems, but it was made 40 years ago (Haney, 1981, p. 1021). In the years since the publication of Lee Cronbach’s (1975) article, “Five decades of public controversy over mental testing,” and Robert Linn’s (2001) “A century of standardized testing,” the debates about standardized educational tests have continued to rage.

In this chapter, I consider three fundamental controversies about the use of educational testing in the United States. The first involves challenges to the use of tests that purport to measure intelligence. These disputes peaked during the 1960s and 1970s, inflamed by the writings of Arthur Jensen (e.g., Jensen, 1969), and flared up again following the publication of The Bell Curve (Herrnstein & Murray, 1994). To a degree, opposition to intelligence testing and to the associated beliefs of its more extreme proponents continues to underlie current antagonism toward the use of standardized educational tests.

The second area of debate concerns the use of college admissions tests – the SAT, first administered in 1926, and the ACT, first administered in 1959. Opposition to admissions tests, mainly the SAT, has been detailed in books such as SAT Wars (Soares, 2012) and in numerous magazine and newspaper articles.

The third area of controversy involves resistance to accountability testing of students in grades K–12 (cf. Bennett, 2016). The politics associated with the opt-out movement is complex. Accountability testing is supported by many civil rights groups that have raised objections to other types of standardized testing, while criticism of accountability testing has emerged from previous supporters of tests (e.g., Ravitch, 2010) and from within the measurement community (e.g., Koretz, 2017). The use of student test scores to evaluate teachers using “value-added” methods has been particularly controversial.

Consideration of each of these three strands of controversy is followed by a brief discussion of testing opposition and its likely future.

Controversies About the Use of Tests as Measures of “Native Intelligence”

In 1904, Alfred Binet, director of the psychology laboratory at the Sorbonne, initiated the development of a set of tasks that were intended to assess children’s reasoning skills. Based on his earlier research, he had abandoned “medical” techniques for measuring intelligence, such as craniometry, and now sought to use a “psychological” approach. Binet rejected strict hereditarian theories of intelligence; he believed that intelligence could be increased through education. From his perspective, the main purpose of intelligence testing was to identify and help children who needed special education (Gould, 1996a). Binet’s tasks formed the basis for the Stanford-Binet intelligence test (Terman, 1916), which was subsequently interpreted and used in ways that Binet would not have endorsed.

Objections to intelligence testing are among the most deeply felt criticisms of standardized tests. It is not merely the tests themselves that have been the source of contention, but the conceptions of intelligence that are assumed to underlie them and the purposes for which these tests have been used. In particular, many early proponents of intelligence testing, along with some current ones, have subscribed to the belief that intelligence is unidimensional, highly heritable, and fixed rather than malleable. Intelligence test results have been used to disparage Black people and other ethnic groups, to argue against intermarriage between ethnic groups, and to support limits on immigration and childbearing for members of “undesirable” groups. In a biting essay, W. E. B. Du Bois suggested in 1920 that intelligence tests were the long-awaited fulfillment of “the dream of those who do not believe Negroes are human that their wish should find some scientific basis” (Du Bois, 1920).

In this section, key developments in intelligence testing are discussed, focusing on developments in the 1920s, the 1960s through 1970s, and the 1990s.

The 1920s: The Expanding Role of Intelligence Tests

In the early twentieth century, a group intelligence test intended for schoolchildren was introduced into the American public school system. The development effort, sponsored by the National Research Council, had been directed by five psychologists, including Lewis Terman, Robert Yerkes, and E. L. Thorndike (Whipple, 1921). The National Intelligence Tests were adapted from the Army Alpha tests, which had been administered to American military recruits during World War I and had themselves been based heavily on early intelligence tests.

The issuance of the National Intelligence Tests in their final form in 1920 served to introduce the idea of intelligence testing to a much wider public. In 1922, a remarkable debate about intelligence tests took place between Terman, the Stanford psychologist who had revised, expanded, and marketed the intelligence scale developed by Binet and his colleague Théodore Simon (McNutt, 2013), and journalist Walter Lippmann, who wrote a series of six articles on the subject in The New Republic. In the last of these, titled “A Future for the Tests,” Lippmann wrote:

The claim that Mr. Terman or anyone else is measuring hereditary intelligence has no more scientific foundation than a hundred other fads, vitamins and glands and amateur psychoanalysis and correspondence courses in will power … Gradually under the impact of criticism the claim will be abandoned that a device has been invented for measuring native intelligence. Suddenly it will dawn upon the testers that this is just another form of examination … It tests … an unanalyzed mixture of native capacity, acquired habits and stored-up knowledge, and no tester knows at any moment which factor he is testing. … Once the notion is abandoned that the tests reveal pure intelligence, specific tests for specific purposes can be worked out.
(Lippmann, 1922, p. 10)

Terman fired back with a clumsy sarcastic piece in which he mocked Lippmann for suggesting that environmental factors might affect IQ scores:

Does not Mr. Lippmann owe it to the world to abandon his role of critic and to enter this enchanting field of research? He may safely be assured that if he unravels the secret of turning low IQ’s into high ones, or even into moderately higher ones, his fame and fortune are made … I know of a certain modern Croesus who alone would probably be willing to start him off with ten or twenty million if he could only raise one particular little girl from about 60 to 70 to a paltry 100 or so.
(Terman, 1922, p. 119)

Although most scholars today would largely align with Lippmann’s perspective, many of Terman’s contemporaries shared his views about the nature of intelligence. One of these was Princeton psychology professor Carl Brigham. Based on his investigations of World War I military personnel, Brigham declared that immigrants, Black people, and Jews were inferior in intelligence and warned of the dangers of these groups to American society. Brigham’s associates included the prominent eugenicists Madison Grant and Charles W. Gould, whose contributions he acknowledged in his 1923 book, A Study of American Intelligence (Brigham, 1923, pp. xvii, 182–186).

Although Brigham later renounced his early research findings, it is for these that he is best known. He also had a significant role in the history of standardized tests: As an advisor to the College Board, Brigham chaired the commission charged with developing the exam that came to be named the Scholastic Aptitude Test (SAT). Like the questions on the National Intelligence Tests, the content of the first SAT, given in 1926, was influenced by the World War I Army Alpha tests. Although today’s SAT has little in common with the first one, its historical connection to intelligence tests and to Brigham’s racist proclamations continues to be cited in condemnations of the test (Camara, 2009).

The 1960s and 1970s: Arthur Jensen, the Larry P. Case, and California’s Ban on IQ Testing

During the 1960s, opposition to intelligence testing intensified, fueled by the work of Arthur Jensen, a professor of education at the University of California, Berkeley, whose role in the IQ test controversy is well chronicled by Cronbach (1975). Jensen’s work became known to the general public following the publication of an article in the Harvard Educational Review, “How much can we boost IQ and scholastic achievement?” which was excerpted in the New York Times and US News & World Report. Jensen expressed pessimism about the utility of the compensatory education programs of the day, suggesting that genetic factors might have limited their effectiveness. He pointed to “the possible importance of genetic factors in racial behavioral differences” (Jensen, 1969, p. 80) and hypothesized that “genetic factors may play a part” in performance differences between disadvantaged individuals and others (p. 82).

The outrage over Jensen’s pronouncements was further inflamed by some distorted accounts in the popular press. For example, an article in the New York Times Magazine attributed to Jensen the belief that there were fewer “intelligence genes” in the Black population than in the White population (Cronbach, 1975). Cronbach (1975) asserts that Jensen offered a favorable view of compensatory education in an earlier article (Jensen, 1967). In fact, while he does not dismiss such programs entirely, Jensen expresses substantial doubt about their effectiveness, stating that “[u]nfortunately … the evidence regarding the efficacy of any of these [compensatory] programmes is still meagre. It is insufficient merely to report gains in IQ, especially when this is based on retest with the same instrument or an equivalent form of the test, and when there is a high probability that much of the gain in test scores is the result of highly specific transfer from materials and training… that closely resemble those used in the test.” He goes on to describe a program in which students were essentially taught Stanford-Binet tasks, leading to unwarranted positive conclusions about improvements in intelligence (Jensen, 1967, p. 18). He raises similar doubts about compensatory education in his book, Straight Talk About Mental Tests (Jensen, 1981).

Jensen’s theories, particularly those about racial differences in IQ scores, were “very much in the background” of a key court case on IQ testing – the Larry P. case – and were studied by the judge in that case as part of his analysis of the history of intelligence tests (Rebell, 1989, p. 149).

Larry P. v. Riles (1972, 1979), a landmark court case involving the use of individual IQ tests, was initiated in 1971, when the parents of six African American schoolchildren filed a lawsuit in the Northern District of California against the San Francisco public schools and the state of California. The class of plaintiffs in the case was eventually expanded to include “all Black California school children who have been or may in the future be classified as mentally retarded on the basis of IQ tests.”² The plaintiffs challenged the use of IQ tests to place students in classes for children then known as “educable mentally retarded” (EMR), contending that this process was racially biased and was the reason for the disproportionate placement of Black students in these classes, which were thought to be stigmatizing and isolating (Rebell, 1989). The named defendant, Wilson Riles, was California’s Superintendent of Public Instruction.

Following a series of preliminary legal actions, the trial began in 1977. The U.S. Department of Justice moved to participate as amicus curiae, taking a position in support of the plaintiffs. The judge, Robert Peckham, ultimately ruled that the defendants intentionally discriminated against the plaintiffs, in violation of the equal protection clauses of both the state and federal constitutions, Title VI of the Civil Rights Act of 1964, and other statutes, by using standardized intelligence tests that are racially and culturally biased, have a discriminatory impact against black children, and have not been validated for the purpose of essentially permanent placements of black children into educationally dead-end, isolated, and stigmatizing classes for the so-called educable mentally retarded. Further, these federal laws have been violated by defendants’ general use of placement mechanisms that, taken together, have not been validated and result in a large over-representation of black children in the special E.M.R. classes.

The court issued a permanent injunction against the use of IQ tests for placing Black children in EMR classes in California without court approval. Court approval could be obtained only if a district could show that the IQ test to be used was not biased and did not have an adverse effect on minorities. The state was also ordered to eliminate the disproportionate representation of Black students in EMR classes. Following an appeal by the defendants, the Ninth Circuit Court upheld the testing injunction but vacated the earlier judgment concerning constitutional violations. A state-imposed moratorium on all IQ testing for EMR placements, regardless of the race of the child, which had been put into place in 1975, remained in effect.

A National Research Council panel that reviewed the use of intelligence tests for EMR placement during the era of the Larry P. case offered two “inescapable” conclusions:

First, the use of IQ scores as placement criteria will tend to maintain a disproportionate representation of minority children in EMR classes. … as long as IQ scores play a role in decision making, some disproportion will undoubtedly remain … The second conclusion follows from the discrepancy between actual EMR placement rates and the rates that would theoretically prevail if IQ alone was the placement criterion. Elements other than testing, which are part of the chain of referral, evaluation, and placement, must also be operating to reduce both the overall proportions of children placed in EMR classes and the disproportion between minority children and whites.
(Heller, Holtzman, & Messick, 1982, pp. 42–43)

Thus, the panel agreed that IQ tests played a role in the disproportionate representation of minority students in EMR classes, while pointing out that in general, placement decisions were evidently not being made solely on the basis of test scores.

The 1990s: The Bell Curve

IQ testing reentered the public discourse in dramatic fashion with the publication in 1994 of The Bell Curve: Intelligence and Class Structure in American Life. The authors, Richard Herrnstein and Charles Murray, based their hefty tome on a review of existing literature, as well as their own analyses of data from the National Longitudinal Survey of Youth. They emphasized the role of IQ in life outcomes and argued that, because of the role of genetics in determining intelligence, policy efforts to ameliorate inequities in society were likely to be ineffective, an argument similar to that advanced in Jensen’s 1969 article. Particularly controversial was their genetic explanation for differences in IQ test score distributions between Black and White test takers.

Herrnstein and Murray, who regarded IQ scores as essentially synonymous with intelligence, made the case that intelligence affects rates of high school dropout, poverty, and criminality; job performance; patterns of marriage, divorce, and child-bearing; and parenting skills. In a claim reminiscent of the writings of Brigham, they argued that higher fertility and a lower average age of childbirth among the “less intelligent,” along with “an immigrant population that is probably somewhat below the native-born average” implied that “something worth worrying about is happening to the cognitive capital of the country” (p. 364).

The Bell Curve, which rapidly became a best-seller, inspired hundreds of reviews and commentaries, including several book-length compilations. Critics disputed the book’s conceptualization of intelligence as a fixed unidimensional characteristic (e.g., Gould, 1996b; Hunt, 1997) and its claims about the degree to which intelligence is heritable (Daniels, Devlin, & Roeder, 1997) and malleable (Wahlsten, 1997). In addition, some reanalyses of the data used by Herrnstein and Murray led to conclusions that were vastly different from those presented in The Bell Curve about the role of intelligence in determining life outcomes and about the genetic nature of racial differences in IQ test score distributions (Resnick & Fienberg, 1997).

Controversies About College Admissions Testing

In the nineteenth century, the process of applying to American colleges and universities was quite different from today’s frenzied competition. In fact, until the late 1800s, institutions typically “waited for students to present themselves for examination by the faculty” (Wechsler, 1977, p. viii). In 1870, the University of Michigan pioneered a major policy change: Instead of examining individual applicants, it would inspect feeder high schools and waive entrance exams for graduates of approved schools. These students were given certificates and automatically admitted to the university. From today’s perspective, it seems ironic that the system was apparently disliked by Michigan students, who believed that “without entrance examinations, teachers and students had become more lax in their preparation” (Wechsler, 1977, pp. 25–26). Students also considered the policy unfair because talented candidates from poor high schools were still required to take entrance exams while mediocre students from good high schools were automatically admitted. The certificate system caught on, however, and had spread to most American colleges by 1915 (Wechsler, 1977).

In 1900, another major change in the world of college admissions occurred when the leaders of 12 colleges and universities formed the College Entrance Examination Board, which created a common system of essay examinations that were to be administered by member institutions and scored centrally by the Board. Yet another milestone event took place in 1926, when the Board first administered the multiple-choice Scholastic Aptitude Test, which had been developed under Brigham’s leadership. Today the test is owned by the College Board and administered by Educational Testing Service under a contract with the Board. The American College Testing program provided an alternative college entrance exam beginning in 1959. Today, the official name of that test is the ACT; the associated company is ACT, Inc. In 2019, about 2.2 million students took the SAT and 1.8 million took the ACT. (Some, of course, took both tests.)

In general, the SAT has been more heavily criticized than the ACT, perhaps because of its perceived historic links to IQ testing and, less directly, to the eugenics movement. A related distinction between the two exams is that the SAT was for many years less closely linked to classroom learning than the ACT, so that the SAT was often considered to be an ability test, while the ACT was viewed as an achievement test. In the U.S., achievement testing is generally viewed as less objectionable than tests of aptitude, which hark back to unsavory uses of intelligence tests. The difference between the types of content included in the SAT and ACT has diminished greatly over the years, and scores on the two tests are highly correlated.³

Because of the extensive opposition to admissions tests over the years, it is impossible to provide a comprehensive review. (For a recent discussion of SAT criticisms, along with rebuttals, see Sackett & Kuncel, 2018 as well as Croft & Beard, Ch. 2, this volume.) Historically, the main objections to admissions tests have been (1) that the test content is elitist and more likely to be familiar to White upper-class test takers than to other candidates, (2) that tests are not sufficiently tied to classroom learning, (3) that an aura of secrecy surrounds the testing enterprise, (4) that tests are coachable, undermining fairness, (5) that tests do not predict college performance or do so only because they are, in fact, measures of socioeconomic status (SES), and (6) that differences in average scores for ethnic and socioeconomic groups demonstrate the existence of test bias. Changes in test policy and test content have been somewhat effective in addressing the first three criticisms. The content of the SAT in particular has become much more closely aligned with classroom learning. Lawrence, Rigol, Van Essen, and Jackson (2004) provide a review of SAT content changes through 2001; further revisions of the test occurred in 2005 and in 2016. In addition, disclosure of selected test forms has become routine throughout the testing industry, and vast numbers of technical reports on college admissions tests are available at no cost on the College Board, ETS, and ACT websites. Regarding the role of coaching (point 4), testing organizations today acknowledge that scores can be improved to some degree through test preparation, and both the ACT and SAT offer free online preparation. It is true that test takers do not have equal access to the more intensive (and expensive) forms of test preparation; in particular, greater efforts to create free and low-cost in-person programs are needed. 
However, claims that admissions test scores are “meaningless … because the affluent can game the system by securing expensive tutoring,” as asserted by the plaintiffs in a recent lawsuit against the University of California (Hoover, 2019), should be viewed with skepticism. The most responsible and rigorous research on test coaching has shown that the typical score increases achieved are substantially smaller than claimed by test preparation companies (e.g., see Briggs, 2009). Moreover, alternative admissions criteria, such as high school grades, can also be raised through expensive tutoring, yet there are rarely calls for eliminating grades from consideration.

The assertion that admissions test scores do not predict college performance or do so only because of their association with SES (point 5) is seriously misleading. It is of course true that test scores do not provide perfect predictions, but the evidence that they improve upon the predictions of college grades that can be achieved using high school grades alone is extensive (see Zwick, 2019a for a brief review). In some cases, admissions test scores have been found to be more predictive than are high school grades (e.g., University of California Academic Senate, 2020). Admissions test scores have been found useful in predicting graduation as well.

A variant on the claim that scores do not predict performance has gained some traction – the belief that the only reason that scores are predictive of college
grades is that both the test scores and grades are influenced by SES (e.g., Geiser & Studley, 2002; Rothstein, 2004). These claims have been countered by recent research (see Sackett & Kuncel, 2018; Zwick, 2017, pp. 87–89). In particular, Sackett and his colleagues have conducted a series of large-scale studies examining the association between test scores and subsequent grades, controlling for SES. (More specifically, they obtained partial correlations and then aggregated them using meta-analytic techniques.) They concluded that “across all studies, the admissions test retains over 90% of its predictive power when SES is controlled” (Sackett & Kuncel, 2018, p. 27). The most significant objection to admissions tests involves the substantial differences among ethnic and socioeconomic groups in average scores (point 6). These recurrent patterns continue to be a source of concern that warrants ongoing scrutiny of test content and testing policy. It is important to recognize, however, that these performance differences are not unique to standardized tests and that score differences alone do not demonstrate bias. Grades and other academic measures show similar patterns (see Zwick, 2019a), pointing to the crucial need to improve educational opportunities for all students. Given the troubling childhood poverty rate, the variation of the poverty rate across ethnic groups (National Center for Education Statistics, 2014), and the vast inequities in school quality in the U.S., it is not surprising to find that average academic performance varies across ethnic and socioeconomic groups. Nevertheless, whether score differences are somehow intrinsic to the tests, as some believe, or are reflections of educational inequities, it is true that heavy reliance on test scores in admissions decisions is likely to lead to lower selection rates for students of color as well as applicants from low-income families. 
It is for this reason that some postsecondary institutions have adopted test-optional policies. In the sections below, I identify and review several waves of opposition to admissions testing – one beginning in the late 1970s, one beginning in the late 1990s, and the current wave, which includes the expansion of the test-optional movement.
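
The partial-correlation analysis mentioned above in connection with Sackett and colleagues (test scores and subsequent grades, controlling for SES) can be illustrated with a minimal sketch. This is not the authors' actual procedure, which the text summarizes only in a parenthetical. The correlation values, sample sizes, and function names are hypothetical, the sample-size-weighted mean is a simplified stand-in for a full meta-analytic aggregation, and treating "predictive power retained" as the ratio of the partial to the zero-order correlation is an assumption made for illustration.

```python
import math

def partial_corr(r_xy, r_xz, r_yz):
    """First-order partial correlation of X and Y, controlling for Z.

    Illustrative labels: X = test score, Y = college grades, Z = SES.
    """
    denom = math.sqrt((1.0 - r_xz ** 2) * (1.0 - r_yz ** 2))
    return (r_xy - r_xz * r_yz) / denom

def weighted_mean(values, weights):
    """Sample-size-weighted mean; a bare-bones stand-in for meta-analytic pooling."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

# Hypothetical studies: (r_test_grades, r_test_ses, r_grades_ses, n). Not real data.
studies = [
    (0.50, 0.30, 0.20, 1000),
    (0.45, 0.25, 0.15, 3000),
]

sizes = [s[3] for s in studies]
zero_order = weighted_mean([s[0] for s in studies], sizes)
partial = weighted_mean([partial_corr(s[0], s[1], s[2]) for s in studies], sizes)

# Assumed definition: share of the zero-order correlation that survives the SES control.
retained = partial / zero_order
print(f"zero-order r = {zero_order:.3f}, partial r = {partial:.3f}, retained = {retained:.1%}")
```

When the SES correlations are modest relative to the test-grade correlation, as in these made-up inputs, the partial correlation stays close to the zero-order value, which is the general pattern the quoted conclusion describes.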

From The Reign of ETS to The Case Against the SAT

At almost the same moment in 1980, two scathing critiques of the SAT and its sponsors emerged: The Reign of ETS: The Corporation that Makes Up Minds, a book authored by Allan Nairn and Associates and published by the Ralph Nader organization, and “The Scholastic Aptitude Test: A Critical Appraisal,” an article that appeared in the Harvard Educational Review. The Reign of ETS, according to the preface by consumer advocate Ralph Nader, was written to make Americans “aware of a testing system which encourages both teachers and students to prepare for such trivial results – results which predict poorly and perpetuate social biases” (p. xiv). The 554-page report had an impact on the testing industry even before its publication. Nader’s investigation of ETS, which began some three years earlier, influenced the decision of the Federal Trade Commission to investigate the test preparation industry and served to support the Truth in Testing Legislation pending in New York State, which mandated the disclosure of previously administered test items (Lemann, 1999).

The authors of the “critical appraisal” of the SAT, Warner V. Slack and Douglas Porter, asserted that, “contrary to the conclusions of ETS and the College Board, students can raise their scores [on the SAT] with training, the scores are not a good predictor of college performance, and the test does not measure ‘capacity’ to learn” (1980, p. 155). They also argued that students were “seriously wronged” by the misrepresentation of the SAT as a measure of aptitude. (The name of the test was later changed from Scholastic Aptitude Test to Scholastic Assessment Test, and finally to “SAT,” which is no longer considered an acronym.)

Several years later, two other testing critiques drew wide attention: the flashy None of the Above: Behind the Myth of Scholastic Aptitude (Owen, 1985) and the much more careful and reasoned treatise, The Case Against the SAT (Crouse & Trusheim, 1988).

The Late 1990s: Widespread Opposition and a University of California Surprise

After a period of relative dormancy, opposition to admissions testing flared up again in the late 1990s. Among the organizations that argued in favor of deemphasizing or eliminating admissions tests – or the SAT in particular – were the National Association for the Advancement of Colored People (NAACP; Blair, 1999), the National Research Council (Beatty, Greenwood, & Linn, 1999), and the Education Commission of the States, a nonprofit education policy organization (Selingo, 1999). In 1999, journalist Nicholas Lemann published The Big Test: The Secret History of the American Meritocracy, which provided a highly critical perspective on the SAT, ETS, and American concepts of merit. These developments were followed in 2001 by an event that set off a major national debate about admissions testing: In a surprise announcement, Richard Atkinson, president of the University of California, declared that the SAT should be abandoned as a criterion for admission to UC, saying it was “akin to an IQ test” and that reliance on test scores was “not compatible with the American view on how merit should be defined and opportunities distributed” (Atkinson, 2001). The widespread discussions that ensued were instrumental in prompting a substantial revision of the SAT, first administered in 2005. Standardized admissions tests, however, continued to be required for admission to the University of California.

The Admissions Testing Controversy Today

Although Atkinson’s speech did not, as some expected, end the use of admissions tests at UC, it may have been a harbinger of things to come. In late

146 Rebecca Zwick

2019, several advocacy organizations announced that they were suing the Regents of the University of California, stating, “Our individual clients are well-qualified students who have been subject to unlawful discrimination on the basis of race, disability, and wealth as a result of the requirement that applicants to the University of California submit either SAT or ACT scores in order to be considered for admission to any campus.” In addition, they claimed that the SAT and ACT are “descendants of discriminatory IQ tests” and their use “is an unlawful practice in violation of the California Constitution’s equal protection clause and numerous State anti-discrimination statutes, and it is barring our clients from equal access to higher education” (Hoover, 2019). In May 2020, Janet Napolitano, the president of the university, recommended that the SAT and ACT be eliminated as an admissions requirement at UC’s ten campuses through 2024, citing “the correlation of the SAT and the ACT to the socio-economic level of the student, and in some case[s], the ethnicity of the student” (Wilner, 2020). Her proposal to the UC Regents included a commitment to undertake a process to identify or create a new test that aligns with the content UC expects students should have mastered … If UC is unable to either modify or create a test that meets these criteria and can be available for applicants for fall 2025, UC will eliminate altogether the use of the ACT/SAT for freshman admissions. (University of California Office of the President, 2020) In a landmark decision, the UC Regents voted to support Napolitano’s proposal.
This outcome was somewhat surprising, given that, just three months earlier, an expert panel, the UC Academic Council Standardized Testing Task Force, had concluded that admissions test scores were more predictive of UC students’ college performance than were high school grades and that “UC does not use test scores in a way that prevents low-scoring students from disadvantaged groups from being admitted to UC as long as their applications show academic achievement or promise in other ways” (University of California Academic Senate, 2020, p. 20). UC’s move away from admissions tests was accelerated when the plaintiffs in the 2019 lawsuit, Smith v. Regents of the University of California, were granted a preliminary injunction in 2020 that prohibited even the consideration of admissions test scores at any UC campus (Hoover, 2020). The UC Regents’ decision can be seen as a product of ongoing opposition to standardized tests at the University, compounded by three external factors. First, the decision came at a time when the test-optional movement was already gaining steam. Typical test-optional programs allow college applicants to decide whether to submit admissions test scores. If scores are submitted, they are considered, but nonsubmitters are not at a disadvantage in the admissions process. The National Center for Fair & Open Testing, a testing watchdog organization better known as FairTest, maintains a list of schools that do not require test

scores. As of early 2021, the list includes more than 1,350 schools that are said to be test-optional for Fall 2022 admission.4 In a piece headlined “A record number of colleges drop SAT/ACT admissions requirement amid growing disenchantment with standardized tests” (Strauss, 2019), the Washington Post noted that test-optional schools now include the University of Chicago, Brandeis University, the University of Rochester, Wake Forest University, and Wesleyan University. A second development that may have influenced the UC decision is the “Varsity Blues” scandal, which broke in 2019. Federal prosecutors charged 50 people in a scheme to purchase admission to elite universities (Medina, Benner, & Taylor, 2019). In some of these cases, parents of college applicants arranged to have impostors take admissions tests in place of their children or to have test administrators alter their children’s test scores. The flamboyant fraud clearly increased public cynicism about admissions testing. The third factor that may well have affected the outcome of deliberations at the University of California is the outbreak of the coronavirus pandemic in early 2020. Because of testing center closures, many universities waived admissions testing requirements – in some cases, for multiple admissions cycles. The Tufts University administration, for example, took the opportunity to announce the initiation of “a three-year experiment with going test-optional,” saying that “while the COVID-19 pandemic’s impact on SAT and ACT testing opportunities contributed to the urgency of this policy, this decision aligns with our ongoing efforts … to promote maximum access to a Tufts education” (Jaschik, 2020). The coming decade promises to be an era of change in the world of admissions testing.

Controversies about Accountability Testing

Since the passage of the No Child Left Behind (NCLB) legislation, the 2002 update to the Elementary and Secondary Education Act (ESEA), the pressure for schools and teachers to be “accountable” for student achievement has been growing, leading to an increase in state-mandated testing and a greater emphasis on the role of student test scores in assessing educational quality. The most recent reauthorization of the ESEA, the Every Student Succeeds Act (ESSA), was approved in 2015. While granting states more flexibility than NCLB in designing their testing programs, ESSA represents a continued emphasis on test-based accountability, requiring that states establish college- and career-ready standards and maintain high expectations when assessing all students against those standards. … Tests must measure higher-order thinking skills, such as reasoning, analysis, complex problem solving, critical thinking, effective communication, and understanding of challenging content (U.S. Department of Education, 2017)

From a political perspective, the controversy surrounding K–12 accountability testing is quite different from the debates about IQ and admissions tests. Whereas civil rights organizations have often raised concerns about the use of intelligence tests and admissions tests, some of these same entities have voiced support of accountability testing in the schools. In 2015, 12 civil and human rights organizations, including The Leadership Conference on Civil and Human Rights, the NAACP, the National Council of La Raza, the National Disability Rights Network, and the National Urban League, issued a statement opposing the anti-testing movement. Pointing out that “we cannot fix what we cannot measure,” the statement’s authors noted that

Standardized tests, as “high stakes tests,” have been misused over time to deny opportunity and undermine the educational purpose of schools, actions we have never supported and will never condone. But the anti-testing efforts that appear to be growing in states across the nation, like in Colorado and New York, would sabotage important data and rob us of the right to know how our students are faring. When parents “opt out” of tests—even when out of protest for legitimate concerns—they’re not only making a choice for their own child, they’re inadvertently making a choice to undermine efforts to improve schools for every child. (Leadership Conference on Civil and Human Rights, 2015)

National surveys conducted in 2013 and 2015 showed that Black and Hispanic respondents had more favorable attitudes toward testing than White respondents and that support for testing tended to be lower among high-income respondents than among low-income respondents.
Similarly, studies of the opt-out movement, which encourages parents to decline to allow their children to take standardized tests in school, have shown that advocates of opting out are more likely to be White than members of other ethnic groups and are typically not economically disadvantaged (Bennett, 2016). Reasons cited for opting out include claims that tests and test preparation use too much instructional time, that the tests in question are irrelevant or too difficult, and that the tests place too much pressure on students and educators. In some districts, nonparticipation rates exceeding 50% for state-mandated tests have been reported (Bennett, 2016). An interesting illustration of the complex politics of accountability testing is the case of education historian and former assistant secretary of education Diane Ravitch, who describes herself as having “long been allied with conservative scholars and organizations” (Ravitch, 2010, p. 12). Originally an enthusiastic proponent of standards and accountability testing, she believed that “testing would shine a spotlight on low-performing schools” (2010, pp. 3–4). She gradually became concerned that testing had become an end in itself, declaring that the standards movement, which she initially viewed as merely a way to ensure

that students had mastered basic skills, had been “hijacked” by the testing movement. Ultimately, she concluded that

Our schools will not improve if we rely exclusively on tests as the means of deciding the fate of students, teachers, principals, and schools. When tests are the primary means of evaluation and accountability, everyone feels pressure to raise the scores, by hook or by crook. Some will cheat to get a reward or to avoid humiliation … Districts and states may require intensive test preparation that … borders on institutionalized cheating. Any test score gains that result solely from incentives are meaningless … and have nothing to do with real education. (2010, p. 227)

Using Student Test Scores to Evaluate Teacher Performance

An aspect of accountability testing that has drawn widespread concern, even among testing supporters, is the use of student test scores as a key component in the evaluation of educational programs. Both the general public and the education community have raised objections to this practice. Particularly questionable is the use of this information to assess teacher performance and to make decisions about teachers’ salaries, promotion, and tenure. Clearly, the distribution of test scores for a student group depends heavily on both the life experiences and educational background of the students and the resources available to the teacher. During the late 1990s, a statistical approach intended to take these factors into account, called value-added modeling (VAM), began to attract the attention of educators and the media. VAM is intended to adjust for students’ prior achievement and for factors such as the student’s family background and the quality of the school’s leadership when estimating a teacher’s contribution to students’ academic performance. Although models differ in their specifics, “a value-added estimate is meant to approximate the contribution of the school, teacher, or program to student performance” (Braun, Chudowsky, & Koenig, 2010, p. 5). A period of unrealistic enthusiasm about these models was followed by much-needed scrutiny. Some of VAM’s limitations are technical; others are practical or philosophical. Among the technical objections: The statistical models used in VAM may inadequately adjust for differences in student populations or school conditions or may overcorrect for these differences. In either case, results will be biased. Results may also be imprecise, particularly if the tests on which they are based are unreliable or the sample of students is small. 
In addition, interpretation of VAM results depends on assumptions about the scale properties of the tests that may not hold, such as equal-interval assumptions. Practical obstacles include the fact that the needed data may not be available for all teachers, that some teaching practices, such as team teaching, do not lend themselves to VAM analyses, and that VAM results lack transparency and are

difficult to understand. Finally, and most fundamentally, many educators and other members of the public believe that the applications of VAM result in too much emphasis on test scores to the exclusion of other measures of educational quality.
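The basic logic of a value-added estimate can be sketched in a few lines. The example below is a deliberately simplified illustration, not any operational VAM: it regresses current scores on prior scores across all students and then averages each teacher's residuals. The function names and the six student records are hypothetical, and the sketch omits the adjustments for family background and school conditions described above.

```python
from collections import defaultdict

def fit_line(x, y):
    """Ordinary least squares fit of y = a + b * x; returns (a, b)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    b = sxy / sxx
    return my - b * mx, b

def value_added(records):
    """Mean residual per teacher from a pooled prior-score regression.

    records: iterable of (teacher, prior_score, current_score) tuples.
    """
    prior = [r[1] for r in records]
    current = [r[2] for r in records]
    a, b = fit_line(prior, current)
    residuals = defaultdict(list)
    for teacher, p, c in records:
        # Residual: actual score minus the score predicted from prior achievement.
        residuals[teacher].append(c - (a + b * p))
    return {t: sum(v) / len(v) for t, v in residuals.items()}

# Hypothetical records, invented for illustration only.
records = [
    ("Teacher A", 40, 52), ("Teacher A", 60, 71), ("Teacher A", 80, 90),
    ("Teacher B", 40, 46), ("Teacher B", 60, 65), ("Teacher B", 80, 86),
]
estimates = value_added(records)
for teacher, estimate in sorted(estimates.items()):
    print(f"{teacher}: {estimate:+.2f}")
```

With only three students per teacher, each estimate rests on very little data, which mirrors the imprecision concern raised above for small samples. Anything that affects scores but is left out of the regression is absorbed into the teacher estimate, which is the source of the bias concern.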

Conclusions

In the 1949–50 ETS annual report, the company’s first president, Henry Chauncey, offered a pronouncement about the future of testing:

[W]ith respect to knowledge of individuals, the possibilities of constructive use are far greater than those of misuse. Educational and vocational guidance, personal and social adjustment most certainly should be greatly benefited. Life may have less mystery, but it will also have less disillusionment and disappointment. (ETS, 1949–1950, pp. 9–10)

Today, even the staunchest proponents of educational tests would be unlikely to subscribe to this sunny view. Test scores often serve to govern the allocation of scarce resources and, as such, often provoke both controversy and disappointment. During the last century, intelligence testing, admissions testing, and accountability testing have all met with strong opposition in the United States. Fortunately, these testing controversies have had some beneficial effects, leading to improvements in tests and testing policy. One notable change since the emergence of the National Intelligence Tests one hundred years ago is that the testing profession today has formal standards for test construction, administration, scoring, and reporting, the most prominent of which is the Standards for Educational and Psychological Testing. The current version of the Standards, which is jointly sponsored by the American Educational Research Association, the American Psychological Association, and the National Council on Measurement in Education, was published in 2014. It was preceded by five standards documents, the first of which was published by the American Psychological Association in 1954. In keeping with these standards, analyses of test content, test validity, and test fairness are conducted by all reputable test purveyors today. Technical reports and research reports are, in general, publicly available, and for many assessments, selected test forms are routinely released. 
It is assumed that testing companies will scrutinize performance differences across demographic groups and will give careful attention to the technical quality of tests and to the provision of appropriate modifications and accommodations for English learners, students with disabilities, and other special populations. Yet, despite these improvements, testing controversies will continue. As I have noted elsewhere, it is inevitable that “[e]very method for allocating prized goods (college seats, jobs) will be viewed as unfair by some individual or entity” (Zwick,

2019b, p. 38). A case in point is the lottery that was used to select a portion of the freshman class of 1970 at the University of Illinois, resulting in the rejection of more than 800 applicants. The public fury was so great that the lottery results were rescinded and all rejected students were accepted (Chicago Tribune, 1969), a clear demonstration that admissions decisions can be contentious even when test scores are explicitly excluded from the process. Debates about testing are never merely about the assessments per se, but about their role in determining status and allocating benefits in our society. In their preface to a special journal issue on testing, Berman, Feuer, and Pellegrino (2019, p. 11) note that “debates about fairness, opportunity, freedom, and the authority of government were all implicated both in the earliest and crudest written examinations and in every generation of machine-powered standardized tests that have followed.” Because tests continue to be used for such purposes as classifying students and assigning them to special classes, determining who is admitted to college, and evaluating schools and teachers, they will continue to be a lightning rod for criticism.

Notes

1 I am grateful for comments from Randy Bennett, Brent Bridgeman, Brian Clauser, and Tim Davey.
2 Detailed information on the case, including exact quotations, was obtained from https://law.justia.com/cases/federal/district-courts/FSupp/495/926/2007878/
3 Dorans and Moses (2008) found the correlation between the ACT composite score and the total SAT score (Reading + Math + Writing) to be .92.
4 https://www.fairtest.org/university/optional

References

American Educational Research Association, American Psychological Association, & National Council on Measurement in Education (2014). Standards for educational and psychological testing. Washington, DC: American Educational Research Association.
Atkinson, R. (2001, February). Standardized tests and access to American universities. The 2001 Robert H. Atwell Distinguished Lecture, delivered at the annual meeting of the American Council on Education, Washington, DC. Retrieved from http://works.bepress.com/richard_atkinson/36.
Beatty, A., Greenwood, M. R., & Linn, R. (Eds.). (1999). Myths and tradeoffs: The role of testing in undergraduate admissions. Washington, DC: National Academy Press.
Bennett, R. E. (2016). Opt out: An examination of the issues. (ETS Research Report 16-13). Princeton, NJ: Educational Testing Service.
Berman, A. I., Feuer, M. J., & Pellegrino, J. W. (2019). What use is educational assessment? Annals of the American Academy of Political and Social Science, 683, 8–20.
Blair, J. (1999, December 1). NAACP criticizes colleges’ use of SAT, ACT. Education Week. Retrieved from https://www.edweek.org/ew/articles/1999/12/01/14naacp.h19.html.
Braun, H., Chudowsky, N., & Koenig, J. (Eds.) (2010). Getting value out of value-added: Report of a workshop. Washington, DC: National Academies Press.
Briggs, D. C. (2009). Preparation for college admission exams. (NACAC Discussion Paper.) Arlington, VA: National Association for College Admission Counseling.
Brigham, C. C. (1923). A study of American intelligence. Princeton, NJ: Princeton University Press.
Camara, W. J. (2009). College admission testing: Myths and realities in an age of admissions hype. In R. P. Phelps (Ed.), Correcting fallacies about educational and psychological testing (pp. 147–180). Washington, DC: American Psychological Association.
Chicago Tribune (1969, December 13). U. of I. opens doors for 839 barred by admissions lottery. Chicago Tribune. Retrieved from http://archives.chicagotribune.com/1969/12/13/page/1/article/u-of-i-opens-doors-for-839-barred-by-admissions-lottery.
Cronbach, L. J. (1975). Five decades of public controversy over mental testing. American Psychologist, 30(1), 1–14.
Crouse, J., & Trusheim, D. (1988). The case against the SAT. Chicago, IL: University of Chicago Press.
Daniels, M., Devlin, B., & Roeder, K. (1997). Of genes and IQ. In B. Devlin, S. E. Fienberg, D. P. Resnick, & K. Roeder (Eds.), Intelligence, genes, and success: Scientists respond to The Bell Curve (pp. 45–70). New York, NY: Copernicus.
Dorans, N., & Moses, T. (2008). SAT/ACT concordance 2008: Which score pair to concord? (ETS Statistical Report No. SR 2008–2092). Princeton, NJ: Educational Testing Service.
Du Bois, W. E. B. (1920). Race intelligence. The Crisis, 20(3), 119.
Educational Testing Service (1949–1950). Educational Testing Service annual report to the Board of Trustees. Princeton, NJ: Author.
Geiser, S., & Studley, R. (2002). UC and the SAT: Predictive validity and differential impact of the SAT I and SAT II at the University of California. Educational Assessment, 8, 1–26.
Gould, S. J. (1996a). The mismeasure of man (revised & expanded edition). New York, NY: Norton.
Gould, S. J. (1996b). Critique of The Bell Curve. In S. J. Gould, The mismeasure of man (revised & expanded edition) (pp. 367–390). New York, NY: Norton.
Haney, W. (1981). Validity, vaudeville, and values: A short history of social concerns over standardized testing. American Psychologist, 36, 1021–1034.
Heller, K. A., Holtzman, W., & Messick, S. (Eds.) (1982). Placing children in special education: A strategy for equity. Washington, DC: National Academy Press.
Herrnstein, R. J., & Murray, C. (1994). The bell curve: Intelligence and class structure in American life. New York, NY: The Free Press.
Hoover, E. (2019, October 29). U. of California faces bias lawsuit over ACT/SAT requirement. Chronicle of Higher Education. Retrieved from https://www.chronicle.com/article/U-of-California-Faces-Bias/247434.
Hoover, E. (2020, September 1). Court bars U. of California from using ACT and SAT for fall-2021 admissions. Chronicle of Higher Education. Retrieved from https://www.chronicle.com/article/court-bars-u-of-california-from-using-act-and-sat-for-fall-2021-admissions.
Hunt, E. (1997). The concept and utility of intelligence. In B. Devlin, S. E. Fienberg, D. P. Resnick, & K. Roeder (Eds.), Intelligence, genes, and success: Scientists respond to The Bell Curve (pp. 157–176). New York, NY: Copernicus.
Jaschik, S. (2020, March 30). Coronavirus drives colleges to test optional. Inside Higher Education. Retrieved from https://www.insidehighered.com/admissions/article/2020/03/30/coronavirus-leads-many-colleges-including-some-are-competitive-go-test.
Jensen, A. R. (1967). The culturally disadvantaged: Psychological and educational aspects. Educational Research, 10, 4–20.
Jensen, A. R. (1969). How much can we boost IQ and scholastic achievement? Harvard Educational Review, 39, 1–123.
Jensen, A. R. (1981). Straight talk about mental tests. New York, NY: Free Press.
Koretz, D. (2017). The testing charade. Chicago, IL: University of Chicago Press.
Larry P. v. Riles, 343 F. Supp. 1306 (N.D. Cal. 1972)
Larry P. v. Riles, 495 F. Supp. 926 (N.D. Cal. 1979)
Lawrence, I., Rigol, G., Van Essen, T., & Jackson, C. (2004). A historical perspective on the content of the SAT. In R. Zwick (Ed.), Rethinking the SAT: The future of standardized testing in university admissions (pp. 57–74). New York, NY: RoutledgeFalmer.
Lemann, N. (1999). The big test: The secret history of the American meritocracy. New York, NY: Farrar, Straus and Giroux.
Linn, R. L. (2001). A century of standardized testing: Controversies and pendulum swings. Educational Assessment, 7, 29–38.
Lippmann, W. (1922, November 29). A future for the tests. The New Republic, 33, 9–10.
McNutt, S. (2013). A dangerous man: Lewis Terman and George Stoddard, their debates on intelligence testing, and the legacy of the Iowa Child Welfare Research Station. The Annals of Iowa, 72(1), 1–30.
Medina, J., Benner, K., & Taylor, K. (2019, March 12). Actresses, business leaders and other wealthy parents charged in U.S. college entry fraud. New York Times. Retrieved from https://www.nytimes.com/2019/03/12/us/college-admissions-cheating-scandal.html.
Nairn, A. and Associates (1980). The reign of ETS. (The Ralph Nader Report on the Educational Testing Service). Washington, DC: Ralph Nader.
National Center for Education Statistics (2014). Digest of Education Statistics. Retrieved from http://nces.ed.gov/programs/digest/2014menu_tables.asp.
Owen, D. (1985). None of the above: Behind the myth of scholastic aptitude. Boston, MA: Houghton Mifflin.
Ravitch, D. (2010). The death and life of the great American school system: How testing and choice are undermining education. New York, NY: Basic Books.
Rebell, M. A. (1989). Testing, public policy, and the courts. In B. R. Gifford (Ed.), Test policy and the politics of opportunity allocation: The workplace and the law (pp. 135–162). Boston, MA: Kluwer Academic.
Resnick, D. P., & Fienberg, S. E. (1997). Science, public policy, and The Bell Curve. In B. Devlin, S. E. Fienberg, D. P. Resnick, & K. Roeder (Eds.), Intelligence, genes, and success: Scientists respond to The Bell Curve (pp. 327–339). New York, NY: Copernicus.
Rothstein, J. M. (2004). College performance and the SAT. Journal of Econometrics, 121, 297–317.
Sackett, P. R., & Kuncel, N. R. (2018). Eight myths about standardized admissions testing. In J. Buckley, L. Letukas, & B. Wildavsky (Eds.), Testing, academic achievement, and the future of college admissions (pp. 13–39). Baltimore, MD: Johns Hopkins University Press.
Selingo, J. (1999, July 13). Colleges urged to find alternative to standardized tests for admissions. Chronicle of Higher Education. Retrieved from https://www.chronicle.com/article/Colleges-Urged-to-Find/113584.
Slack, W. V., & Porter, D. (1980). The Scholastic Aptitude Test: A critical appraisal. Harvard Educational Review, 50, 154–175.
Soares, J. A. (2012). SAT wars: The case for test-optional admissions. New York, NY: Teachers College Press.
Strauss, V. (2019, October 18). A record number of colleges drop SAT/ACT admissions requirement amid growing disenchantment with standardized tests. Washington Post. Retrieved from https://www.washingtonpost.com/education/2019/10/18/record-number-colleges-drop-satact-admissions-requirement-amid-growing-disenchantment-with-standardized-tests/.
Terman, L. M. (1916). The measurement of intelligence: An explanation of and a complete guide for the use of the Stanford revision and extension of the Binet-Simon Intelligence Scale. Boston, MA: Houghton Mifflin.
Terman, L. M. (1922, December 27). The great conspiracy. The New Republic, 33, 116–120.
The Leadership Conference on Civil & Human Rights (2015, May 15). Civil Rights Groups: “We Oppose Anti-Testing Efforts.” Retrieved from https://civilrights.org/2015/05/05/civil-rights-groups-we-oppose-anti-testing-efforts/.
University of California Academic Senate (2020, January). Report of the UC Academic Council Standardized Testing Task Force. Retrieved from https://senate.universityofcalifornia.edu/_files/committees/sttf/sttf-report.pdf.
University of California Office of the President (2020). Action item for the meeting of May 21, 2020: College entrance exam use in University of California undergraduate admissions. Retrieved from https://regents.universityofcalifornia.edu/regmeet/may20/b4.pdf.
U.S. Department of Education (2017, December 7). Every Student Succeeds Act assessments under Title I, part A & Title I, part B: Summary of final regulations. Retrieved from https://www2.ed.gov/policy/elsec/leg/essa/essaassessmentfactsheet1207.pdf.
Wahlsten, D. (1997). The malleability of intelligence is not constrained by heritability. In B. Devlin, S. E. Fienberg, D. P. Resnick, & K. Roeder (Eds.), Intelligence, genes, and success: Scientists respond to The Bell Curve (pp. 71–87). New York, NY: Copernicus.
Wechsler, H. S. (1977). The qualified student: A history of selective college admission in America. New York, NY: Wiley.
Whipple, G. M. (1921). The national intelligence tests. Journal of Educational Research, 4, 16–31.
Wilner, J. (2020, May 21). University of California eliminates SAT, ACT exams from admissions process throughout the system. San Jose Mercury News. Retrieved from https://www.mercurynews.com/2020/05/21/university-of-california-eliminates-sat-act-exams-from-admissions-process-throughout-the-system/.
Zwick, R. (2017). Who gets in? Strategies for fair and effective college admissions. Cambridge, MA: Harvard University Press.
Zwick, R. (2019a). Assessment in American higher education: The role of admissions tests. Annals of the American Academy of Political and Social Science, 683, 130–146.
Zwick, R. (2019b). Fairness in measurement and selection: Statistical, philosophical, and public perspectives. Educational Measurement: Issues and Practice, 38, 34–41.

PART II

Measurement Theory and Practice

8 A HISTORY OF CLASSICAL TEST THEORY Brian E. Clauser1

In his classic 1950 text, Harold Gulliksen summarized the history of classical test theory in a single sentence: “Nearly all the basic formulas that are particularly useful in test theory are found in Spearman’s early papers.” The focus of this chapter is on how those formulas came about and how they were developed in the first half of the 20th century to create classical test theory. I begin the chapter by describing the intellectual culture that produced the tools that Spearman would subsequently apply to test scores. I then summarize the contributions to the statistical understanding of test scores made by Spearman and his contemporaries. Finally, I briefly describe the work of Kelley, Kuder and Richardson, Cronbach, and Lord that led from Spearman’s formulas to coefficient alpha and the foundations of generalizability theory and item response theory.

Alfred Russel Wallace and Charles Darwin

In considering the historical events that led to Spearman’s work, one might say that a butterfly flapped its wings on an island in the Malay Archipelago and half a century later, half-way around the world, that flapping led to the creation of classical test theory. The butterfly in question was of the genus Ornithoptera, the bird-winged butterfly. During the eight years Alfred Wallace spent in the Malay Archipelago, he supported himself collecting natural history specimens to be sent home and sold in England. Of the more than 125,000 specimens he collected, 13,000 were Lepidoptera (Wallace, 1869); some of these were Ornithoptera. This particular genus was important because the distribution of its species across the archipelago gave Wallace one of the clues he needed to come to understand the origin of species (Brooks, 1984). By February 1858, the clues that Wallace had collected during years of field work coalesced into a theory, and between bouts

158 Brian E. Clauser

of fever he summarized that theory in a paper entitled On the Tendency of Varieties to Depart Indefinitely from the Original Type. The paper was an elegant summary of the theory of the origin of species through natural selection. Wallace mailed the paper back to England to Charles Darwin. At the point he received Wallace’s paper, Darwin had spent much of the previous two decades working on the same problem. He had outlined a theory almost identical to Wallace’s in two unpublished papers written years earlier and had collected massive amounts of material for a planned multivolume work in support of his theory. He had shared the unpublished papers with his closest friends and had been encouraged to publish to prevent someone else from taking priority (Quammen, 2006). The events following the arrival of Wallace’s letter are well known; Darwin was devastated and understood that he could no longer delay publication. Charles Lyell and Joseph Hooker, two preeminent members of the British scientific community—and two of Darwin’s closest friends—arranged to have Wallace’s paper published with two short pieces by Darwin (Brooks, 1984). More importantly, Darwin narrowed the scope of his planned project and in little over a year the first edition of On the Origin of Species was published (Darwin, 1859).2 There can be little question that Darwin’s book changed history. One individual who was profoundly impressed with the volume was Darwin’s first cousin, Francis Galton. It is through the impact that the book had on Galton that the groundwork was laid for the development of the statistical theory of test scores.

Francis Galton

Galton was the quintessential Victorian polymath. He created some of the first weather maps and discovered the phenomenon known as the anticyclone. He also was among the first to propose the use of fingerprints for identification; the three books he wrote on the topic were instrumental in the adoption of fingerprint technology (Galton, 1892; 1893; 1895). He explored a part of Africa unknown to Europeans years before Stanley uttered the phrase “Dr. Livingstone I presume” (Galton, 1853), and he wrote extensively on eugenics (Galton, 1909); in fact, he coined the term. Of all these efforts the contribution that arguably has had the most lasting impact is the statistical methodology he developed to examine the inheritance of individual characteristics. That work was directly motivated by his interest in evolution.3

At the urging of his father, Galton began his intellectual life by studying medicine at King’s College London. Before completing his medical training he moved to Cambridge to study mathematics and when his father’s death left him financially secure, he gave up the idea of a career in medicine. His efforts to earn an honors degree in mathematics led to physical and mental collapse. After his recovery and graduation, he used his leisure and financial resources to mount an expedition to Africa. The results won him public notoriety as well as recognition
in the geographical community. He published a popular book describing his travels (Galton, 1853) and gained a level of notoriety within the scientific community. But as Stigler (1986) commented, “Darwin’s theories opened an intellectual continent” (p. 267) for Galton to explore that offered much greater challenges than his travels in Africa; mapping that new continent occupied Galton for decades. Galton began his work on inheritance by noticing and documenting what may have been obvious for someone in the leisure and educated class of Victorian England: he documented the extent to which exceptional talent appeared to be shared across generations in the same family. The result of this effort was the publication of Hereditary Genius; in this volume, Galton documented instance after instance in which notable individuals have notable relatives in the same field of endeavor (Galton, 1869). He used his findings to argue that intellectual and personality characteristics can be inherited in the same way that physical characteristics are passed from parent to child. In retrospect, the book seems simplistic because Galton too quickly discounted environmental factors (nurture) and personal advantage as significant explanations for the results he reported. Seen within the context of the times, however, the work must have been more impressive. Darwin clearly overlooked the alternative explanations when he wrote to his cousin shortly after he had begun reading the book (Darwin, 1869): Down. | Beckenham | Kent. S.E. Dec. 23d My dear Galton I have only read about 50 pages of your Book (to the Judges) but I must exhale myself, else something will go wrong in my inside. I do not think I ever in all my life read anything more interesting & original. And how well & clearly you put every point! George [Darwin’s son], who has finished the Book, & who expressed himself just in the same terms, tells me the earlier chapters are nothing in interest to the latter ones! 
It will take me some time to get to these latter chapters, as it is read aloud to me by my wife, who is also much interested.— You have made a convert of an opponent in one sense, for I have always maintained that, excepting fools, men did not differ much in intellect, only in zeal & hard work; & I still think there is an eminently important difference. I congratulate you on producing what I am convinced will prove a memorable work.— I look forward with intense interest to each reading, but it sets me thinking so much that I find it very hard work; but that is wholly the fault of my brain & not of your beautifully clear style.— Yours most sincerely | Ch. Darwin

In addition to exploring the effects of heredity on shared genius within families, Galton also tried to understand the mechanism behind heredity; at this point Mendel’s work had been published in an obscure journal (Mendel, 1866), but was essentially unknown to the broader European scientific community. This led to an effort to experimentally test Darwin’s theory of inheritance, the theory of pangenesis, and to the development of Galton’s own theory. Both Darwin’s and Galton’s theories failed empirical evaluation, but in the end, the more general laws governing heredity in populations were of greater interest to Galton. Galton’s examination of the effects of heredity on populations led to the insight that the characteristics of subsequent generations regressed to the mean. In his autobiography, he summarized the important steps that took him from a conceptual solution to a practical problem related to heredity through the collection of supporting evidence and finally to regression (Galton, 1908). He begins:

The following question had been much in my mind. How is it possible for a population to remain alike in its features, as a whole, during many successive generations, if the average produce of each couple resemble their parents? Their children are not alike, but vary: therefore some would be taller, some shorter than their average height; so among the issue of a gigantic couple there would be usually some children more gigantic still. Conversely as to very small couples. But from what I could thus far find, parents had issue less exceptional than themselves. I was very desirous of ascertaining the facts of the case. After much consideration and many inquiries, I determined, in 1875, on experimenting with sweet peas, which were suggested to me both by Sir Joseph Hooker and by Mr. Darwin. … I procured a large number of seeds from the same bin, and selected seven weights, calling them K (the largest), L, M, N, O, P, and Q (the smallest), forming an arithmetic series. I persuaded friends living in various parts of the country, each to plant a set for me. … The result clearly proved Regression; the mean Filial deviation was only one-third that of the parental one, and the experiments all concurred. The formula that expresses the descent from one generation of a people to the next, showed that the generations would be identical if this kind of Regression was allowed for.4

Galton’s real interest, of course, was in humans, not sweet peas. The problem was that there were no readily available data sets. To remedy this situation, Galton arranged for his own data collection by publishing a slim volume, Record of Family Faculties (1884). The book provided instructions and 50 pages of forms to record detailed information about family members with the intention that the forms be completed and returned to Galton. He offered £500 in prizes for individuals providing useful information. At roughly the same time, Galton opened the anthropometric laboratory at the International Health Exhibition in London. He published a pamphlet (Galton, 1884) in which he described the purpose of the laboratory:
The object of the Anthropometric Laboratory is to show to the public the great simplicity of the instruments and methods by which the chief physical characteristics may be measured and recorded. The instruments at present in action deal with Keenness of sight; Colour Sense; Judgment of Eye; Hearing; Highest Audible Note; Breathing Power; Strength of Pull and Squeeze; Swiftness of Blow; Span of Arms; Height standing and sitting; and weight. Such is the ease of working the instruments that a person can be measured in these respects, and a card containing the results furnished to him, and a duplicate made and preserved for statistical purposes, at a total cost of 3d. (p. 3) The results of the Record of Family Faculties, the Anthropometric Laboratory, and other data collection efforts provided ample data for Galton to refine his understanding of regression, and in 1886 he contributed two papers on familial relationships to the Royal Society. These focused his attention on tables representing the relationship between deviations in measures such as stature of the adult child and the same measures for parents. Galton apparently appreciated that a clear relationship existed, but he was at a loss to quantify that relationship. Again, in his autobiography he wrote, “At length, one morning, while waiting at a roadside station near Ramsgate for a train, and poring over the diagram in my notebook, it struck me that the lines of equal frequency ran in concentric ellipses” (Galton, 1908). He returned to London and visited the Royal Institution with the intention of renewing his lost knowledge of conic sections. While he was so engaged, a chance conversation with the physicist James Dewar led to the suggestion that Dewar’s brother-in-law, the mathematician J. Hamilton Dickson, would likely be happy to work on Galton’s problem. Dickson viewed it as a simple problem and his solution was presented as an appendix to a paper by Galton. 
This collaboration resulted in an index of correlation. In the end, Galton had succeeded not only in making an important contribution to the understanding of heredity, but also in providing the mathematical tools to explore a wide range of phenomena. If Galton was aware of Spearman’s work applying correlational techniques to the study of intelligence, he likely would not have been surprised. In the introductory chapter of his most important work on inheritance, Natural Inheritance (Galton, 1889), he commented that it would be worth the reader’s time to understand the methodology he used in his study of heredity because of its wide application: It familiarizes us with the measurement of variability, and with the curious laws of chance that apply to a vast diversity of social subjects. This part of the inquiry may be said to run along a road on a high level, that affords wide views in unexpected directions, and from which easy descents may be made to totally different goals to those we have now to reach. (p. 3)

Galton’s great discoveries were the phenomenon of regression to the mean and the powerful analytic tool of correlation. Although the decision to attribute this discovery to Galton is somewhere between arbitrary and controversial (Pearson, 1896; 1920; Stigler, 1986), at a minimum he deserves credit for pointing out the broad usefulness of correlation in the social and biological sciences. His own words written in another context describe his role in the history of regression well, “It is a most common experience that what one inventor knew to be original, and believed to be new, had been invented independently by others many times before, but had never become established” (Galton, 1889, p. 33).

Galton’s concept of correlation—as put forth by Spearman and as it exists today—is critical to classical test theory, but the specifics of the coefficient Galton created have long since been replaced by the more mathematically tractable form developed by Karl Pearson. Galton’s less sophisticated approach was based on median and inter-quartile distance rather than mean and variance. Pearson’s work on the problem of correlation and his general support for Galton’s ideas clearly were an essential part of the foundation on which classical test theory was built. There is, however, another of Galton’s followers whose contributions to the story precede those of Pearson.
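In present-day notation, the regression Galton observed is often written as below. This is a modern textbook restatement offered for orientation, not Galton’s own formulation; the symbols (X for the parental measure, Y for the filial measure, r for their correlation) are introduced here for illustration, and the relationship is assumed to be linear.

```latex
\[
E[\,Y \mid X = x\,] - \mu_Y \;=\; r\,\frac{\sigma_Y}{\sigma_X}\,(x - \mu_X)
\]
```

If the two generations have equal spread, the expected filial deviation is simply r times the parental deviation. Under that assumption, the sweet pea result quoted earlier, a filial deviation one-third of the parental one, corresponds to r of about 1/3; because r cannot exceed 1 in absolute value, offspring are expected to lie no farther from the mean than their parents.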

Francis Ysidro Edgeworth

Francis Edgeworth contributed to the development of test theory in three ways. First, by helping to bring the importance of Galton’s work to Pearson’s attention he facilitated statistical developments that made contributions to numerous fields. Second, he directly developed the techniques put forth by Galton to produce procedures for use in statistical analysis of social science data. Finally, he wrote groundbreaking (if generally ignored) papers on the application of statistics to examinations; that work includes elements that presage classical test theory, generalizability theory, and Fisher’s development of analysis of variance.

Edgeworth attended Trinity College, Dublin and Oxford University where he studied ancient and modern languages; he subsequently read law and qualified to be a barrister. He appears, however, to have been self-taught in the areas in which he made his reputation: statistics and economics. His work in economics earned him a chair at Oxford. Galton and Edgeworth shared an interest in bringing mathematical analysis into the study of relationships in economics and the social sciences. The more senior Galton began corresponding with Edgeworth and encouraging his work early in Edgeworth’s career. Edgeworth, in turn, became one of Galton’s earliest disciples (Kendall, 1968; Stigler, 1986).

Clearly, Edgeworth was a man of considerable genius, creativity, and productivity, but for all his innovation, he seems to have had a relatively modest long-term influence in any of the fields to which he contributed. One reason that Edgeworth’s work may have had a limited influence is his evident love of the English language. He seems to have avoided simple statements whenever possible.
One review of his first book praised the content, but lamented the writing style, “The book is one of the most difficult to read which we ever came across…” (Jevons, 1881, p. 581). The reviewer goes on to say, “His style, if not obscure, is implicit, so that the reader is left to puzzle out every important sentence like an enigma” (p. 583). Some of Edgeworth’s style is captured in the opening lines of his 1888 paper on examinations: That examination is a very rough, yet not wholly inefficient test of merit is generally admitted. But I do not know that anyone has attempted to appreciate with any approach to precision the degree of accuracy or inaccuracy which is to be ascribed to the modern method of estimating proficiency by means of numerical marks. (p. 600) The then “modern method” was a system in which essays or short answer papers were scored by content experts. Edgeworth had participated in such scoring activities and had also been examined using these procedures. He brought to bear his knowledge of such examinations along with an understanding of the theory of errors as developed in other measurement settings. The result is an estimate of the extent to which chance influences examination scores. What is remarkable is the approach he took to produce that estimate. He introduced numerous conceptualizations that are now taken for granted. Perhaps the most fundamental is the idea that the mean of numerous flawed measures can be taken as the true value (i.e., true score). Using as an example a set of judgments about the merit of a Latin translation he writes: This central figure which is … assigned by the greatest number of equally competent judges, is to be regarded as the true value of the Latin prose; just as the true weight of a body is determined by taking the mean of several discrepant measurements. There is indeed this difference between the two species of measurement, that in the case of material weight we can verify the operation. 
We can appeal from kitchen scales to an atomic balance, and show that the mean of a great number of rough operations tends to coincide with the value determined by the more accurate method. But in intellectual ponderation we have no atomic balance. We must be content without verification to take the mean of a number of rough measurements as the true value. (Edgeworth, 1888; p. 601)

Edgeworth (1888) went on to describe error in terms of variability about this true value and he concluded that, “it is pretty certain” that for aggregate scores error “will fluctuate according to the normal law … figured as the ‘gensd’armes’ hat’” (p. 604). In short, the paper provided all the conceptual essentials of classical test theory with an observed score composed of a true score and a normally distributed error term, and with true score defined as the expected value across numerous replications of the measurement process. All that is missing is the inclusion of the correlation coefficient.

Edgeworth then proceeded to present the essentials of generalizability theory:

1. He described the total error impacting an examinee’s score as being made up of multiple facets including what we would now call task effects, rater effects, rater-by-task, and person-by-task interactions.
2. He conceptualized tests as being constructed through random sampling from a defined domain (the central assumption of generalizability theory), and he discussed effects that are not included in his model (hidden facets in current terminology) as well as conditions of testing that would violate the model.
3. He described the total error as the square root of the sum of the squared effects and notes that the impact of any effect will be reduced by the square root of the number of observations.
4. He went on to discuss designs that minimize certain effects without increasing the total amount of rater time required.
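The arithmetic Edgeworth described, combining effects as the square root of the sum of their squares and shrinking each effect by the square root of the number of observations, can be sketched in a few lines of Python. This is a modern restatement rather than Edgeworth’s notation: the function name, the facet labels, and the numeric values are hypothetical, and the sketch assumes the effects are independent so that their squared magnitudes add.

```python
import math

def total_error(effects, n_obs):
    """Combine independent error effects into one total error.

    effects: dict mapping a facet label to the error of a single
             observation of that facet (standard-deviation scale).
    n_obs:   dict mapping the same labels to the number of
             observations averaged over for that facet.
    """
    total_sq = 0.0
    for facet, sd in effects.items():
        # Averaging over n observations shrinks the effect by sqrt(n).
        reduced = sd / math.sqrt(n_obs[facet])
        total_sq += reduced ** 2
    # Total error is the square root of the sum of the squared effects.
    return math.sqrt(total_sq)

# Hypothetical magnitudes, chosen only to keep the arithmetic simple.
effects = {"rater": 6.0, "task": 8.0}
one_each = total_error(effects, {"rater": 1, "task": 1})   # sqrt(36 + 64)
four_each = total_error(effects, {"rater": 4, "task": 4})  # sqrt(9 + 16)
```

With these illustrative numbers, quadrupling the observations on every facet halves the total error, which is the trade-off that motivates looking for designs that spend rater time where it reduces error most.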

Because he acted as an evaluator it is not surprising that he began by examining error introduced by judges or raters. Having considered the contribution that raters make to measurement error, Edgeworth then examined the contribution of those facets of the measurement process that typically would be included in the residual term in generalizability analysis: the person-by-task interaction and other effects which typically are unaccounted for in the design such as the person-by-occasion effect and other higher-order interactions.

We have so far been endeavouring to estimate the error which is incurred in appreciating the actual work done by the candidate. We have now to evaluate the error which is committed in taking his answers as representative of his proficiency. (p. 614)

As if this were not enough for one paper, Edgeworth finally discussed the possibility of making statistical adjustments for raters. He suggested systematic or random assignment of papers to raters as a way of adjusting for differences in both the mean and distribution of individual raters. The suggestion that Edgeworth presented the basics of generalizability theory in this paper may seem like an exaggeration because generalizability theory typically is presented in the framework of analysis of variance and Fisher did not publish his ideas until decades after Edgeworth’s paper (e.g., Fisher, 1925). This reality only serves to make Edgeworth’s contribution more intriguing: he presented his own conception of analysis of variance! Again, this work has gone largely unrecognized. Stigler (1986) writes, “[His] solution was insightful and foreshadowed much of twentieth-century work on the analysis of variance, but it was and remains extremely hard to decipher” (p. 314). As with his other efforts, Edgeworth’s writing style seems to have gotten in the way of communication. For example, in describing the interpretation of a specific cell in a table—with one value printed vertically and one horizontally—he writes, “We come then to a cell in which there are two denizens, one erect and one recumbent” (Edgeworth, 1885, p. 635).

This one paper would seem to be a more than sufficient contribution to the creation of test theory, but there remains an additional contribution that deserves comment: In 1892 and 1893, Edgeworth published two papers on correlation, and coined the term “coefficient of correlation”. In Pearson’s 1896 paper presenting the product moment correlation, he credits Edgeworth’s work as an important predecessor of his own approach. Like so much of Edgeworth’s work, the papers are far from straightforward, but the second paper does present an approach very similar to that presented by Pearson. In 1920, Pearson published a history of correlation and by that time he had decided that Edgeworth had not made a meaningful contribution. In that same paper, he withdraws credit that he had originally given to Bravais for being the first to present the “fundamental theorems of correlation”. There has, over the years, been some controversy around who deserves credit for the invention of correlation and the correlation coefficient. It is beyond the scope of this chapter to resolve that controversy, but it is completely appropriate to introduce Pearson in the context of a controversy; he seems to have thrived on public disagreement with his contemporaries.
Ironically, in this case he is responsible for championing both sides of the controversy (albeit at different times).

Karl Pearson

Like Galton, Pearson studied mathematics at Cambridge. Unlike Galton, he excelled in those studies and achieved the rank of Third Wrangler—indicating third place in the competitive examination for undergraduates studying math(s) at Cambridge. He subsequently studied medieval and 16th century German literature and went on to read law at Lincoln’s Inn, but like Edgeworth he never practiced. He became professor at University College London and later held the Galton Chair of Eugenics. Pearson was an incredibly productive researcher and author; an annotated bibliography of his works contains 648 entries (Morant & Welch, 1939). Pearson made important contributions to statistics; he is responsible for the chi-squared test (1900) and for the product moment correlation coefficient (1896) which was critical for Spearman’s development of test theory. In that paper, he defined the product moment correlation using the following equation and argued for the superiority of this approach to measuring correlation:

\[ R = \frac{S(xy)}{n\,\sigma_1\sigma_2} \]

Pearson fought tirelessly against opposing views with no apparent interest in compromise or synthesis. He feuded with R. A. Fisher from early in Fisher’s career until his own death (Clauser, 2008).5 His last published paper was yet another attack on Fisher (Pearson, 1936). Fisher held up his end of the feud even longer, continuing his attack long after Pearson’s death in 1936 (Fisher, 1950, p. 29.302a). When Mendel’s work was rediscovered and brought to bear on the problem of Darwinian evolution, Pearson raged against Bateson and other supporters of the Mendelian model.

Whatever else may be said of Pearson, he was a committed follower of Galton. He spent much of his career promoting the power of correlation and regression to study a range of scientific endeavors, especially heredity and eugenics. Given this long-term devotion to Galton’s ideas, it is interesting to note that—unlike Edgeworth—he did not immediately see the importance of Galton’s ideas. Shortly after the publication of Natural Inheritance, Pearson prepared a lecture for the Men and Women’s Club. The 25 hand-written pages of this lecture that remain in the archive of University College London summarize the content of the book and provide comment on Pearson’s view of the book’s contribution (Pearson, 1889). The manuscript makes it clear that Pearson did not see the potential usefulness of Galton’s methodology.
Rather than encouraging his audience to apply the new methodology to a range of challenges, he comments: Personally I ought to say that there is, in my opinion, considerable danger in applying the methods of exact science to problems in descriptive science, whether they be problems of heredity or of political economy; the grace and logical accuracy of the mathematical processes are apt to so fascinate the descriptive scientist that he seeks for sociological hypotheses which fit his mathematical reasoning and this without first ascertaining whether the basis of his hypothesis is as broad as that human life to which his theory is to be applied. I write therefore as a very partial sympathizer with Galton’s methods. Pearson’s subsequent decades-long devotion to Galton’s ideas appears to have caused him to forget that he originally had a mixed response to Galton’s work. The following lines are from a speech Pearson delivered at a dinner in his honor in 1934: In 1889 [Galton] published his Natural Inheritance. In the introduction of that book he writes: This part of the inquiry may be said to run along a road on a high level, that affords wide views in unexpected directions, and from which easy descents may be made to totally different goals to those we have now to reach.

“Road on a high level”, “wide views in unexpected directions”, “easy descents to totally different goals”—here was a field for an adventurous roamer! I felt like a buccaneer of Drake’s days. … I interpreted that sentence of Galton to mean that there was a category broader than causation, namely correlation, of which causation was only the limit, and that his new conception of correlation brought psychology, anthropology, medicine and sociology in large parts into the field of mathematical treatment. It was Galton who first freed me from the prejudice that sound mathematics could only be applied to natural phenomena under a category of causation. Here for the first time was a possibility—I will not say a certainty of reaching knowledge—as valid as physical knowledge was then thought to be—in the field of living forms and above all in the field of human conduct. (Pearson, 1934; pp. 22–23) This same speech provides a vivid reminder of another part of Galton’s legacy that Pearson advanced, namely support for eugenics. Pearson continued, Buccaneer expeditions into many fields followed; fights took place on many seas, but whether we had right or wrong, whether we lost or won, we did produce some effect. The climax culminated in Galton’s preaching eugenics, and his foundation of the Eugenics Professorship. Did I say “culmination”? No that lies rather in the future, perhaps with Reichskanzler Hitler and his proposals to regenerate the German people. In Germany a vast experiment is in hand and some of you may live to see its results. If it fails it will not be for want of enthusiasm, but rather because the Germans are only just starting the study of mathematical statistics in the modern sense! (Pearson, 1934; p. 23)

Correlational Psychology in 1900

As mentioned previously, Spearman took Pearson’s work on correlation and applied it to test scores, in the process creating the foundation of classical test theory as we know it. Before moving on to describe Spearman’s contributions, it is useful to consider the state-of-the-art in correlational science as applied to psychology in the years immediately preceding Spearman’s publications. Spearman’s work can be viewed as groundbreaking, but it was not completely without precedent. He was not the first to use the correlation coefficient to examine the relationship between test scores; neither was he the first to push the mathematical logic of correlation beyond its application in quantifying the relationship between two observed variables.

Two papers from Psychological Review provide examples of prior application of correlation in psychology. The first is Clark Wissler’s PhD dissertation from Columbia University (Wissler, 1901). Lovie and Lovie (2010) speculate that this may have
been the first published study in psychology to use Pearson’s product moment correlation coefficient. Wissler studied Columbia undergraduates with the working hypothesis that different measures of intelligence would be correlated—reflecting an underlying general intelligence. He collected and correlated numerous measures of physical and mental proficiency including reaction time, accuracy in marking out As in a text, speed in marking out As in a text, accuracy in drawing a line, accuracy in bisecting a line, length and breadth of head, and class standing in numerous subjects. The results seem to have surprised Wissler—as well as other psychologists. The strength of these relationships was modest with the exception of those between class standing in different subjects; the correlation between class standing in Latin and Greek was .75. Wissler recognizes that “failure to correlate” may be “due to want of precision in the tests”, but no serious consideration is given to how reliability might be calculated or used to adjust the correlation.6 The second of the two papers from Psychological Review was coauthored by no less a figure than E. L. Thorndike (Aikens, Thorndike & Hubbell, 1902). The paper presents relationships between numerous measures of proficiency. The plan was to evaluate the relationship between proficiencies that might seem more closely linked than the physical and mental measures examined by Wissler. The motivation for the research is understandable, but the approach shows a remarkable lack of statistical sophistication. The “correlations” that are presented have no clear relationship to either the product moment approach presented by Pearson in 1896 or to the rank order approach that would be presented by Spearman (1904a) two years later. Instead, they used what might be described as a makeshift approximation. 
Further, the paper is devoid of any consideration of how to evaluate, eliminate, or even recognize the impact of systematic or random effects that might impact the correlation of interest. The simplicity of these correlational studies provides a frame of reference for understanding the magnitude of the contribution made by Spearman. Nonetheless, Spearman’s thinking about how to advance correlational science was preceded—and likely influenced—by an important paper from the statistical literature. Just a year after Pearson introduced the product moment correlation, Yule (1897) presented the formula for partial correlation. Correcting the correlation between two variables for the shared influence of a third is a problem that can be viewed as related to Spearman’s subsequent effort to correct correlations for the presence of error in the observed variables. At a minimum, Yule’s paper provides an example of how relatively simple algebraic manipulations can lead to important insights about the interpretation of correlations.
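The kind of algebraic manipulation at issue can be made concrete with the first-order partial correlation in its standard modern form, r12.3 = (r12 − r13·r23) / sqrt((1 − r13²)(1 − r23²)). The Python sketch below is an illustration rather than a transcription of Yule’s paper; the function name and the numeric correlations are hypothetical.

```python
import math

def partial_correlation(r12, r13, r23):
    """First-order partial correlation of variables 1 and 2, holding 3 fixed.

    Uses the standard modern formula; all inputs are ordinary
    pairwise correlations between the three variables.
    """
    denom = math.sqrt((1.0 - r13 ** 2) * (1.0 - r23 ** 2))
    if denom == 0.0:
        raise ValueError("undefined when r13 or r23 equals +1 or -1")
    return (r12 - r13 * r23) / denom

# Hypothetical values: two measures correlate 0.50 with each other
# and 0.60 each with a third variable.
adjusted = partial_correlation(0.50, 0.60, 0.60)

# If the observed correlation is exactly what the third variable alone
# would produce (0.5 * 0.8 = 0.4), nothing remains after adjustment.
fully_explained = partial_correlation(0.40, 0.50, 0.80)
```

In the first illustrative case the adjusted value is roughly 0.22, well below the raw 0.50, which shows how much of an observed relationship a shared third influence can account for.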

Charles Spearman Unlike many of the other figures discussed in this chapter, relatively little is known about Spearman’s life. The short autobiographical chapter that he wrote provides little detail (Spearman, 1930). Though there have been more recent biographical sketches that represent the result of substantial scholarly effort, these

papers do not provide much information that is not available from Spearman’s published works (e.g., Lovie & Lovie, 1996). Additionally, as was typical of scientific writing in the 19th and early 20th centuries, Spearman provided relatively few references in his publications. This makes it difficult to be sure of the influences on his work.

Spearman left school at 15. He had an interest in Indian philosophy as a young man, and he joined the army with the hopes of being stationed in India. He completed a two-year “staff college” while he was in the army, but not long after completing these studies (which likely would have put him in line for more rapid advancement in the army) he resigned his commission and moved to Leipzig to study psychology with Wundt. His studies were interrupted when he was recalled to the army with the outbreak of the Boer War, but eventually he returned to Leipzig and completed his doctorate. Like Pearson, he became a professor at University College London (Spearman, 1930a; Lovie & Lovie, 1996; Lovie & Lovie, 2010).

Spearman’s main interest was in understanding the nature of intelligence. Although his most enduring contribution to science may be the methodological tools he developed for psychometrics, this work was in the service of empirical evaluation of the nature of intelligence and specifically the theory that there is a general factor that underlies all measures of intelligence. Differences in reliability across tests impacted the observed correlations between test scores. Much of Spearman’s methodological work was focused on accounting for this effect.7 This chapter began with a reference to Gulliksen’s statement that, “Nearly all the basic formulas that are particularly useful in test theory are found in Spearman’s early papers”.
The earliest of those papers, The Proof and Measurement of Association between Two Things and “General Intelligence” Objectively Determined and Measured, were published in 1904 while Spearman was still working on his dissertation in Leipzig. Interestingly, those papers were unrelated to his dissertation, titled Die Normaltäuschungen in der Lagewahrnehmung, which focused on spatial localization, a topic more in line with Wundt’s program of research (Jensen, 1998; Tinker, 1932). The first paper presents variations on what has become known as the Spearman rank-order correlation and argues for the advantages of correlation based on ranks rather than measures. In this paper, Spearman also presented Yule’s formula for partial correlation, which he claims to have derived using an independent approach. Most importantly, he introduced the formula for estimating true score correlation (the correction for attenuation),

$$R_{1I} = \frac{r_{1I}}{\sqrt{r_{11}\, r_{II}}}.$$

To illustrate the importance of these tools for producing interpretable correlations, Spearman discusses a paper by Pearson in which Pearson concludes that

“the mental characteristics in man are inherited in precisely the same manner as the physical” (pp. 97–98). Spearman points out that if the observed correlations between relatives are similar (which Pearson reported), the actual relationship must be stronger for the psychological characteristics both because they are measured less precisely and because they are more likely to be influenced by the home environment than are physical characteristics such as eye color and head size. In providing the example, Spearman states that, “it is no longer possible to hold up even the Galton-Pearson school as a model to be imitated” (p. 97). This became the opening salvo leading to yet another of Pearson’s numerous feuds. This one continued for nearly a quarter of a century. Pearson’s immediate response was to republish the lecture Spearman referenced with an addendum responding to Spearman’s criticism (Pearson, 1904). The addendum attacks Spearman’s correction for attenuation because it is capable of producing results that exceed unity and calls upon Spearman to provide algebraic proofs of the formulae. Spearman responded to Pearson in 1907 with a Demonstration of Formulae for True Measurement of Correlation. That same year, Pearson published a paper on methods for estimating correlation and again attacked Spearman’s work. Pearson presents less computationally intensive approximations to the product moment formula as well as approaches for correlation based on ranks. In the 37-page report, Spearman’s name appears 37 times. This creates the impression that the real motivation of the publication may have had more to do with attacking Spearman than advancing the practice of correlational science. It should be remembered in this regard that Pearson was editor of Biometrika and the Drapers’ Company Research Memoir series and so could attack his opponents in print without editorial interference. That said, it should be recognized that some of Pearson’s criticisms were correct. 
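Pearson’s objection that the correction can exceed unity is easy to see numerically. The following is a minimal Python sketch, not taken from Spearman or Pearson: the function name and all numeric values are hypothetical, chosen only to illustrate the correction for attenuation (the observed correlation divided by the square root of the product of the two reliabilities).

```python
import math


def correct_for_attenuation(r_observed, rel_x, rel_y):
    """Estimate the true-score correlation from an observed correlation.

    r_observed: observed correlation between two measures
    rel_x, rel_y: reliability of each measure (hypothetical inputs)
    """
    if not (0 < rel_x <= 1 and 0 < rel_y <= 1):
        raise ValueError("reliabilities must lie in (0, 1]")
    return r_observed / math.sqrt(rel_x * rel_y)


# Illustrative values only: a modest observed correlation between
# two imperfectly reliable measures.
typical = correct_for_attenuation(0.40, 0.64, 0.81)

# With low reliabilities and a relatively high observed correlation,
# the corrected value exceeds 1, which is the behavior Pearson criticized.
over_unity = correct_for_attenuation(0.60, 0.50, 0.60)

print(round(typical, 4), round(over_unity, 4))
```

In the second call the estimated reliabilities are too low to be compatible with the observed correlation, which can happen when the reliabilities are underestimated or when the errors are not independent.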
The last of Spearman’s important contributions to classical test theory, Correlation Calculated from Faulty Data, was published in 1910. It again responded to Pearson’s criticisms and additionally presented the Spearman-Brown formula,

$$R_{KK} = \frac{K r_{11}}{1 + (K - 1)\, r_{11}}.$$

This formula, which estimates the impact on reliability associated with increasing (or decreasing) the length of a test, may be his most important contribution to test theory. The special case associated with doubling the length of a test provides a basis for estimating split-half reliability. This shifted the focus from test-retest to split-half reliability, a change in perspective that ultimately led to coefficient alpha. Immediately following Spearman’s 1910 paper in the British Journal of Psychology is a paper by William Brown (1910a) which presents a second proof for the same

formula. Spearman’s and Brown’s names have been linked as though they were collaborators, but in fact a substantial part of Brown’s recently completed dissertation was an attack on Spearman’s work. The dissertation carefully documented Pearson’s previous criticisms of Spearman so that even without acknowledgment it would be evident that Pearson (not Spearman) influenced the work. The acknowledgement, “Professor Pearson has very kindly read the entire thesis in proof, and made several useful suggestions” (p. 1) makes the influence unambiguous (Brown, 1910b). The full dissertation was published privately (1910b) and (in modified form) as a book (1911). The paper in the British Journal is a shortened form of the dissertation. Brown repeats Pearson’s criticisms of the rank order correlation. He then attacks Spearman’s correction for unreliability on the grounds that it assumes that the errors are uncorrelated. Interestingly, much of Spearman’s 1910 paper is a response to these criticisms from Pearson—reiterated by Brown. He focuses on the circumstances in which the rank order correlation makes sense and reminds the reader that the simplified formula for that correlation which he presented was meant as a rough estimate that dramatically reduced computation and so made it useful for practitioners. He then presented designs for data collection that reduced systematic effects and supported the use of his correction formula. Spearman concludes, “On the whole, if we eliminate these misapprehensions and oversights, there seems to be no serious difference of opinion on all these points between Pearson and myself” (p. 288). This would seem to represent Spearman extending an olive branch. It does not seem to have been accepted by Pearson; he continued his attacks for years to come (e.g., Pearson & Moul, 1927). That said, there was one area of agreement between the two rivals: their support of Galton’s most sinister idea. 
The following quote concisely captures Spearman’s view of eugenics: “One can conceive the establishment of a minimum index (of general intelligence) to qualify for parliamentary vote, and above all, for the right to have offspring” (Hart & Spearman, 1912; p. 79).

Developments after Spearman

Despite Pearson’s criticism, Spearman’s ideas became widely accepted, although broad adoption seems to have taken several years. One of the earliest researchers to adopt Spearman’s methods was A. R. Abelson, also at University College London, who published an extensive study on the development of a battery of tests. In this paper, Abelson used Spearman’s methodology and provided what may be the first example of the use of reliability coefficients to determine the number of items to include on a test form (Abelson, 1911). He also provided a theoretical definition of true score that went beyond the practical conceptualization provided by Edgeworth; Abelson wrote that “perfectly true measurements…would be given by an infinite number of such tests pooled together” (p. 302).

By the mid-1920s, a number of researchers had published empirical efforts to evaluate the accuracy of the Spearman-Brown formula. Using a variety of different tests, the researchers compared the predicted reliabilities to actual results produced by changing the test length: Holzinger and Clayton (1925) did so using the Otis Self-administering Test of Mental Ability, Ruch, Ackerson, and Jackson (1926) used spelling words, and Wood (1926) used achievement tests. In each case, the results demonstrated that the change in reliability as a function of the number of items could be accurately predicted using the formula. Remmers, Shock, and Kelly (1927) extended these findings by applying the Spearman-Brown formula to rating scales and showing that the formula predicted the change in reliability as a function of the number of raters completing the scale.
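The kind of prediction these studies checked can be sketched in a few lines of Python. This is an illustration only: the function names and the reliability values are hypothetical and are not drawn from any of the studies cited above. The second function simply inverts the Spearman-Brown formula to ask how much longer a test would need to be to reach a target reliability, in the spirit of Abelson’s use of reliability to choose test length.

```python
def spearman_brown(r11, k):
    """Predicted reliability when test length is multiplied by k.

    r11: reliability of the original test
    k:   length factor (2 doubles the test, 0.5 halves it)
    """
    return k * r11 / (1 + (k - 1) * r11)


def length_factor_needed(r11, target):
    """Length factor k at which the predicted reliability equals target.

    Obtained by solving the Spearman-Brown formula for k.
    """
    return target * (1 - r11) / (r11 * (1 - target))


# Hypothetical example values.
doubled = spearman_brown(0.60, 2)          # doubling a test with reliability .60
halved = spearman_brown(0.75, 0.5)         # halving the lengthened test again
k_needed = length_factor_needed(0.60, 0.90)

print(round(doubled, 4), round(halved, 4), round(k_needed, 4))
```

Doubling and then halving returns the original reliability, which is a quick consistency check on the formula rather than empirical evidence for it; the studies above compared such predictions against reliabilities observed on real tests of different lengths.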

Truman Kelley

The next major contributor to the story of test theory was Truman Kelley. Kelley was born in 1884. He completed his doctorate under E. L. Thorndike at Columbia and subsequently taught at Stanford and Harvard. In 1922–23, Kelley spent a sabbatical year with Pearson at the Galton Biometric Laboratory. In this respect, Kelley is part of a clear intellectual lineage leading from Galton through Pearson and Spearman. He also represents a kind of break from that tradition; each of these earlier thinkers developed methodology to answer what they viewed as important scientific questions. For Galton and Pearson the questions were about heredity; for Spearman they were about the nature of intelligence. Kelley, by contrast, appears to differ from the others in that he was motivated by a desire to improve the technology of assessment; he therefore might more aptly be described as a statistician or psychometrician than a psychologist. In this context, Lee Cronbach referred to Kelley as an “obsessive algebraist and statistician…who had no motive save methodologic” (Lee Cronbach, personal communication, December 22, 2000). Kelley’s single major contribution to classical test theory may have been the regression estimate of an examinee’s true score:

$$\bar{X}_{\infty} = r_{1I} X_1 + (1 - r_{1I}) M_1.$$

In this equation taken from Kelley (1923), the left-hand term represents the estimated true score, $r_{1I}$ represents the reliability of the test score, $X_1$ is the observed score, and $M_1$ is the population mean. The formula is practically useful; it is also historically important because it represents an early application of a Bayesian framework to educational measurement (see Levy & Mislevy, Ch. 13, this volume). Kelley was also responsible for formalizing test theory. His published papers provide a range of results. For example, in a 1924 paper he shows that if we have two forms of a test such that $x_1 = a + e_1$ and $x_2 = a + e_2$, where $x$ is the observed score, $a$ is the true score, and $e$ is a “chance factor”8,

$$r_{12} = \frac{\sigma_a^2}{\sigma_x^2}.$$

In a 1925 paper he shows the relationship between the reliability of a measure and the correlation between the measure and the related true score: $r_{1\infty} = \sqrt{r_{1I}}$ or equivalently $r_{1I} = r_{1\infty}^2$.

In addition to numerous papers presenting individual results, Kelley published what may be the first book devoted to mathematical theories of test scores, Interpretation of Educational Measurements (1927). Kelley’s work clearly had a substantial impact on subsequent texts on test theory. His work is heavily cited in both Thurstone’s The Reliability and Validity of Tests (1931) and Gulliksen’s Theory of Mental Tests (1950).

Finally, like Spearman, Kelley went beyond the range of classical test theory and made contributions to factor analysis. This work led to a serious falling out between Kelley and Spearman. Letters in the Harvard archive document the private disagreement the two had over Spearman’s “general factor”. This became more public after the publication of Kelley’s (1928) first major work on factor analysis, Crossroads in the Mind of Man. In May of 1930 Kelley wrote to Spearman, “I have just read with little satisfaction your review of my Crossroads in the Mind of Man” (Kelley, 1930).9 Spearman responded, “As you read ‘with little satisfaction’ my review of the Crossroads, so did I get little from your handling (both what it said and what it did not say) of the Abilities10, and so probably I shall be dissatisfied with what you reply, and then You again with my rejoinder” (Spearman, 1930b).

As with the disagreement between Spearman and Pearson, this disagreement between Kelley and Spearman stopped short of creating a rift in their mutual support for eugenics. Kelley was a committed eugenicist. In 1923 he was elected to be on an advisory council to the Eugenics Committee of the United States of America.
The council’s work was the promotion of eugenics “through legislation and otherwise” with emphasis on “enacting a selective immigration law” and “securing the segregation of certain classes, such as the criminal defective” (Fisher, 1923). Kelley’s commitment to eugenics continued well beyond the 1920s; his 1961 will created a eugenics trust, conditioning payments to his two sons on a eugenics assessment they and their prospective brides would take to measure their health, intelligence, and character. The trust also provided for additional payments for each child the couple produced (Lombardo, 2014).

G. F. Kuder and M. W. Richardson

In 1937 Kuder and Richardson published their landmark paper on reliability. Both men were psychologists and psychometricians. Kuder was the founding

editor of Educational and Psychological Measurement; Richardson was a founder of the Psychometric Society and early editor of Psychometrika. Their paper apparently arose from work they had each been carrying out independently. It begins with a theoretical consideration of the meaning of reliability and progresses through a series of manipulations and the imposition of assumptions to produce the K-R20 and K-R21 (the 20th and 21st equations presented in the paper). The first of these represents a generalized equation for reliability and the latter a form that the authors state can be calculated in two minutes using statistics that are likely to be readily available to the test developer.

Kuder and Richardson (1937) reject test-retest reliability in favor of split-half methods. They note, however, that split-half methods will depend on the specific split used, building on the Spearman-Brown approach. They then present a general equation that represents the correlation between two equivalent forms of a test. If a, b, … n and A, B, … N are corresponding items in two hypothetical forms of a test, the tests are equivalent if the items in each pair (a and A, b and B, etc.) are interchangeable in that the difficulties and the inter-item correlations for the items in one form are the same as those for the items in the other form. Based on these assumptions they present the K-R3:

$$r_{tt} = \frac{\sigma_t^2 - \sum_{1}^{n} pq + \sum_{1}^{n} r_{ii}\, pq}{\sigma_t^2}.$$

In this equation, $r_{tt}$ is the reliability of interest, $\sigma_t^2$ is the variance of the test, $p$ is the difficulty for a specific item, and $q$ is $1 - p$. They recognize that this formulation is limited because $r_{ii}$ cannot be readily determined. Seventeen equations later they present the K-R20, derived from K-R3 with the assumption that all inter-correlations are equal:

$$r_{tt} = \frac{n}{n-1} \cdot \frac{\sigma_t^2 - n\,\overline{pq}}{\sigma_t^2},$$

where $\overline{pq}$ is the mean of the item products $pq$, so that $n\,\overline{pq} = \sum pq$. Finally, by assuming that all item difficulties are equal they produce the K-R21,

$$r_{tt} = \frac{n}{n-1} \cdot \frac{\sigma_t^2 - n\,\bar{p}\,\bar{q}}{\sigma_t^2}.$$

The Kuder and Richardson (1937) paper was important for both practical and theoretical reasons. Practically, it provided an approach to estimating the reliability of a test that was computationally simple. In the days when calculations were carried out by hand, with a slide rule, or with a cumbersome mechanical calculator, this level of computational simplicity was of considerable value. Theoretically, they decisively elaborated the understanding of reliability estimation and in the process made variations on split-half reliability the standard theoretical framework. Their work was, however, extended by two important contributors whose work came later; Lee Cronbach and Fredric Lord require mention both

because of their contributions to classical test theory and the understanding of reliability and because they represent important links to measurement models that go beyond classical test theory.
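To make the computational contrast between the two Kuder-Richardson coefficients concrete, here is a minimal Python sketch. The data set and function names are hypothetical. It assumes variances are computed with N in the denominator (population variances), which is a choice made for this illustration. K-R20 needs the full matrix of item scores, whereas K-R21 needs only the number of items and the mean and variance of the total scores.

```python
from statistics import mean, pvariance


def kr20(scores):
    """K-R20 from a matrix of 0/1 item scores (one row per examinee)."""
    n_items = len(scores[0])
    n_people = len(scores)
    totals = [sum(row) for row in scores]
    var_t = pvariance(totals)  # population variance of total scores
    sum_pq = 0.0
    for j in range(n_items):
        p = sum(row[j] for row in scores) / n_people  # item difficulty
        sum_pq += p * (1 - p)
    return n_items / (n_items - 1) * (var_t - sum_pq) / var_t


def kr21(n_items, mean_total, var_total):
    """K-R21 from summary statistics only (assumes equal item difficulty)."""
    p_bar = mean_total / n_items
    npq = n_items * p_bar * (1 - p_bar)
    return n_items / (n_items - 1) * (var_total - npq) / var_total


# Hypothetical responses: 5 examinees, 4 items scored 0/1.
data = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]
totals = [sum(row) for row in data]

r20 = kr20(data)
r21 = kr21(4, mean(totals), pvariance(totals))
print(round(r20, 4), round(r21, 4))
```

Because the items in this made-up data set differ in difficulty, the K-R21 value is lower than the K-R20 value; the two coincide only when all item difficulties are equal.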

Lee Cronbach and Fredric Lord

Lee Cronbach both introduced coefficient alpha and expanded classical test theory into generalizability theory. Cronbach’s 1951 paper on coefficient alpha may be the most cited paper in the history of measurement. According to Google Scholar, it has been cited tens of thousands of times. In a sense, the coefficient alpha paper directly extends the Kuder and Richardson (1937) work. K-R20 is a special case of coefficient alpha in that it applies only to items scored 0/1; coefficient alpha is not limited in this way. Cronbach shows that coefficient alpha (and by extension K-R20) represents the mean of all possible split-half coefficients. It also is “the value expected when two random samples of items from a pool like those in the given test are correlated” (p. 331).

Near the end of his life, Cronbach reflected on coefficient alpha (Cronbach, 2004). He commented that although reliability coefficients were appropriate for application in the kind of correlational psychology that was prevalent from Spearman’s time through the first half of the 20th century, current applications in measurement are better served with an estimate of the standard error rather than a reliability index. In that same paper, Cronbach strongly advocated that estimation of measurement error should be based on analysis using variance components rather than simple correlations. This approach is developed in detail in his work on generalizability theory (Brennan, Ch. 10, this volume; Cronbach, Rajaratnam, & Gleser, 1963; Cronbach, Gleser, Nanda & Rajaratnam, 1972).

The final contributor to the history of classical test theory that requires comment is Fredric Lord. As Cronbach (2004) noted, the coefficient alpha paper “embodied the randomly parallel-test concept of the meaning of true score and the associated meaning of reliability, but only in indefinite language” (p. 402). Lord’s 1955 paper went much further in clarifying and defining the concept of parallel tests.
In addition to this advancement in the theory of measurement, Lord and Novick’s 1968 text has come to be viewed as a kind of ultimate reference on classical test theory. Finally, Lord requires mention here because as Cronbach represented a bridge between classical test theory and generalizability theory, Lord represents an important bridge between classical test theory and item response theory.
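The claim that coefficient alpha equals the mean of all possible split-half coefficients can be checked numerically. The Python sketch below is illustrative only: the data and function names are hypothetical, variances are population variances by assumption, and the split-half coefficient is computed in the form 2(1 − (σ²_a + σ²_b)/σ²_t) for halves a and b, which is an assumption about which split-half formula is meant. It also assumes an even number of items so that every split has equal halves.

```python
from itertools import combinations
from statistics import mean, pvariance


def coefficient_alpha(scores):
    """Coefficient alpha from a score matrix (one row per examinee)."""
    n_items = len(scores[0])
    totals = [sum(row) for row in scores]
    item_vars = [pvariance([row[j] for row in scores]) for j in range(n_items)]
    return n_items / (n_items - 1) * (1 - sum(item_vars) / pvariance(totals))


def split_half(scores, half_a):
    """Split-half coefficient for one split; half_a lists item indices."""
    n_items = len(scores[0])
    half_b = [j for j in range(n_items) if j not in half_a]
    a = [sum(row[j] for j in half_a) for row in scores]
    b = [sum(row[j] for j in half_b) for row in scores]
    totals = [x + y for x, y in zip(a, b)]
    return 2 * (1 - (pvariance(a) + pvariance(b)) / pvariance(totals))


def mean_split_half(scores):
    """Average the split-half coefficient over every equal-sized split."""
    n_items = len(scores[0])
    # Keep item 0 in the first half so each split is counted exactly once.
    splits = [(0,) + rest
              for rest in combinations(range(1, n_items), n_items // 2 - 1)]
    return mean(split_half(scores, s) for s in splits)


# Hypothetical 0/1 item scores (where alpha coincides with K-R20) ...
binary = [[1, 1, 1, 1], [1, 1, 1, 0], [1, 1, 0, 0], [1, 0, 0, 0], [0, 0, 0, 0]]
# ... and hypothetical 1-5 ratings, which K-R20 does not cover.
ratings = [[4, 5, 3, 4], [2, 3, 2, 1], [5, 5, 4, 5], [3, 2, 3, 3], [1, 2, 1, 2]]

alpha_binary = coefficient_alpha(binary)
alpha_ratings = coefficient_alpha(ratings)
print(round(alpha_binary, 4), round(mean_split_half(binary), 4))
print(round(alpha_ratings, 4), round(mean_split_half(ratings), 4))
```

For both data sets the two printed values agree, while individual splits of the binary data give coefficients above and below alpha, which is the dependence on the particular split that Kuder and Richardson noted.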

Concluding Remarks

In this chapter, I have mapped the development of classical test theory beginning with the early influences in England. Given the centrality of methodology based on correlation it is clear that any history of classical test theory must reflect the contributions of what might be referred to as the London or Galton school. That

said, what is presented in this chapter is a history of test theory not the history. The history I have presented is far from exhaustive; important figures in the history of measurement—such as Alfred Binet (Binet & Simon, 1916) and Gustav Fechner (Briggs, Ch.12, this volume; Fechner, 1860) who did foundational work on scaling—have been left out altogether. A second issue that warrants mention has to do with the association of many of the figures in the history of classical test theory with the eugenics movement. This is part of the history of testing and measurement and it needs to be acknowledged. It needs to be acknowledged because it is dishonest to ignore that history. It needs to be acknowledged because it helps us to understand the anti-testing movement that has existed for a century (Lippmann, 1922). It needs to be acknowledged because it reminds us of the ways that measurement can be misused. It also needs to be held in context. In the early 20th century the ideas that supported the eugenics movement were part of the mainstream. Being a part of this movement did not set early leaders in testing and measurement apart from much of the rest of society. Finally, I hope this chapter will contribute not just to our knowledge of history, but to our understanding of test theory. When I first learned the basic ideas of classical test theory, I viewed them as if they were laws of nature or eternal truths. I hope this chapter makes it clear that these concepts evolved in a specific way because of the practical problems that interested early contributors such as Galton, Pearson, and Spearman. These concepts could have evolved differently and as Cronbach (2004) pointed out, the way they did evolve may not be fully appropriate for the testing problems that present themselves in this century. I hope that understanding how they evolved will help us to make better use of the tools that classical test theory provides. 
As Mislevy commented, “History gives us insight into where we are and how we got here, with what seems to me the rather paradoxical result that it is easier for us to modify, extend, or rethink the concepts and tools that have come to us” (Robert Mislevy, personal communication, August 31, 2020).

Notes

1 The author thanks Melissa Margolis, Robert Brennan, Howard Wainer, and Robert Mislevy for their thoughtful comments on an earlier draft of this chapter.
2 For a more complete description of Darwin’s life and work see Browne (1996, 2002) and for a shorter summary of the events leading to the Origin of Species see Quammen (2006). For more information on Wallace see Brooks (1984), Wallace (1869), and van Wyhe & Rookmaaker (2013).
3 See Clauser (2007) for a review of four books about Galton’s life including Gillham (2001) which is an excellent reference.
4 The actual year for the experiments on sweet peas was 1875, not 1885. This is apparently a typographical error. The results of the study were published in 1877.

5 Among other reasons for this feud is the fact that Fisher published a paper that pointed out that Pearson did not correctly incorporate degrees of freedom into his interpretation of the chi-squared test.
6 Although Wissler is the sole author on the published paper, these data were collected as part of a lengthy testing program implemented by James McKeen Cattell at Columbia University. Wissler’s results seem to have put an end to that program (Sokal, 1990; Wissler, 1944).
7 Although it is secondary to the history of classical test theory, it is worth noting that Spearman was one of the pioneers of factor analysis as well. Wolfle’s (1940) monograph on Factor Analysis to 1940 lists 137 publications by Spearman. Again, these methodological developments were in support of his study of the nature of intelligence.
8 The original equation given by Kelley did not include a subscript for the variance in the denominator. The subscript has been added because this form is likely to be more familiar to contemporary readers.
9 The review appeared in the Journal of the American Statistical Association (Spearman, 1930b).
10 This refers to Spearman’s Abilities of Man published the previous year.

References

Abelson, A. R. (1911). The measurement of mental ability of "backward" children. British Journal of Psychology, 4, 268–314.
Aikens, H. A., Thorndike, E. L., & Hubbell, E. (1902). Correlations among perceptive and associative processes. Psychological Review, 9(4), 374–382.
Binet, A. & Simon, T. (1916). The development of intelligence in children. (Translated by E. S. Kite). Baltimore, MD: Williams & Wilkins Company.
Brooks, J. L. (1984). Just before the origin: Alfred Russel Wallace’s theory of evolution. New York, NY: Columbia University Press.
Brown, W. (1910a). Some experimental results in the correlation of mental abilities. British Journal of Psychology, 3, 296–322.
Brown, W. (1910b). The use of the theory of correlation in psychology. Cambridge, UK: Printed privately at the University Press.
Brown, W. (1911). The essentials of mental measurement. Cambridge, UK: at the University Press.
Browne, J. (1996). Charles Darwin: A biography, volume I: Voyaging. Princeton, NJ: Princeton University Press.
Browne, J. (2002). Charles Darwin: A biography, volume II: The power of place. Princeton, NJ: Princeton University Press.
Clauser, B. E. (2007). The Life and Labors of Francis Galton: Four Recent Books about the Father of Behavioral Statistics (a book review). Journal of Educational and Behavioral Statistics, 32, 440–444.
Clauser, B. E. (2008). War, enmity, and statistical tables. Chance, 21, 6–11.
Cronbach, L. J. (1951). Coefficient alpha and the internal structure of tests. Psychometrika, 16, 297–334.
Cronbach, L. J. (2004). My current thoughts on coefficient alpha and successor procedures. Educational and Psychological Measurement, 64, 391–418.
Cronbach, L. J., Gleser, G. C., Nanda, H. & Rajaratnam, N. (1972). The dependability of behavioral measurements: Theory of generalizability for scores and profiles. New York, NY: John Wiley.
Cronbach, L. J., Rajaratnam, N., & Gleser, G. C. (1963). Theory of generalizability: A liberalization of reliability theory. British Journal of Statistical Psychology, 16, 137–163.

Darwin, C. (1859). On the origin of species. London: John Murray.
Darwin, C. (1869, December 23). Letter to Francis Galton. Darwin Correspondence Project (Letter no. 7032) accessed on 25 November 2019, https://www.darwinproject.ac.uk/letter/DCP-LETT-7032.xml.
Edgeworth, F. Y. (1885). On methods of ascertaining variations in the rate of births, deaths and marriages. Journal of the Royal Statistical Society, 48, 628–649.
Edgeworth, F. Y. (1888). The statistics of examinations. Journal of the Royal Statistical Society, 51, 346–368.
Edgeworth, F. Y. (1892). Correlated averages. Philosophical Magazine. 5th series, 34, 190–204.
Edgeworth, F. Y. (1893). Exercises in calculation of errors. Philosophical Magazine. 5th series, 36, 98–111.
Fechner, G. T. (1860). Elemente der Psychophysik, Leipzig: Breitkopf and Härtel; English translation by H. E. Adler, 1966, Elements of Psychophysics, Vol. 1, D. H. Howes & E. G. Boring (Eds.), New York, NY: Rinehart and Winston.
Fisher, I. (1923, December 21). Letter to Truman Kelley from the Eugenics Committee of the United States of America. Harvard University Archives: Truman Kelley Papers. Cambridge, MA: Harvard University.
Fisher, R. A. (1925). Statistical methods for research workers. Edinburgh, UK: Oliver & Boyd.
Fisher, R. A. (1950). Contributions to mathematical statistics. New York, NY: John Wiley.
Galton, F. (1853). Tropical South Africa. London: John Murray.
Galton, F. (1869). Hereditary genius. London: Macmillan.
Galton, F. (1889). Natural inheritance. London: Macmillan.
Galton, F. (1892). Finger prints. London: Macmillan.
Galton, F. (1893). Decipherment of blurred finger prints. London: Macmillan.
Galton, F. (1884). Anthropometric laboratory. London: William Clowes and Sons.
Galton, F. (1884). Record of family faculties. London: Macmillan.
Galton, F. (1895). Finger print directories. London: Macmillan.
Galton, F. (1908). Memories of my life. London: Methuen.
Galton, F. (1909). Essays in eugenics. London: Eugenics Education Society.
Gulliksen, H. (1950). Theory of mental tests. New York, NY: John Wiley.
Gillham, N. W. (2001). A life of Sir Francis Galton: From African exploration to the birth of eugenics. New York, NY: Oxford University Press.
Hart, B., & Spearman, C. (1912). General ability, its existence and nature. British Journal of Psychology, 5, 51–84.
Holzinger, K. J., & Clayton, B. (1925). Further experiments in the application of Spearman’s prophecy formula. Journal of Educational Psychology, 16, 289–299.
Jastrow, J. (1892, July 17). Letter to Francis Galton. University College Archives. London: University College.
Jensen, A. R. (1998). The g factor: The science of mental ability. Westport, CT: Praeger Publishers.
Jevons, W. S. (1881). Review of F. Y. Edgeworth’s Mathematical psychics: An essay on the application of mathematics to the moral sciences. Mind, 6, 581–583.
Kelley, T. L. (1921). The reliability of test scores. Journal of Educational Research, 3, 370–379.
Kelley, T. L. (1923). Statistical method. New York, NY: Macmillan.
Kelley, T. L. (1924). Note on the reliability of a test: a reply to Dr. Crum’s criticism. Journal of Educational Psychology, 15, 193–204.
Kelley, T. L. (1925). The applicability of the Spearman-Brown formula for the measurement of reliability. Journal of Educational Psychology, 16, 300–303.

Kelley, T. L. (1927). Interpretation of educational measurements. Yonkers-on-Hudson, NY: World Book.
Kelley, T. L. (1928). Crossroads in the mind of man. Stanford University, CA: Stanford University Press.
Kelley, T. L. (1930, May 2). Letter to Charles Spearman. Harvard University Archives: Truman Kelley Papers. Cambridge, MA: Harvard University.
Kendall, M. G. (1968). Francis Ysidro Edgeworth, 1845–1926. Biometrika, 55, 269–275.
Kuder, G. F. & Richardson, M. W. (1937). The theory of estimation of reliability. Psychometrika, 2, 151–160.
Lippmann, W. (1922). The mental age of Americans. New Republic 32, 213–215, 246–248, 275–277, 297–298, 328–330, 33, 9–11.
Lombardo, P. A. (2014). When Harvard said no to eugenics: The J. Ewing Mears Bequest, 1927. Perspectives in Biology and Medicine, 57, 374–392.
Lord, F. M. (1955). Estimating test reliability. Educational and Psychological Measurement, 15, 325–336.
Lord, F. M. & Novick, M. R. (1968). Statistical theories of mental test scores. Reading, MA: Addison-Wesley.
Lovie, P. & Lovie, A. D. (1996). Charles Edward Spearman, F.R.S. (1863–1945). Notes and Records of the Royal Society of London, 50, 75–88.
Lovie, S. & Lovie, P. (2010). Commentary: Charles Spearman and correlation: A commentary on 'the proof and measurement of association between two things'. International Journal of Epidemiology, 39, 1151–1153.
Mendel, J. G. (1866). Versuche über Pflanzen-Hybriden. Verhandlungen des naturforschenden Vereines in Brünn, Bd. IV für das Jahr 1865, Abhandlungen, 3–47.
Morant, G. M. & Welch, B. L. (1939). A Bibliography of the Statistical and Other Writings of Karl Pearson. Issued by the Biometrika Office, University College, London. Cambridge, UK: Cambridge University Press.
Pearson, K. (1896). Mathematical contributions to the theory of evolution, III: regression, heredity and panmixia. Philosophical Transactions of the Royal Society of London (A), 187, 253–318.
Pearson, K. (1889, March 11). On the laws of inheritance according to Galton: Lecture delivered to the Men’s and Women’s Club. UCL Archives: Karl Pearson Papers (PEARSON/1/5/19/1). London: University College London.
Pearson, K. (1900). On the criterion that a given system of deviations from the probable in the case of a correlated system of variables is such that it can be reasonably supposed to have arisen from random sampling. Philosophical Magazine. 5th series, 50, 157–175.
Pearson, K. (1904). On the laws of inheritance in man. II. On the inheritance of mental and moral characters in man, and its comparison with the inheritance of physical characters. Biometrika, 3, 131–190.
Pearson, K. (1907). Mathematical contributions to the theory of evolution, XVI. On further methods of determining correlation. Drapers’ Company Research Memoirs. Biometric Series IV. London: Dulau & Co.
Pearson, K. (1920). Note on the history of correlation. Biometrika, 13, 25–45.
Pearson, K. & Moul, M. (1927). The mathematics of intelligence. I. The sampling errors in the theory of a generalized factor. Biometrika, 19, 246–291.
Pearson, K. (1934). Professor Karl Pearson’s reply. In Speeches Delivered at a Dinner held in University College, London In Honor of Professor Karl Pearson 23 April 1934. (pp. 19–24). Cambridge, UK: Privately Printed at the University Press.

Pearson, K. (1936). Method of moments and the method of maximum likelihood. Biometrika, 28, 34–59.
Quammen, D. (2006). The reluctant Mr. Darwin: An intimate portrait of Charles Darwin and the making of his theory of evolution. New York, NY: Atlas Books.
Remmers, H. H., Shock, N. W., & Kelly, E. L. (1927). An empirical study of the validity of the Spearman-Brown formula as applied to the Purdue rating scale. Journal of Educational Psychology, 18, 187–195.
Ruch, G. M., Ackerson, L., & Jackson, J. D. (1926). An empirical study of the Spearman-Brown formula as applied to educational test material. Journal of Educational Psychology, 17, 309–313.
Sokal, M. M. (1990). James McKeen Cattell and mental anthropometry: Nineteenth century science and reform and the origins of psychological testing. In M. M. Sokal (Ed.), Psychological testing and American society, 1890–1930 (pp. 21–45). New Brunswick, NJ: Rutgers University Press.
Spearman, C. E. (1904a). The proof and measurement of association between two things. American Journal of Psychology, 15, 72–101.
Spearman, C. E. (1904b). “General intelligence” objectively determined and measured. American Journal of Psychology, 15, 201–293.
Spearman, C. (1910). Correlation calculated from faulty data. British Journal of Psychology, 3, 271–295.
Spearman, C. (1927). The abilities of man. Oxford, UK: Macmillan.
Spearman, C. (1930a). Autobiography. In C. Murchison (Ed.), A history of psychology in autobiography, Vol. 1 (pp. 299–333). Worcester, MA: Clark University Press.
Spearman, C. (1930b). Review of Crossroads in the mind of Man by Truman L. Kelley. Journal of the American Statistical Association, 25, 107–110.
Spearman, C. (1930, May 5). Letter to Truman Kelley. Harvard University Archives: Truman Kelley Papers. Cambridge, MA: Harvard University.
Stigler, S. M. (1986). The history of statistics: The measurement of uncertainty before 1900. Cambridge, MA: The Belknap Press of Harvard University Press.
Thurstone, L. L. (1931). The reliability and validity of tests. Ann Arbor, MI: Edwards Brothers, Inc.
Tinker, M. A. (1932). Wundt’s Doctorate Students and Their Theses 1875–1920. The American Journal of Psychology, 44, 630–637.
Van Wyhe, J. & Rookmaaker, K. (Eds.) (2013). Alfred Russel Wallace: Letters from the Malay Archipelago. Oxford, UK: Oxford University Press.
Wallace, A. R. (1858). On the tendency of varieties to depart indefinitely from the original type. Zoological Journal of the Linnean Society, 3, 46–50.
Wallace, A. R. (1869). The Malay Archipelago. New York, NY: Harper.
Wissler, C. (1901). The correlation of mental and physical tests. Psychological Review: Monograph Supplements, 3(6), 1–62.
Wissler, C. (1944). The Contribution of James McKeen Cattell to American anthropology. Science, 99, 232–233.
Wolfle, D. (1940). Factor analysis to 1940. Psychometric Monograph Number 3. Chicago, IL: University of Chicago Press.
Wood, B. D. (1926). Studies in achievement tests. Part III. Spearman-Brown reliability predictions. Journal of Educational Psychology, 17, 263–269.
Yule, G. U. (1897). On the significance of Bravais’ formulæ for regression, &c., in the case of skew correlation. Proceedings of the Royal Society of London, 60, 359–367.

9 THE EVOLUTION OF THE CONCEPT OF VALIDITY

Michael Kane and Brent Bridgeman¹

This chapter describes four significant trends in validity theory between 1900 and 2020, but focuses on the period from 1955 to 2010. First, we summarize the development of three traditional models for validity in the first half of the Twentieth Century: the content, criterion, and trait (or construct) models. Second, we trace the gradual development, between 1955 and 1990, of unified models for validity based on an evolving notion of constructs. Third, we examine the increasing importance of questions about fairness and consequences in the evaluation of testing programs since the 1960s. Fourth, we review the development of general argument-based models that explicitly allow for variability in the kinds of evidence needed for the validation of different kinds of testing programs.

Early Developments, 1890–1950

Kelley (1927) is credited with providing the earliest explicit definition of validity in the context of testing, as the extent to which a test “really measures what it purports to measure” (p. 14) and is appropriate for a “specifically noted purpose” (p. 30), but “validity” had been defined as early as 1921 (Sireci, 2016). However, even before 1921, many of the issues and methods now included under the label “validity” were invoked. Toward the end of the Nineteenth Century and into the Twentieth Century, researchers sought to develop measures of mental abilities (e.g., general intelligence, memory) that would estimate these abilities more systematically and precisely (i.e., more scientifically) than was ordinarily possible. The abilities were conceptualized mainly in terms of the kinds of tasks assumed to require the mental ability. They were not embedded in an explicitly stated theory, but they were expected to be associated with certain kinds of performance and achievement; for example, students with high intelligence were expected to be high achievers in school and life. The proposed measures were evaluated in terms of how well they reflected the mental ability of interest (Spearman, 1904).

Early Applications

Tests have always been valued both for the insights they could provide and for their utility in making decisions. By 1920, the possibility of using test scores to predict future performances was recognized. This work on criterion-based predictions focused mainly on applications in selection and placement, with the criteria specified in terms of desired outcomes. Between 1920 and 1950, criterion-related evidence came to be the “gold standard” for validity (Angoff, 1988; Cronbach, 1971; Cureton, 1951; Gulliksen, 1950). The criterion model was supplemented by a content model that focused on the representativeness of the test content. An assessment consisting of a sample of performances from the domain could be used to estimate a test taker’s overall domain performance, as in educational achievement tests, and these tests could be validated in terms of their coverage of the domain (Cronbach, 1971; Ebel, 1961; Ryans and Frederiksen, 1951; Rulon, 1946). In both of these models, the “ability” of interest was essentially a given, defined by a criterion to be predicted or a performance domain to be sampled, and the question was whether the test scores provided accurate indications of the ability. Over time, the problems inherent in specifying the ability of interest came to be recognized and more attention was given to explicating the interpretations and uses of scores.

Validity Theory in the Early 1950s

Cureton (1951) began his chapter on validity in the first edition of Educational Measurement by associating validity with usefulness for some purpose:

The essential question of test validity is how well a test does the job it is employed to do. … Validity is always validity for a particular purpose. It indicates how well the test serves the purpose for which it is used. (Cureton, 1951, p. 621)

Cureton’s exposition was highly practical and highly empirical, but he also considered what the observations mean in terms of underlying abilities, suggesting, as an example, that a vocabulary test could be a reasonably valid indicator of verbal intelligence for children with fairly equal opportunities and incentives to learn word meanings, but for children with varied educational backgrounds, “it may be more valid as an indicator of the general quality of previous instruction in reading than as an indicator of verbal intelligence” (Cureton, 1951, pp. 621–622). Note that Cureton suggested an interpretation of the scores in terms of a trait, “verbal intelligence”, but proposed no theory of intelligence. He also indicated that validity was not simply a property of the test but depended on the intended interpretation and use of the scores, and the population of test takers. Although Cureton recognized that the interpretation of test scores would involve assumptions about the meaning of performance regularities, he adopted a strongly empirical, even operational, stance:

We must not say that his high score is due to his high ability, but if anything the reverse … His “ability” is simply a summary statement concerning his actions. (Cureton, 1951, p. 641)

Cureton (1951) did not rely on traits to explain behavior, but rather interpreted them as labels for dispositions to behave in a certain way. As noted below, the notion of a “trait” has been commonly used in this way.

The Criterion Model

The criterion model provided a simple methodology for evaluating validity, and as a bonus, it yielded a quantitative index of validity. Gathering the required data could be difficult, but the basic procedure was simple: obtain scores on the test and on a criterion assessment, preferably a behavioral measure (Anastasi, 1950), for a fairly large and representative sample of persons who were not selected on the basis of the test scores, and compute a correlation between the two sets of scores. By the 1950s, the criterion-based methodology had become very sophisticated (Cureton, 1951; Gulliksen, 1950), and if a good criterion were available, it provided a simple, quantitative approach to validation. The main problem with the criterion model was in identifying an appropriate criterion measure. As Ebel (1961) suggested:

The ease with which test developers can be induced to accept as criterion measures quantitative data having the slightest appearance of relevance to the trait being measured is one of the scandals of psychometry. (p. 642)

It can be difficult to obtain a criterion that is clearly better than the assessment itself, and without some way of validating criteria that does not involve other criteria, we face either infinite regress or circularity. One way out of this dilemma is to base the criterion on direct observations of the performance of interest, or a proxy measure validated in terms of its relevance and reliability (Cureton, 1951; Cronbach, 1971; Ebel, 1961; Guion, 1977; Gulliksen, 1950; Kane, Crooks, and Cohen, 1999; Rulon, 1946). Criterion-related validity evidence continues to be important in making a case for the validity of any interpretation or use that involves inferences from test scores to some criterion performance (AERA et al., 2014; Zwick, 2006), but criterion-related evidence is now interpreted in a broader context that requires evaluations of fairness and of the positive and negative consequences of test use (Cronbach, 1988; Guion, 1998; Messick, 1989a).
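The basic computation behind the criterion model can be illustrated with a short sketch. The Python below computes a validity coefficient as the Pearson correlation between test scores and criterion scores. The data, the variable names (test_scores, criterion_ratings), and the helper pearson_r are hypothetical and are not drawn from any study cited here; the sketch also assumes, as the procedure requires, that the sample was not selected on the basis of the test scores.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length score lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length lists with at least two scores")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    dev_x = [value - mean_x for value in x]
    dev_y = [value - mean_y for value in y]
    cross = sum(a * b for a, b in zip(dev_x, dev_y))
    ss_x = sum(a * a for a in dev_x)
    ss_y = sum(b * b for b in dev_y)
    if ss_x == 0 or ss_y == 0:
        raise ValueError("scores must vary for a correlation to be defined")
    return cross / sqrt(ss_x * ss_y)


# Hypothetical data: eight persons, each with a test score and a later
# criterion measure (for example, a supervisor rating of job performance).
test_scores = [52, 61, 47, 70, 65, 58, 43, 75]
criterion_ratings = [3.1, 3.6, 2.8, 4.2, 3.5, 3.4, 2.5, 4.4]

# Under the criterion model, this correlation is the quantitative index of validity.
validity_coefficient = pearson_r(test_scores, criterion_ratings)
print(f"validity coefficient: {validity_coefficient:.3f}")
```

A coefficient computed this way is only as meaningful as the criterion measure itself, which is the difficulty Ebel (1961) raised.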

The Content Model

The content model assumed that a valid measure of competence in a domain of performances could be developed by including samples of the performance of interest in the assessment, or of the skills needed for the performance, and is widely used in assessments of achievement. Rulon (1946) emphasized that the validity of an assessment would depend on the use to be made of the scores and that validity would be dependent on the relevance and appropriateness of the content included in the assessment. Content-related analyses are still widely used in validating achievement tests. The content-based approach was open to criticism, especially if it was not well defined or carefully implemented (Ebel, 1961; Guion, 1977; Messick, 1989a; Sireci, 1998), but a plausible argument could be made for interpreting scores based on samples of performance of some kind in terms of level of skill in that kind of performance (Cronbach, 1971).

Traits

From the 1890s to the early 1950s, assessment scores were often interpreted in terms of traits, which were conceived of as tendencies to perform in certain ways in response to certain kinds of stimuli or tasks. According to Cureton (1951), if the item scores on a test correlate substantially, the sum of the item scores can be taken as a measurement of:

Whatever … is invoked in common by the test items as presented in the test situation. This “whatever” may be termed a “trait.” … The existence of the trait is demonstrated by the fact that the item scores possess some considerable degree of homogeneity. (p. 647)

Traits could vary in terms of the behaviors involved, in their generality (e.g., spelling vs. intelligence), and in their stability, but they shared three characteristics (Campbell, 1960; Cureton, 1951). First, they were defined in terms of kinds of performance. Second, no theories or causal mechanisms need be specified. Third, most traits were interpreted as enduring attributes of persons. Traits have played major roles in educational and psychological measurement. The mental abilities discussed in the last section were traits. The true scores of classical test theory (Lord and Novick, 1968), the universe scores of generalizability theory (Brennan, 2001a, 2001b), the latent traits of item-response theory (Lord, 1980), and the factors in factor analysis (McDonald, 1985) reflect traits. In general, validity theory in the early 1950s was highly empirical, focusing mainly on the criterion model. The content model provided a model for validating interpretations in terms of performance domains and also provided a way to develop criteria for criterion-based validation. In addition, trait interpretations were widely used, but the traits were generally defined in terms of kinds of tasks or kinds of behavioral dispositions and were validated in terms of relevance and reliability. Validity addressed both score interpretations and score uses (e.g., in prediction and diagnosis). The requirements for validation depended on “how well a test does the job it is employed to do” (Cureton, 1951, p. 621).

Construct Validity, 1954–55

The model proposed by Cronbach and Meehl (1955) in their landmark paper, Construct Validity in Psychological Tests, has not been much used in practice, but its central ideas have had a pervasive influence on validity theory. It was introduced mainly to provide a validation framework for the kinds of trait interpretations used in personality theory and in clinical psychology. The APA Committee on Psychological Testing was charged with outlining the kinds of evidence needed to justify the kind of “psychological interpretation that was the stock-in-trade of counselors and clinicians” (Cronbach, 1989, p. 148). They introduced the basic ideas of construct validity, which were incorporated in the Technical Recommendations (American Psychological Association, 1954), and were more fully developed in a subsequent paper (Cronbach and Meehl, 1955). As Cronbach (1971) later summarized its origins:

The rationale for construct validation (Cronbach & Meehl, 1955) developed out of personality testing. For a measure of, for example, ego strength, there is no uniquely pertinent criterion to predict, nor is there a domain of content to sample. Rather, there is a theory that sketches out the presumed nature of the trait. If the test score is a valid manifestation of ego strength, so conceived, its relations to other variables conform to the theoretical expectations. (pp. 462–463)

The goal was to put the validation of traits on firmer ground. In the early 1950s, the hypothetico-deductive model of theories provided the dominant framework for evaluating theoretical interpretations (Suppe, 1977). This model treated theories as interpreted sets of axioms. The core of the theory consists of axioms, or hypotheses, connecting abstract terms, or “constructs”, which were implicitly defined by their roles in the axioms. The axioms and the constructs were interpreted, by connecting some of the constructs to observations, through “correspondence rules” (Suppe, 1977, p. 17). The axioms and correspondence rules, and any conclusions derived from these relationships, constituted a “nomological network” (Cronbach and Meehl, 1955). The validity of the theory and the construct interpretations would be evaluated in terms of how well the theory (with the construct measures) handles empirical challenges. If the empirical predictions derived from the network were confirmed, the construct interpretations and the theory would be validated; otherwise, the construct interpretation or the theory, or both, would be questioned. For Cronbach and Meehl (1955), construct interpretations were based on a scientific theory, which had to be developed, stated clearly, and validated. Both the score interpretation in terms of the construct and the theory were subject to challenge. In effect, Cronbach and Meehl (1955) shifted the focus from the validity of the test for some intended interpretation or use (e.g., as a predictor of some criterion) to the plausibility of the theory-based interpretation and use of the test scores. The model proposed by Cronbach and Meehl (1955) was very elegant, but it has been very difficult to apply in the social sciences, and Cronbach later expressed regret that he and Meehl had tied their model to a particular view of theories (Cronbach, 1989). Nevertheless, the 1955 paper shaped the subsequent development of validity theory. In the 1940s, the term “construct” was rarely if ever used in testing; by the 1980s everything was a construct.

The Evolution of “Construct” Validity, 1955–1989

After 1955, construct validity evolved in two directions. First, it was viewed as one of three main validation models, along with the criterion and content models for validity, each associated with a particular interpretation or use of scores, and each involving particular kinds of evidence. This approach was labeled the “Trinitarian” model by Guion (1980), but we will refer to it as the application-specific approach. Other specific “kinds” of validity were introduced over the years; Newton and Shaw (2013) identified 32 validity modifier labels that have been proposed for different types of validity. The second direction involved the development of a general unified framework for validation based on a much-relaxed version of the construct model (Cronbach and Meehl, 1955). In developing the unified models, some aspects of the construct model were dropped (particularly, the need for a formal theory specified in terms of a nomological network), and some aspects were generalized and made more central and explicit (e.g., the expectation that rival hypotheses would be considered). As a result, the general unified model that emerged (Messick, 1975) was quite different from the model proposed by Cronbach and Meehl (1955). Cronbach and Meehl (1955) presented construct validity as a model to be used, “whenever a test is to be interpreted as a measure of some attribute or quality which is not operationally defined” (1955, p. 282), and for “attributes for which there is no adequate criterion” (1955, p. 299). They presented it as an alternate, specific model, but they also suggested that it involved more fundamental concerns, in that, “determining what psychological constructs account for test performance is desirable for almost any test” (p. 282). They presented construct validity as a fundamental concern, but not as a general framework for validity. The conflict between the application-specific and unified approaches was there from the beginning.

Construct Validity in Science, 1955–1970

In 1957, Jane Loevinger suggested that “construct validity is the whole of the subject from a systematic, scientific point of view” (p. 461), because the other models are ad hoc and limited to specific uses. Loevinger was a developmental psychologist and was interested in scientific research, a natural setting for the construct model. It is not clear whether Loevinger was advocating for the adoption of construct validity as a general framework for validity in all contexts, or simply emphasizing its utility in scientific research and downplaying questions about more applied uses like selection and achievement testing. Campbell and Fiske (1959) suggested multitrait-multimethod analyses, in which several traits are each measured using several assessment methods, as a way to evaluate a number of assumptions that are commonly made about traits and trait measures. For example, correlations between measures of a single trait using different methods should be fairly high (i.e., convergent analyses), and correlations between measures of different traits using a common method should be relatively low (i.e., discriminant analyses). According to the 1966 Standards (APA, AERA, & NCME, 1966):

Tests are used for several types of judgment, and for each type of judgment, a different type of investigation is required to establish validity. … The three aspects of validity corresponding to the three aims of testing may be named content validity, criterion-related validity and construct validity. (p. 12)

and

Construct validity is ordinarily studied when the tester wishes to increase his understanding of the psychological qualities being measured by the test. (p. 13)

The 1966 Standards adopted an application-specific approach, with construct validity focused on psychological traits.
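The convergent and discriminant comparisons proposed by Campbell and Fiske (1959) can be sketched for the smallest possible design, two traits each measured by two methods. The trait names (anxiety, sociability), the method names (self_report, peer_rating), the helper pearson_r, and all scores below are invented for illustration; an actual multitrait-multimethod analysis would examine the full correlation matrix for a much larger sample.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length score lists."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cross / sqrt(ss_x * ss_y)


# Hypothetical scores for six persons, keyed by (trait, method).
scores = {
    ("anxiety", "self_report"): [10, 14, 15, 20, 22, 27],
    ("anxiety", "peer_rating"): [11, 12, 17, 19, 24, 25],
    ("sociability", "self_report"): [21, 12, 26, 15, 24, 17],
    ("sociability", "peer_rating"): [19, 13, 27, 14, 22, 18],
}
traits = ["anxiety", "sociability"]
methods = ["self_report", "peer_rating"]

# Convergent: the same trait measured by different methods (expected to be fairly high).
convergent = {
    trait: pearson_r(scores[(trait, methods[0])], scores[(trait, methods[1])])
    for trait in traits
}

# Discriminant: different traits measured by a common method (expected to be relatively low).
discriminant = {
    method: pearson_r(scores[(traits[0], method)], scores[(traits[1], method)])
    for method in methods
}

for trait, r in convergent.items():
    print(f"convergent   {trait:<12} r = {r:+.2f}")
for method, r in discriminant.items():
    print(f"discriminant {method:<12} r = {r:+.2f}")
```

With these invented scores the same-trait correlations are high and the same-method correlations are low, the pattern that would support the trait interpretations; a high same-method correlation between different traits would instead point to method variance.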

Softening the Construct-Validity Model – Cronbach (1971)

In his chapter in the second edition of Educational Measurement, Cronbach (1971) continued to associate construct validation with theoretical variables for which “there is no uniquely pertinent criterion to predict, nor is there a domain of content to sample” (p. 462), and suggested that, “A description that refers to the person’s internal processes (anxiety, insight) invariably requires construct validation” (p. 451). Cronbach (1971) also discussed the need for an overall evaluation of validity, which would include many kinds of evidence, including construct-related evidence:

Validation of an instrument calls for an integration of many types of evidence. The varieties of investigation are not alternatives any one of which would be adequate. The investigations supplement one another… For purposes of exposition, it is necessary to subdivide what in the end must be a comprehensive, integrated evaluation of the test. (Cronbach, 1971, p. 445; italics in original)

Cronbach (1971) criticized some programs of construct validity as, “haphazard accumulations of data rather than genuine efforts at scientific reasoning” and suggested that:

Construct validation should start with a reasonably definite statement of the proposed interpretation. The interpretation will suggest what evidence is most worth collecting to demonstrate convergence of indicators. A critical review in the light of competing theories will suggest important counterhypotheses, and these also will suggest data to collect. Investigations to be used for construct validation, then, should be purposeful rather than haphazard (Campbell, 1960). (Cronbach, 1971, p. 483)

This echoes Cronbach and Meehl, but it is much softer. The talk of theories and nomological networks is replaced with talk of “a reasonably definite statement of the proposed interpretation”. Cronbach (1971) envisioned a structured and unified conception of validity that was later more fully elaborated by Messick (1989a) and Kane (2006). But even with his looser and more comprehensive conception of construct validity, Cronbach (1971) maintained that Loevinger’s suggestion that claims of content validity be dropped in favor of construct validation was sound in some contexts, but “much too sweeping” (p. 454).

The 1974 Standards

The 1974 Standards (APA, AERA, & NCME, 1974) defined validity in terms of “what may properly be inferred from a test score” (p. 25), a general, unified ideal, but it discussed validation in terms of an expanded set of “four interdependent kinds of inferential interpretation” (p. 26): predictive, concurrent, content, and construct validities. The construct-validity model was to be reserved for measures of theoretical constructs, where the construct is “a dimension understood or inferred from its network of interrelationships” (p. 29).

Meaning and Values in Measurement – Messick (1975)

Messick (1975) quoted Loevinger (1957) to the effect that, from a scientific point of view, construct validity is the whole of the subject, and he maintained that, in contrast with more specific models that focus on specific interpretations and uses, construct validation involves hypothesis testing and “the philosophical and empirical means by which scientific theories are evaluated” (p. 956):

Construct validation is the process of marshalling evidence in the form of theoretically relevant empirical relations to support the inference that an observed response consistency has a particular meaning. The problem of developing evidence to support an inferential leap from an observed consistency to a construct that accounts for that consistency is a generic concern of all science. (Messick, 1975, p. 955)

Messick (1975) was still treating the construct validity model as the first among other models, as a generic concern in science, rather than a general framework, and he focused on its use in scientific contexts. Messick (1975) was also loosening the idea of a construct. He suggested that in order to evaluate an interpretation or use of scores, it is necessary to be clear about the construct meanings and associated values, but he did not require that the construct be embedded in a theory. In broadening thinking about constructs, he drew attention to the importance of values and consequences, and suggested that, in considering any test use, two questions were of central concern:

First, is the test any good as a measure of the characteristic it is interpreted to assess? Second, should the test be used for the proposed purpose? The first question is a technical and scientific one and may be answered by appraising evidence bearing on the test’s psychometric properties, especially construct validity. The second question is an ethical one, and its answer requires an evaluation of the potential consequences of the testing in terms of social values. (Messick, 1975, p. 960)

Messick was strongly committed to the importance of values throughout his career. Messick (1975) gave the evaluation of plausible rival hypotheses a central role in validation and concluded that, “If repeated challenges from a variety of plausible rival hypotheses can be systematically discounted, then the original interpretation becomes more firmly grounded” (Messick, 1975, p. 956), and he suggested that convergent and discriminant analyses could be used to rule out alternate hypotheses (Campbell and Fiske, 1959). Embretson (1983) drew an insightful distinction between two kinds of interpretation: construct representation refers to the model-specific processes and structures (i.e., a cognitive theory of performance) that can be used to account for test taker performances, and nomothetic span refers to the network of relationships that support inferences to other variables. Both of these theory-based interpretations can provide a basis for construct validation, but the kinds of evidence needed to validate the interpretations differ. There are contexts where one of these two theory types predominates and so any strong version of construct validity may not provide a unified framework for validation.

Applications of Construct Validity in the 1970s

Confirmatory factor analysis (Jöreskog, 1973) can be interpreted in terms of Cronbach and Meehl’s (1955) model for construct validation. The confirmatory factor model postulates relationships between latent variables, or constructs, with some theory-based constraints on the factor structure, and the model is checked by fitting it to appropriate empirical data. If the model does not fit the data, either the postulated assumptions or the validity of the assessments must be questioned. In 1979, the federal agencies responsible for enforcing civil-rights laws published Uniform Guidelines (EEOC et al., 1979), which promoted the use of criterion-related evidence for the validation of employment tests. The Guidelines allowed for the use of content-based and construct-based analyses, but preferred criterion-related analyses, and thus enshrined an application-specific framework in legal analyses of fairness in employment testing. In practice, construct validity was not treated as a general, unified framework for validity in the 1970s, and when it was used to evaluate testing programs, it was rarely applied in a rigorous way. As Cronbach lamented:

The great run of test developers have treated construct validity as a wastebasket category. In a test manual, the section with that heading is likely to be an unordered array of correlations with miscellaneous other tests and demographic variables. Some of these facts bear on construct validity, but a coordinated argument is missing. (Cronbach, 1980b, p. 44)

Although Messick, Cronbach, and others were moving toward a more general, unified conception of validity, practice still focused on specific models tied to specific interpretations and uses. According to Angoff (1988):

In essence then, validity was represented, even well into the 1970s as a three-categorized concept and taken by publishers and users alike to mean that tests could be validated by any one or more of the three general procedures. (Angoff, 1988, p. 25)

Validity theorists (Anastasi, 1986; Cronbach, 1980a; Guion, 1977, 1980; Messick, 1975, 1980) were concerned that the separate models in the application-specific approach did not provide any clear, consistent standards for validity, but practice continued to focus on the application-specific models.

The 1985 Standards – Victory (of Sorts) for the Unified View

The 1985 Standards (AERA, APA, and NCME, 1985) characterized validity as a “unified concept”, while accepting that different interpretations and uses would require different kinds of evidence:

Validity, however, is a unitary concept. Although evidence may be accumulated in many ways, validity always refers to the degree to which that evidence supports the inferences that are made from the scores. (p. 9)

Validity was taken to be a unitary concept, but the introduction to the chapter was divided into sections for different kinds of evidence: background, construct-related evidence, content-related evidence, criterion-related evidence, validity generalization, and differential prediction. Evidence in the “construct-related category” would focus on the “psychological characteristic of interest” (AERA et al., 1985, p. 9) and the construct should be embedded in a conceptual framework:

The conceptual framework specifies the meaning of the construct, distinguishes it from other constructs, and indicates how the measure of the construct should relate to other variables. (AERA et al., 1985, pp. 9–10)

The discussion of construct-related evidence remained close to Cronbach and Meehl (1955), with a “conceptual framework” instead of a formal theory. At about the same time, Anastasi (1986) suggested that

content analyses and correlations with external criteria fit into particular stages in the process of construct validation, that is, in the process of both determining and demonstrating what a test measures. (Anastasi, 1986, p. 4)

The 1985 Standards and much of the subsequent literature on validity theory defined validity as a unitary concept but did not provide much guidance on how to combine different kinds of evidence in validation (Moss, 1995). The evidence was to span the three traditional categories, more evidence would be better than less, quality was important, and the evidence should be chosen in light of intended use, but there was little explicit guidance on how all of this was to be done. The 1999 Standards for Educational and Psychological Testing continued in this vein, defining validity as

the degree to which evidence and theory support the interpretation of test scores entailed by proposed uses of tests. … The process of validation involves accumulating evidence to provide a sound scientific basis for the proposed score interpretations. (AERA, APA, NCME, 1999, p. 9)

Validity theorists wanted a more unified, principled, and consistent approach to validation. A major development during the 1980s that did provide explicit guidance for validation was an increasing emphasis on empirical challenges to proposed interpretations of test scores. The notion of systematic error has a long history in the physical sciences, but it became especially relevant and explicit in validity theory in the 1980s and 1990s (Cook and Campbell, 1979; Messick, 1989a; AERA et al., 1999), in terms of two kinds of systematic errors. Messick (1989a) defined construct-irrelevant variance as “excess reliable variance that is irrelevant to the interpreted construct” (p. 34), and he defined construct underrepresentation as occurring if “the test is too narrow and fails to include important dimensions or facets of the construct” (p. 34). These two types of systematic error reflect construct validity’s focus on challenging proposed construct interpretations, empirically and conceptually. If any serious source of construct-irrelevant variance or construct underrepresentation is found, or plausibly suspected, the intended interpretation is undermined.

The Strong and Weak Programs of Construct Validity, 1988–89

In chapters published in 1988 and 1989, Cronbach drew a distinction between the original strong program of construct validity and a weak program of construct validity:

Two concepts of CV were intermingled in the 1954 Standards: a strong program of hypothesis-dominated research, and a weak program of Dragnet empiricism: “just give us the facts, ma’am … any facts”. The CM paper unequivocally sets forth the strong program: a construction made explicit, hypotheses deduced from it, and pointed relevant evidence brought in. This is also the stance of the 1985 Standards. (Cronbach, 1989, p. 162; italics and abbreviations in original)

And he favored the strong program:

The strong program … calls for making one’s theoretical ideas as explicit as possible, then devising deliberate challenges. Popper taught us that an explanation gains credibility chiefly from falsification attempts that fail. (Cronbach, 1988, pp. 12–13)

Cronbach (1988) and Anastasi (1986) explicitly maintained that validity theory had gotten beyond the application-specific approach, but as Moss (1992) noted:

To this day, most of the popular measurement text books, like the 1985 Standards, continue to organize presentations of validity around the three-part traditional framework of construct-, content-, and criterion-related evidence. (p. 232)

The tension between calls for a unified framework and the diversity inherent in different interpretations and uses had not been resolved.

General Principles Derived from Construct Validity, 1955–1989

Although the original, strong version of construct validity (Cronbach and Meehl, 1955) was not applied much, it yielded three general principles that shaped the development of validity theory. First, effective validation requires that the proposed score interpretation and use be specified well enough that testable hypotheses can be derived from it. For a test score to have any meaning, it must make testable claims about the test taker. Second, just as scientific theories are evaluated in terms of their ability to withstand serious challenges, proposed interpretations are to be evaluated against alternate interpretations (Cronbach, 1971, 1980a, 1980b, 1988; Embretson, 1983; Messick, 1989a). Third, validation requires a program of research that investigates the claims being made and any counterclaims and their supporting assumptions, rather than a single validation study.

Messick’s Unified, but Faceted, Model, 1989

In his Educational Measurement chapter, Messick (1989a) provided a unified framework for validity based on a broadly defined version of the construct validity model. He defined validity as

an integrated evaluative judgment of the degree to which empirical evidence and theoretical rationales support the adequacy and appropriateness of inferences and actions based on test scores or other modes of assessment. (p. 13; italics in original)

Note two significant departures from Cronbach and Meehl’s (1955) version of construct validity. First, the definition covers actions as well as inferences, while Cronbach and Meehl focused on theory-based interpretations and not on actions. Messick was a consistent advocate for including the evaluation of consequences in validation as an “integral part of validity” (Messick, 1989a, p. 84). Second, there is no mention in the definition of a theory that defines the construct, or of nomological networks. Nomological networks are discussed by Messick (1989a), but they do not get a lot of attention. Messick (1989a) defined construct validity much more broadly than Cronbach and Meehl (1955), and took construct validity to be the evidential basis of both score uses and interpretations:

The construct validity of score interpretation undergirds all score-based inferences not just those related to interpretive meaningfulness but also the content- and criterion-related inferences specific to applied decisions and actions based on test scores. (pp. 63–64)

He focused on the need to provide evidence for the “trustworthiness of score interpretation in terms of explanatory concepts that account for both test performance and relationships with other variables” (Messick, 1989a, p. 34). Messick’s (1989a) unified, construct-based framework for validity had at least two significant problems. First, Messick’s presentation of his framework is dense, making it difficult to apply. Messick organized the different kinds of evidence for validity in terms of a two-by-two matrix with the function of testing (interpretation or use) and the justification for testing (evidence or consequences) as the two dimensions. The table was not used much in theoretical discussion or applications of validity, in part because of substantial overlap among the cells (Kane and Bridgeman, 2017). Second, there is some conflict within the framework.
Messick was emphatic about the central role to be played by construct validity, but many examples of more-or-less acceptable validations do not require the strong program of construct validity. For example, Guion (1977) made the case that it would be reasonable to interpret scores on a sample of tasks drawn from a domain as a measure of skill in the domain. Messick seemed to assume that, within the unified framework, the same construct-based methodology would be applicable to all cases. In this vein, Messick (1988) expressed concern that the comment attached to the first validity standard in the 1985 edition allowed for the use of different types of evidence in the validation of different testing programs. He maintained that under a unified approach based on the construct-validity model, validations should follow a consistent pattern; to the extent that different types of evidence are used for different testing programs, we have an application-specific framework. Messick developed his framework from the late 1950s to the late 1980s, as validity theory moved toward unity, while practice tended to be application-specific.

The Stubborn Conflict between Theoretical Unity and Applied Diversity

The conflict between the theoreticians’ desire for a unified framework for validity and the recognition of the great diversity in the goals and contexts of assessment programs continued from 1955 to 1989, and, as discussed later, is still with us (Sireci, 2009). Messick made an effort to unify validity under the construct model, but his elegant formulation did not resolve the conflict; in place of lists of kinds of validity, we had lists of kinds of validity evidence, with different mixes of evidence used for different applications. One major problem with basing the unified framework on the construct model is the strong association between “construct validity” and theoretical interpretations and theory-based uses; many testing applications focus on practical questions (e.g., how well has the test taker mastered some domain?, how can we predict the test taker’s performance in some future activity?) that do not rely much on theory.

The Role of Consequences in Validity, 1970

The achievement of the intended outcomes of testing programs has been a fundamental concern in validity theory from the early twentieth century, especially in criterion-related applications, but unintended, negative consequences got far less attention.

Adverse Impact as a Negative Consequence

Before 1960, adverse impact across groups (racial, ethnic, gender) was not given much attention in testing, because a test was considered fair if all test takers performed the same tasks under the same conditions and were graded in the same way. Neither “bias” nor “fairness” was listed in the index for the first edition of Educational Measurement (American Council on Education, 1951), but by the mid-1970s, bias had become a major concern in assessment. Messick (1982b) criticized a National Academy of Sciences report on ability testing, because it “evinces a pervasive institutional bias” (p. 9), by focusing on the intended outcomes of decision rules:

Our traditional statistics tend to focus on the accepted group and on minimizing the number of poor performers who are accepted, with little or no attention to the rejected group or those rejected individuals who would have performed adequately if given the chance. (p. 10)

The evaluation of how well score-based decisions achieve their intended outcomes was a well-established expectation, but Messick was suggesting that the applicant’s welfare also merits attention. Cronbach (1988) made a similar point about narrow inquiries that “concentrate on predicting a criterion the employer cared about” (p. 7), and neglect concerns about the applicants who are rejected. Once adverse impact got sustained attention, it was natural to think about other potential negative consequences. It was soon recognized that assessment programs can have a substantial impact on educational institutions, curricula, and students (Crooks, 1988; Frederiksen, 1984; Madaus, 1988; Lane & Stone, 2006).

Messick and Cronbach on Consequences

Both Messick (1975, 1989a, 1995) and Cronbach (1971, 1980b, 1988) included the evaluation of consequences within validity, but they saw consequences as playing different roles in validity (Moss, 1992). Messick saw the evaluation of consequences as an aspect of construct validity, because the consequences of score use “both derive from and contribute to the meaning of test scores” (Messick, 1995, p. 7), and negative consequences count against validity if they are due to construct-irrelevant variance or construct under-representation. For Messick, unanticipated negative consequences suggest a need for a more thorough analysis of possible sources of construct-irrelevant variance or construct under-representation, but they would not necessarily count against the validity of the scores. In contrast, Cronbach (1971, 1988) suggested that negative consequences could invalidate score use even if the consequences were not due to any problem with the assessment, because “tests that impinge on the rights and life chances of individuals are inherently disputable” (Cronbach, 1988, p. 6). Cronbach (1988) also maintained that we “may prefer to exclude reflection on consequences from meanings of the word validation, but … cannot deny the obligation” (p. 6). That is, bad consequences do not necessarily invalidate the proposed interpretation of test scores, but they do count against test use, even if the interpretation is well supported. After referencing Cronbach (1988), Messick (1989b) countered that

the meaning of validation should not be considered a preference. On what can the legitimacy of the obligation to appraise social consequences be based if not on the only genuine imperative in testing, namely, validity. (p. 11)

Cronbach’s insistence on evaluating all consequences probably flowed from his involvement in program evaluation.
Cronbach (1982) “advocated investigating what is important, whether or not the questions fit conventional paradigms” (p. xvi). Messick (1989a) developed a general scientific framework for validity, with a primary focus on construct validity and a strong but secondary emphasis on consequences
while Cronbach (1971, 1988) favored a more pragmatic approach, with a more direct focus on consequences. Kane (2006) tends to agree with Cronbach (1988) that negative consequences can invalidate score uses even if they are not due to any flaws in the assessment. Ironically perhaps, Messick got more criticism than Cronbach for advocating the role of consequences in validation, but arguably, Cronbach gave consequences a stronger role in the evaluation of assessment programs (Moss, 1998). Cronbach’s position was less objectionable to critics who were willing to attend to negative consequences but did not want to include them under the heading of validity.

The 1990s Consequences Debates

In the 1990s, several authors (Mehrens, 1997; Popham, 1997) argued against the inclusion of consequences under validity. The critics generally agreed that consequences are relevant to the evaluation of testing programs but wanted to have validity be as objective and value-free as possible. Consequences were to be evaluated but not under the heading of validity. Others (Linn, 1997; Moss, 1998; Shepard, 1997) favored a broader conception of validity, which would include evaluations of positive and negative consequences of score use. Everyone seemed to agree that consequences should be a central concern in deciding whether to use a test in a particular way, but they disagreed about whether this concern should be addressed under the heading of validity. This debate about the role of consequences in validity theory has continued into the 21st century (Bachman and Palmer, 2010; Cizek, 2012).

Unity and Specificity, 2000–2020

Messick’s construct-based framework was unified, but it did not provide clear guidance for validation, and a number of general and specific approaches have since been developed to fill the gap between theory and practice. The general frameworks are flexible and conditional and explicitly require different kinds of evidence for different interpretations and uses. The specific models focus on particular interpretations and uses.

General Argument-Based Frameworks, 1988

Cronbach (1988) relied on principles from program evaluation (House, 1980) in developing an argument-based framework for validity:

I propose here to extend to all testing the lessons from program evaluation. What House … called “the logic of evaluation argument” applies, and I invite you to think of “validity argument” rather than “validation research”. (p. 4)

The validity argument would include the evidence for and against the claims inherent in the proposed interpretation and use. The argument was to “make clear, and to the extent possible, persuasive, the construction of reality and the value weightings implicit in a test and its application” (Cronbach, 1988, p. 5). Kane (1992) added the idea of an interpretive argument, “with the test score as a premise and the statements and decisions involved in the interpretation as conclusions” (p. 527), as a way of specifying the claims that need to be evaluated, and therefore, the kinds of evidence needed for validation. This argument-based approach also provided criteria for deciding when the interpretation and use were adequately supported, that is, validated (Crooks, Kane & Cohen, 1996). If the argument were coherent and complete and its inferences were plausible, the interpretation/use could be considered valid. If any part of the argument were not plausible, the interpretation/use would not be considered valid (Haertel, 1999; Kane, 1992; Shepard, 1993). Bachman and Palmer (2010) proposed an argument-based framework for assessment development and justification that emphasized score uses and the consequences associated with score uses in terms of an Assessment Use Argument (AUA):

The AUA consists of a set of claims that specify the conceptual links between a test taker’s performance, … an interpretation about the ability we want to assess, the decisions that are to be made, and the consequences of using the assessment and of the decisions that are made. (p. 30)

Bachman and Palmer based their framework on “the need for a clearly articulated and coherent Assessment Use Argument (AUA)” and on “the provision of evidence to support the statements in the AUA” (p. 31). Following Bachman and Palmer’s work, Kane (2013) updated his terminology, replacing “interpretive argument” with interpretation/use argument, or IUA, in order to give more emphasis to the role of uses and consequences.
These argument-based frameworks are intended to retain the rigor in the strong program of construct validity, while making validation more straightforward (Cronbach, 1988; Kane, 2006; Chapelle, Enright & Jamieson, 2008; Bachman and Palmer, 2010), by making the claims to be validated explicit. One specifies the claims being made in some detail and then evaluates these claims. The inferences and assumptions would be subjected to empirical challenges, and if they survive all serious challenges, the interpretations and uses would be considered plausible, or valid. The most questionable parts of the argument should be the focus of the empirical challenges (Cronbach, 1988). The claims may involve a theory, or they may consist of a more loosely defined set of inferences. At the very least, some assumptions about the generalizability of the scores (over tasks, occasions, contexts, task formats, or time limits?)
will be inherent in the interpretation and use of the scores. If the interpretation or use assumes that the scores will be related to other variables, these relationships can be checked. If the scores are to be used to predict some criterion, the accuracy of the predictions can be checked, but note that, if the interpretations and uses under consideration do not involve prediction, then predictive evidence is irrelevant to the validity of these interpretations and uses. Claims that are not inherent in the proposed interpretation and use of the scores can be ignored, and evidence for such irrelevant claims does not strengthen the validity argument. The argument-based frameworks are quite general and unified in that they impose the same three general requirements for validation on all testing programs: (1) specify the claims being made, (2) verify that the claims accurately represent the interpretation and use of the scores, and (3) verify that the claims are plausible by challenging them empirically. The chapter on validity in the most recent Standards (AERA, APA, NCME, 2014) is consistent with these argument-based approaches. It calls for a clear statement of the proposed interpretations and uses of the scores, and the second standard requires that: A rationale should be presented for each intended interpretation of test scores for a given use, together with a summary of the evidence and theory bearing on the intended interpretation. (AERA, APA, NCME, 2014, p.23) However, most of the discussion is organized in terms of five kinds of evidence (evidence based on test content, on response processes, on internal structure, and on relations to other variables, as well as evidence for validity and consequences of testing), and not in terms of the inferences to be evaluated.

Recent Application-specific Models for Validity

As we have noted, application-specific models continued to be popular in practice, even as validity theory became more general and unified. In addition, some application-specific validity models have been put forward as definitions of validity, and in doing so, they propose to restrict the term “validity” to specific kinds of interpretations. Borsboom, Mellenbergh & Van Heerden (2004) suggested that:

a test is valid for measuring an attribute if and only if (a) the attribute exists and (b) variations in the attribute causally produce variations in the outcomes of the measurement procedure (p. 1061).

They define validity in terms of causal explanations and do not include uses or consequences, or any other variables in their definition (Sireci, 2016). As Holland
(1986) points out, the causal impact of a trait, or construct, on assessment performances cannot be directly demonstrated empirically. The causal inference will generally have to be evaluated indirectly using the strong form of construct validity (Cronbach and Meehl, 1955). In practice, the strong form of construct validity has continued to be popular in contexts like psychological research where theory is of primary interest (e.g. Loevinger 1957; Embretson, 1983), and where explanations of performance are the main objective (Zumbo, 2009). Lissitz and Samuelsen (2007) raised questions about Messick’s (1989a) framework (particularly its complexity) and suggested a framework and terminology that focused on the content and structure of the test as a definition of “the trait of interest” (p. 441). Their examples come particularly from achievement tests and their model makes sense for this kind of interpretation. They suggested a radical simplification of the scope of validity to something like Cureton’s (1951) relevance and reliability. Mislevy and his colleagues (Mislevy, Steinberg, & Almond, 2003) proposed an Evidence Centered Design (ECD) approach to the development and evaluation of assessments that relies on formal, probability-based models (particularly Bayes nets) and reasoning based on such models. Applications of ECD start with an analysis of the construct of interest, and they use student models and task models to develop assessment tasks that would generate the kinds of evidence needed to support the intended inferences (Mislevy et al., 2003). The ECD is akin to the argument-based frameworks in that it starts with a detailed specification of the construct and then seeks to develop evidence for the claims being made. As its name suggests ECD is focused more on assessment design than on validation as such, but it clearly has implications for validity as well. 
More recently, Mislevy (2018) has proposed a very ambitious sociocognitive approach, which is likely to be applicable mainly in educational contexts, because it assumes fairly rich background knowledge about test takers for its full implementation. That new validation frameworks tied to particular interpretations or uses continue to be developed (Krupa, Carney, and Bostic, 2019) should not be surprising. If we adopt a model for validating assessments for some kind of use, we can prescribe the kinds of evidence needed in some detail. If we adopt a unified framework for validity that is to apply to all cases (e.g., professional licensure examinations, and diagnostic assessments for the subtraction of fractions), it cannot be very prescriptive, because the use cases are so different. The more general the model, the more conditional it is likely to be. For this reason, the various editions of the Standards, which are intended to cover essentially all kinds of educational and psychological assessments, are highly contingent; most of the specific standards start with a “when” or an “if”. The argument-based approach to validity can accommodate various application-specific models, including the traditional content, criterion, and construct models. If the IUA focuses on level of achievement in some content domain, the validity argument would rely on evidence for content coverage and generalizability, as in Lissitz and Samuelsen’s (2007) model. If a causal (Borsboom et al., 2004) or explanatory (Zumbo, 2009) model is adopted, the validity argument is likely to include the strong program of construct validity (Cronbach, 1988).

Concluding Remarks

The tools, models, and analytic techniques available to the validator have expanded greatly, as has the range of applications of testing programs. Before 1920, the focus was on mental abilities, but now, a wide range of assessments targeted on a variety of uses need to be validated, and it is assumed that multiple lines of evidence will be involved in the validation. The four main trends in the history of validity theory that we have traced are: first, the development of several models for validity in the first half of the last century, in particular, the content, criterion, and trait models; second, the gradual development, during the second half of the last century, of unified models that subsumed the specific models under increasingly broad conceptions of construct validity; third, the development of a clear sense of the importance of fairness and consequences in the evaluation of testing programs since the 1960s; and fourth, the development of general argument-based models that explicitly allow for variability in the kinds of evidence needed for the validation of different kinds of testing programs. Naturally, successive frameworks for validity are shaped in part by the issues that seem most pressing at the time, and by the background and interests of those who propose them. The content model was designed for achievement tests, the criterion model to validate inferences from test scores to other variables (e.g., future performance), and the construct model for measures used in clinical contexts and in scientific research. The focuses on fairness and consequences arose from the need to justify selection, placement, and licensure programs. The unified models were designed to bring order to the large set of application-specific models that were developed to address particular interpretations and uses of test scores. A second impetus to the development of new models was a dialogue between validity theorists.
Messick (1989a) sought to make Cronbach’s 1971 formulation less pragmatic, and more scientific. When Michael Zieky asked Messick about the intended audience for his 1989 chapter, Messick replied, “Lee Cronbach” (Kane and Bridgeman, 2017, p.522). Cronbach’s (1988) notion of validity argument can be read as advocating a more pragmatic approach than that in Messick’s (1975, 1980) unified model, and Kane (1992, 2006, 2013) was if anything even more pragmatic. Borsboom et al. (2004) advocate a radical simplification of validity theory in reaction to the complexity of Messick’s (1989a) formulation, and Mislevy (2018) proposed a more structured approach to the development and validation of construct assessments than that provided by Messick. This conversation between practice and theory and among theorists will go on as new applications and problems arise and as old ones are revisited, and validity theorists are likely to be arguing about issues of bias, fairness, and consequences, as long as test scores are used to make life-altering decisions.

Note

1 The authors wish to thank Suzanne Lane and Stephen Sireci for their review and helpful comments on an earlier draft of this chapter.

References American Educational Research Association, American Psychological Association, & National Council on Measurement in Education (1985). Standards for educational and psychological testing. Washington, DC: American Psychological Association. American Educational Research Association, American Psychological Association, & National Council on Measurement in Education (1999). Standards for educational and psychological testing. Washington, DC: American Psychological Association. American Educational Research Association, American Psychological Association, & National Council on Measurement in Education (2014). Standards for educational and psychological testing. Washington, DC: American Educational Research Association. American Psychological Association (1954). Technical recommendations for psychological tests and diagnostic techniques. Psychological Bulletin Supplement, 51, 2, 1–38. American Psychological Association, American Educational Research Association, and National Council on Measurement in Education (1966). Standards for Educational and Psychological Tests and Manuals. Washington, DC: American Psychological Association. American Psychological Association, American Educational Research Association, and National Council on Measurement in Education (1974). Standards for Educational and Psychological Tests and Manuals. Washington, DC: American Psychological Association. Anastasi, A. (1950). The concept of validity in the interpretation of test scores. Educational and Psychological Measurement, 10(1), 67–78. Anastasi, A. (1986). Evolving concepts of test validation. Annual Review of Psychology, 37, 1–15. Angoff, W. H. (1988). Validity: An evolving concept. In H. Wainer & H. Braun (Eds.), Test Validity (pp. 9–13). Hillsdale, NJ: Lawrence Erlbaum. Bachman, L. & Palmer, A. (2010). Language assessment in practice: Developing language assessments and justifying their use in the real world. Oxford: Oxford University Press. Borsboom, D., Mellenbergh, G. 
J., & Van Heerden, J. (2004). The concept of validity. Psychological Review, 111, 1061–1071. Brennan, R. (2001a). An Essay on the history and future of reliability from the perspective of replications. JEM, 38(4), 295–317. Brennan, R. (2001b). Generalizability theory. New York, NY: Springer-Verlag. Campbell, D. T. (1960). Recommendations for APA test standards regarding construct, trait, or discriminant validity. American Psychologist, 15(8), 546–553. Campbell, D. T., & Fiske, D. W. (1959). Convergent and discriminant validation by the multitrait-multimethod matrix. Psychological Bulletin, 56, 81–105. Chapelle, C. A., Enright, M. K., & Jamieson, J. (Eds.), (2008). Building a validity argument for the test of English as a foreign language. New York, NY: Routledge. Cizek, G. (2012). Defining and distinguishing validity: Interpretations of score meaning and justifications of test use. Psychological Methods, 17(1), 31–43. Cook, T. & Campbell, D. (1979). Quasi-experimentation: Design and analysis issues for field settings. Boston, MA: Houghton Mifflin. Cronbach, L. J. (1971). Test validation. In R. L. Thorndike (Ed.), Educational measurement, 2nd ed. (pp. 443–507). Washington, DC: American Council on Education.

The Evolution of the Concept of Validity 203

Cronbach, L. J. (1980a). Validity on parole: How can we go straight? New directions for testing and measurement: Measuring achievement over a decade. Proceedings of the 1979 ETS Invitational Conference (pp. 99–108). San Francisco, CA: Jossey-Bass. Cronbach, L. J. (1980b). Selection theory for a political world. Public Personnel Management, 9(1), 37–50. Cronbach, L. J. (1982). Designing evaluations of educational and social programs. San Francisco, CA: Jossey-Bass. Cronbach, L. J. (1988). Five perspectives on the validity argument. In H. Wainer & H. Braun (Eds.), Test validity (pp. 3–17). Hillsdale, NJ: Lawrence Erlbaum. Cronbach, L. J. (1989). Construct validation after thirty years. In R. E. Linn (Ed.), Intelligence: Measurement, theory, and public policy (pp. 147–171). Urbana, IL: University of Illinois Press. Cronbach, L. J., Gleser, G. C., Nanda, H., & Rajaratnam, N. (1972). The dependability of behavioral measurements: Theory of generalizability for scores and profiles. New York, NY: Wiley. Cronbach, L. J., & Meehl, P. E. (1955). Construct validity in psychological tests. Psychological Bulletin, 52, 281–302. Crooks, T. J. (1988). The impact of classroom evaluation practices on students. Review of Educational Research, 58, 438–481. Crooks, T., Kane, M. & Cohen, A. (1996). Threats to the valid use of assessments. Assessment in Education, 3, 265–285. Cureton, E. E. (1951). Validity. In E.F. Lingquist (Ed.), Educational measurement. Washington, DC: American Council on Education. Ebel, R. (1961). Must all tests be valid? American Psychologist, 16, 640–647. Embretson (Whitely), S. (1983). Construct validity: Construct representation versus nomothetic span. Psychological Bulletin, 93, 179–197. Equal Employment Opportunity Commission (EEOC), Civil Service Commission, Department of Labor, and Department of Justice (1979). Adoption by four agencies of Uniform Guidelines on Employee Selection Procedures. Federal Register, 43, 38290– 38315. Fredericksen, N. (1984). 
The real test bias: Influences of testing on teaching and learning. American Psychologist, 39, 193–202. Guion, R. (1977). Content validity: The source of my discontent. Applied Psychological Measurement, 1, 1–10. Guion, R. (1980). On trinitarian conceptions of validity. Professional Psychology, 11, 385–398. Guion, R. (1998). Assessment, measurement, and prediction for personnel decisions. Mahwah, NJ: Erlbaum. Gulliksen H. (1950) Theory of mental tests. New York, NY: Wiley. Republished 1987 by Lawrence Erlbaum, Hillsdale, NJ. Haertel, E. H. (1999). Validity arguments for high-stakes testing: In search of the evidence. Educational Measurement: Issues and Practice, 18(4), 5–9. Holland, P. W. (1986). Statistics and causal inference. Journal of the American Statistical Association, 81, 945–960. House, E. R. (1980). Evaluating with validity. Beverly Hills, CA: Sage Publications. Jöreskog, K. (1973). A general method for investigating a linear structural equation system. In A. Goldberger & D. Duncan (Eds.), Structural equation models in the social sciences (pp. 85–112). New York, NY: Academic Press. Kane, M. (2006). Validation. In R. Brennan (Ed.), Educational measurement (4th ed., pp. 17–64), Westport, CT: American Council on Education and Praeger.


Kane, M. (1992). An argument-based approach to validation. Psychological Bulletin, 112, 527–535. Kane, M. (2013). Validating the interpretations and uses of assessment scores. Journal of Educational Measurement, 50, 1–73. Kane, M. T. (1982). A sampling model for validity. Applied Psychological Measurement, 6, 125–160. Kane, M. T., Crooks, T. J., & Cohen, A. S. (1999). Validating measures of performance. Educational Measurement: Issues and Practice, 18(2), 5–17. Kane, M., & Bridgeman, B. (2017). Research on validity theory and practice at ETS. In R. Bennett and M. von Davier (Eds.), Advancing Human Assessment. Methodology of Educational Measurement and Assessment. Cham: Springer. https://doi.org/10.1007/978-3-319-58689-2_16. Kelley, T. (1927). Interpretation of educational measurements. Yonkers, NY: World Book. Krupa, E., Carney, M., & Bostic, J. (2019). Argument-based validation in practice: Examples from mathematics education. Applied Measurement in Education, 32, 1–9. Lane, S., Parke, C., & Stone, C. (1998). A framework for evaluating the consequences of assessment programs. Educational Measurement: Issues and Practice, 17(2), 24–28. Lane, S., & Stone, C. (2006). Performance assessment. In R. Brennan (Ed.), Educational measurement (4th ed., pp. 387–431). Westport, CT: American Council on Education and Praeger. Linn, R. L. (1997). Evaluating the validity of assessments: The consequences of use. Educational Measurement: Issues and Practice, 16(2), 14–16. Lissitz, R., & Samuelsen, K. (2007). A suggested change in terminology and emphasis regarding validity and education. Educational Researcher, 36, 437–448. Loevinger, J. (1957). Objective tests as instruments of psychological theory. Psychological Reports, Monograph Supplement, 3, 635–694. Lord, F. M. (1980). Applications of item response theory to practical testing problems. Mahwah, NJ: Lawrence Erlbaum Associates. Lord, F., & Novick, M. (1968). Statistical theories of mental test scores. Reading, MA: Addison-Wesley. 
McDonald, R. P. (1985). Factor analysis and related methods. Hillsdale, NJ: Erlbaum. Madaus, G. F. (1988). The influences of testing on the curriculum. In L. N. Tanner (Ed.), Critical issues in curriculum (pp. 83–121). Chicago, IL: University of Chicago Press. Markus, K. & Borsboom, D. (2013). Frontiers of test validity theory; measurement, causation, and meaning. New York, NY: Routledge. Mehrens, W. A. (1997). The consequences of consequential validity. Educational Measurement: Issues and Practice, 16(2), 16–18. Messick, S. (1975). The standard problem: Meaning and values in measurement and evaluation. American Psychologist, 30, 955–966. Messick, S. (1980). Test validity and the ethics of assessment. American Psychologist, 35, 1012–1027. Messick, S. (1982). The values of ability testing: Implications of multiple perspectives about criteria and standards. Educational Measurement: Issues and Practice, 1(3), 9–12, 20. Messick, S. (1988). The once and future issues of validity. Assessing the meaning and consequences of measurement. In H. Wainer & H. Braun (Eds.), Test validity (pp. 33– 45). Hillsdale, NJ: Erlbaum. Messick, S. (1989a). Validity. In R. L. Linn (Ed.), Educational Measurement, 3rd ed. (pp. 13– 103) New York, NY: American Council on Education and Macmillan.


Messick, S. (1989b). Meaning and values in test validation: The science and ethics of assessment. Educational Researcher, 18(2), 5–11. Messick, S. (1995). Standards of validity and the validity of and standards in performance assessment. Educational Measurement: Issues and Practice, 14(4), 5–8. Mislevy, R., Steinberg, L., & Almond, R. (2003). On the structure of educational assessments. Measurement: Interdisciplinary Research and Perspectives, 1, 3–62. Mislevy, R. (2018). Sociocognitive foundations of educational measurement. New York, NY: Routledge. Moss, P. (1992). Shifting conceptions of validity in educational measurement: Implications for performance assessment. Review of Educational Research, 62, 229–258. Moss, P. (1995). Themes and variations in validity theory. Educational Measurement: Issues and Practice, 14(2), 5–13. Moss, P. A. (1998). The role of consequences in validity theory. Educational Measurement: Issues and Practice, 17(2), 6–12. Newton, P. E., & Shaw, S. D. (2013). Standards for talking and thinking about validity. Psychological Methods, 18(3), 301–319. Popham, W. J. (1997). Consequential validity: Right concern – wrong concept. Educational Measurement: Issues and Practice, 16(2), 9–13. Ryans, D. G., & Frederiksen, N. (1951). Performance tests of educational achievement. In E. F. Lindquist (Ed.), Educational measurement (pp. 455–494). Washington, DC: American Council on Education. Rulon, P. J. (1946). On the validity of educational tests. Harvard Educational Review, 16, 290–296. Shepard, L. A. (1993). Evaluating test validity. In L. Darling-Hammond (Ed.), Review of Research in Education, Vol. 19 (pp. 405–450). Washington, DC: American Educational Research Association. Shepard, L. A. (1997). The centrality of test use and consequences for test validity. Educational Measurement: Issues and Practice, 16(2), 5–24. Sireci, S. G. (1998). The construct of content validity. Social Indicators Research, 45, 83–117. Sireci, S. G. (2009). 
Packing and unpacking sources of validity evidence: History repeats itself again. In R. Lissitz (Ed.), The concept of validity (pp. 19–38). Charlotte, NC: Information Age Publishers. Sireci, S. G. (2016). On the validity of useless tests. Assessment in Education: Principles, Policies, and Practice, 23, 226–235. Spearman, C. (1904). “General intelligence” objectively determined and measured. American Journal of Psychology, 15, 201–292. Suppe, F. (1977). The structure of scientific theories. Urbana, IL: University of Illinois Press. Thorndike, E. L. (1918). Individual differences. Psychological Bulletin, 15, 148–159. Zumbo, B. D. (2009). Validity as contextualized and pragmatic explanation, and its implications for validation practice. In R. Lissitz (Ed.), The concept of validity (pp. 65–82). Charlotte, NC: Information Age Publishers. Zwick, R. (2006). Higher education admission testing. In R. Brennan (Ed.), Educational measurement (4th ed., pp. 647–679), Westport, CT: American Council on Education and Praeger.

10 GENERALIZABILITY THEORY Robert L. Brennan1

Generalizability (G) theory is principally associated with behavioral measurements, particularly their sources of error. As such, G theory is intimately related to reliability and validity issues. G theory is usually viewed as beginning with a monograph by Cronbach, Gleser, Nanda, and Rajaratnam (1972) entitled The Dependability of Behavioral Measurements: Theory of Generalizability for Scores and Profiles. Notably, however, these authors published earlier papers in the 1960s that framed much of the theory, and prior to 1960 numerous authors (many cited later) laid the foundation for various aspects of G theory.

Perhaps because Cronbach et al. (1972) has four authors, the contributions of Gleser, Nanda, and Rajaratnam are sometimes overlooked. Although it is virtually impossible to disentangle contributions, Gleser surely played an important role in developing the framework for G theory, while Nanda and Rajaratnam were more statistically focused.

Since the publication of the Cronbach et al. (1972) monograph, numerous authors have made contributions to G theory. Largely, these contributions clarified aspects of the theory, enhanced it in various ways, made it more accessible to researchers and practitioners, and/or illustrated the applicability of G theory in various disciplines. Many of the notable G theory extensions since 1972 are not uniquely novel; often, they were foreshadowed by Cronbach et al. (1972). In this sense, Cronbach et al. (1972) has provided the framework for a large share of the G theory literature since 1972, with some noteworthy exceptions.

An indispensable starting point for a history of G theory is the preface and parts of the first chapter of Cronbach et al. (1972). In addition, Cronbach (1976, 1989, 1991) offers numerous perspectives on G theory and its history. Cronbach (1991) is particularly rich with first-person reflections. Brennan (1997) provided his perspective on some historical aspects of the first 25 years of G theory. Other overviews of G theory history are cited later. In his last publication, Cronbach (2004) offered some thoughts on his current thinking about G theory.

This chapter begins with a brief overview of two perspectives on G theory: 1) G theory as an extension of classical test theory; and 2) the conceptual framework of G theory, with the express purpose of informing the reader about G theory’s scope, depth, terminology, and especially concepts. This overview is followed by three sections covering G theory contributions in successive, approximately 20-year time spans.

The references cited herein are illustrative of the history of G theory and reflective of those who have written about it. The cited references are not an exhaustive set. (Brennan, 2020, provides over 300 references.) Occasionally, the author provides first-person accounts of his own experience with G theory. These comments and, indeed, the entire chapter reflect the author’s perspective only. Different authors might have different, equally legitimate perspectives.

An Overview of G Theory

G theory is very powerful, but the power is purchased at the price of conceptual complexities. An understanding of the history of G theory, therefore, requires a reasonably good grasp of its lineage and conceptual framework. That is the purpose of this section. (See Brennan, 2010a, b for more extended introductions.)

Two of the most common statements about G theory are that: 1) G theory is an extension of classical test theory (CTT); and 2) G theory is the application of analysis of variance (ANOVA) to measurement issues. Although there is an element of truth in these assertions, both of them (especially the second) fail to capture the essence of G theory and its widespread applicability to measurement issues in numerous disciplines.

G Theory as an Extension of CTT

CTT begins with the simple assertion that observed scores ($X$) equal true scores ($T$) plus error scores ($E$) (i.e., $X = T + E$). Then, CTT assumes that the expected value of $E$ is zero and that $T$ and $E$ are uncorrelated. In addition, CTT has a lengthy list of assumptions that might be made about parallelism of test forms, each of which potentially leads to different results (see Haertel, 2006, and Feldt & Brennan, 1989). CTT has a long and distinguished history, and it is not likely to be completely replaced by any other psychometric model. (See Clauser [this volume] for a discussion of the history of CTT.)

Still, CTT has some limitations. In particular, all errors are clumped into a single undifferentiated term $E$, whereas it seems intuitively clear that errors can arise from multiple sources (e.g., sampling of tasks, raters, etc.). Univariate G theory (UGT), by contrast, embodies a conceptual framework and statistical procedures that permit decomposing $E$ into multiple, random component parts; i.e., $E = E_1 + E_2 + \cdots + E_k$, where $k$ is under the control of the investigator. Furthermore, UGT distinguishes between two different types of error that are discussed later: relative error ($\delta$) and absolute error ($\Delta$). Note that UGT is essentially a random effects theory.

Importantly, CTT does not model items per se. By contrast, G theory assumes that items are a random facet with an associated variance component that is conceptually related to the variability of item difficulty levels. The G theory modeling of items (and any other “main” effect such as raters, prompts, etc.) has many important consequences that differentiate it from CTT.

Multivariate G theory (MGT) extends UGT by decomposing the undifferentiated true score of CTT into $m$ fixed component parts ($T = T_1 + \cdots + T_m$). In some circumstances, it is possible to employ UGT with a mixed model in which a facet is fixed, but doing so can be statistically complicated, especially for unbalanced designs. By contrast, MGT is specifically designed to handle one or more fixed facets in a theoretically appealing and relatively straightforward manner. Accordingly, it is sometimes stated that MGT is the “whole” of G theory, with random effects models within each one of the levels of one or more fixed facets.
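
As an illustrative sketch (not taken from the chapter), the decomposition of $E$ can be written out for the simplest case: persons crossed with a single random item facet, $p \times i$. The effect symbols $\nu$, the universe score $\mu_p = \mu + \nu_p$, and the D study sample size $n'_i$ are notation assumed here, following common G theory usage; $\sigma^2(pi)$ is understood to include residual error, and uppercase $I$ denotes a mean over $n'_i$ items.

```latex
\begin{align*}
X_{pi} &= \mu + \nu_p + \nu_i + \nu_{pi},
  & \sigma^2(X_{pi}) &= \sigma^2(p) + \sigma^2(i) + \sigma^2(pi), \\
\Delta_p &= X_{pI} - \mu_p = \nu_I + \nu_{pI},
  & \sigma^2(\Delta) &= \frac{\sigma^2(i)}{n'_i} + \frac{\sigma^2(pi)}{n'_i}, \\
\delta_p &= (X_{pI} - \mu_I) - (\mu_p - \mu) = \nu_{pI},
  & \sigma^2(\delta) &= \frac{\sigma^2(pi)}{n'_i}.
\end{align*}
```

Under these assumptions the only difference between the two error variances is the item main effect $\sigma^2(i)/n'_i$, which is exactly the term that CTT, lacking a model for items, cannot represent.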

The Conceptual Framework of G Theory

With the caveats noted previously, there is an element of truth in characterizing CTT as a G theory parent, in a measurement sense. By contrast, ANOVA is primarily a statistical parent in that univariate G theory involves random effects models and focuses on the estimation of variance components. Note that F tests have no role in G theory. Rather, from a statistical perspective, it is the magnitudes of estimated variance components that are of interest. G theory itself, however, is considerably different from the simple conjunction of its parents, CTT and ANOVA. In particular, conceptual issues that are obscure in CTT and ANOVA are important in G theory. These matters are discussed next in the context of a hypothetical performance testing example.

Universe of Admissible Observations

Consider, for example, a very large (potential) set of essay prompts ($t$) that might be used in a testing program, as well as a very large set of raters ($r$) who might evaluate responses to these prompts. Assume, as well, that any rater might (in principle) evaluate any response to any prompt. In this case, we would say that the universe of admissible observations (UAO) consists of two facets ($t$ and $r$) that are crossed; hence, the UAO is denoted $t \times r$. Presumably, there exists some well-defined set of persons ($p$) about whom we wish to draw conclusions; they are called the “objects of measurement.” (In most circumstances, persons are indeed the objects of measurement “facet”, but G theory permits any facet to have the status of “objects of measurement.”)

G Study and Estimated Variance Components

A G study is designed to estimate variance components for the UAO. Using the above example, a G study might be based on a sample of $t$ and $r$ from the UAO, and a sample of $p$ from the population. One possible G study design is $p \times t \times r$, in which each of $n_p$ persons responds to $n_t$ prompts, with each person-prompt combination evaluated by $n_r$ raters. Let us suppose that an investigator, Mary, conducts a G study using the $p \times t \times r$ design and obtains the following seven estimated variance components:

$$\hat{\sigma}^2(p) = .25;\quad \hat{\sigma}^2(t) = .06;\quad \hat{\sigma}^2(r) = .02;\quad \hat{\sigma}^2(pt) = .15;\quad \hat{\sigma}^2(pr) = .04;\quad \hat{\sigma}^2(tr) = .00;\quad \text{and}\quad \hat{\sigma}^2(ptr) = .12.$$
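
The chapter does not say how Mary obtained her estimates. One common route for balanced, complete data is the ANOVA approach: compute the mean squares and solve the expected-mean-square equations of the random effects model. The sketch below assumes that approach; the function name `g_study_p_t_r` and the small additive data set are hypothetical and serve only to show the mechanics.

```python
from itertools import product
from statistics import mean


def g_study_p_t_r(x):
    """Estimate variance components for a balanced, fully crossed p x t x r design.

    x[p][t][r] is the single score for person p, prompt t, rater r.
    Each facet needs at least two levels so every mean square is defined.
    """
    n_p, n_t, n_r = len(x), len(x[0]), len(x[0][0])
    P, T, R = range(n_p), range(n_t), range(n_r)

    # Grand mean and marginal means for main effects and two-way combinations.
    m = mean(x[p][t][r] for p, t, r in product(P, T, R))
    m_p = [mean(x[p][t][r] for t, r in product(T, R)) for p in P]
    m_t = [mean(x[p][t][r] for p, r in product(P, R)) for t in T]
    m_r = [mean(x[p][t][r] for p, t in product(P, T)) for r in R]
    m_pt = [[mean(x[p][t][r] for r in R) for t in T] for p in P]
    m_pr = [[mean(x[p][t][r] for t in T) for r in R] for p in P]
    m_tr = [[mean(x[p][t][r] for p in P) for r in R] for t in T]

    # Mean squares: sums of squares divided by degrees of freedom.
    ms = {
        "p": n_t * n_r * sum((m_p[p] - m) ** 2 for p in P) / (n_p - 1),
        "t": n_p * n_r * sum((m_t[t] - m) ** 2 for t in T) / (n_t - 1),
        "r": n_p * n_t * sum((m_r[r] - m) ** 2 for r in R) / (n_r - 1),
        "pt": n_r * sum((m_pt[p][t] - m_p[p] - m_t[t] + m) ** 2
                        for p, t in product(P, T)) / ((n_p - 1) * (n_t - 1)),
        "pr": n_t * sum((m_pr[p][r] - m_p[p] - m_r[r] + m) ** 2
                        for p, r in product(P, R)) / ((n_p - 1) * (n_r - 1)),
        "tr": n_p * sum((m_tr[t][r] - m_t[t] - m_r[r] + m) ** 2
                        for t, r in product(T, R)) / ((n_t - 1) * (n_r - 1)),
        "ptr": sum((x[p][t][r] - m_pt[p][t] - m_pr[p][r] - m_tr[t][r]
                    + m_p[p] + m_t[t] + m_r[r] - m) ** 2
                   for p, t, r in product(P, T, R))
               / ((n_p - 1) * (n_t - 1) * (n_r - 1)),
    }

    # Solve the expected-mean-square equations (random effects model).
    return {
        "p": (ms["p"] - ms["pt"] - ms["pr"] + ms["ptr"]) / (n_t * n_r),
        "t": (ms["t"] - ms["pt"] - ms["tr"] + ms["ptr"]) / (n_p * n_r),
        "r": (ms["r"] - ms["pr"] - ms["tr"] + ms["ptr"]) / (n_p * n_t),
        "pt": (ms["pt"] - ms["ptr"]) / n_r,
        "pr": (ms["pr"] - ms["ptr"]) / n_t,
        "tr": (ms["tr"] - ms["ptr"]) / n_p,
        "ptr": ms["ptr"],
    }


# Hypothetical additive scores: person + prompt + rater effects, no interactions.
person, prompt, rater = [1.0, 2.0, 4.0, 7.0], [0.0, 1.0, 3.0], [0.0, 0.5]
scores = [[[a + b + c for c in rater] for b in prompt] for a in person]
est = g_study_p_t_r(scores)
```

Because the illustrative scores are purely additive, the interaction estimates come out at essentially zero and each main-effect estimate equals the sample variance of the corresponding effects. With real data, ANOVA estimates can be negative; how to handle that is a separate decision not covered by this sketch.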

Universe of Generalization and D Study

The G study variance components for the UAO are for single persons, prompts, and raters. By contrast, a universe of generalization (UG) for this UAO is a universe of randomly parallel forms, with scores for each form being mean scores for $n'_t$ prompts evaluated by $n'_r$ raters from the UAO. A decision (D) study estimates the quantities of interest, such as error variances, for a specified UG. Furthermore, the structure of the D study need not mirror that of the G study.

Mary, the researcher analyzing this data set, notes that $\hat{\sigma}^2(r)$ and $\hat{\sigma}^2(tr)$ are relatively small, suggesting that a small number of raters per prompt may be sufficient for adequate measurement precision. By contrast, $\hat{\sigma}^2(t)$ and $\hat{\sigma}^2(pt)$ are relatively large, suggesting that it is desirable to use as many prompts as possible in a D study. Nonetheless, available testing time constrains how many prompts Mary can use, although she has access to a relatively large number of raters. Accordingly, Mary decides that a more efficient D study design for her purposes will be $p \times (R:T)$, where the colon denotes nesting of raters within prompts, and uppercase letters designate mean scores over $n'_t = 3$ prompts and $n'_r = 2$ raters per prompt.

For her D study design and sample sizes, the estimated variance components are easily obtained from the G study (see, for example, Brennan, 1992b) as follows:

$$\hat{\sigma}^2(p) = .25;\quad \hat{\sigma}^2(T) = .06/3 = .02;\quad \hat{\sigma}^2(pT) = .15/3 = .05;\quad \hat{\sigma}^2(R:T) = (.02 + .00)/6 = .003;\quad \text{and}\quad \hat{\sigma}^2(pR:T) = (.04 + .12)/6 = .027;$$

where $\hat{\sigma}^2(p) = .25$ is estimated universe score variance.
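
The arithmetic above can be sketched in a few lines. The function name `d_study_p_x_r_in_t` and the dictionary keys are hypothetical labels chosen for this illustration; the rules themselves (divide prompt effects by the number of prompts, and pool the rater effects with their prompt interactions before dividing by the total number of ratings per person) are read directly off the values in the text.

```python
def d_study_p_x_r_in_t(g, n_t, n_r):
    """D study variance components for p x (R:T) from p x t x r G study components.

    g maps the seven G study effects to estimated variance components;
    n_t is the number of prompts and n_r the number of raters per prompt.
    """
    return {
        "p": g["p"],                                   # universe score variance
        "T": g["t"] / n_t,
        "pT": g["pt"] / n_t,
        "R:T": (g["r"] + g["tr"]) / (n_t * n_r),       # r and tr are confounded under nesting
        "pR:T": (g["pr"] + g["ptr"]) / (n_t * n_r),    # pr and ptr are confounded under nesting
    }


# Mary's G study estimates as reported in the text.
g_study = {"p": 0.25, "t": 0.06, "r": 0.02, "pt": 0.15,
           "pr": 0.04, "tr": 0.00, "ptr": 0.12}
d_study = d_study_p_x_r_in_t(g_study, n_t=3, n_r=2)

for effect, value in d_study.items():
    print(f"{effect:>5}: {value:.3f}")
```

Rounded to three decimals, the printed values match those given for Mary's D study. Calling the function with other values of `n_t` and `n_r` shows how the components respond to alternative sample sizes, which is the kind of comparison that led Mary to favor more prompts over more raters.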


D Study Error Variances

The single most important outcome of a D study is the estimation of error variance, or its square root, called a “standard error of measurement” (SEM). As Kane (2011) states, “Errors don’t exist in our data” (p. 12). From the perspective of G theory, error variance is dependent upon the investigator’s specification of a UG and the associated D study variance components.

The most natural definition of error variance in G theory is the expected value of the squared difference between observed scores and universe scores, which is called absolute error variance, denoted $\sigma^2(\Delta)$. For Mary’s UG and D study, $\hat{\sigma}^2(\Delta)$ is simply the sum of all the D study estimated variance components except $\hat{\sigma}^2(p)$; i.e.,

$$\hat{\sigma}^2(\Delta) = .02 + .05 + .003 + .027 = .1$$

for person mean scores over three prompts and two raters per prompt. The square root, $\hat{\sigma}(\Delta) = .316$, is the absolute-error SEM. Is this large or small? One way to answer this question is to compute an error/tolerance ratio, $E/T$ (see Kane, 1996), using $\hat{\sigma}(\Delta)$ as error. Defining tolerance is the investigator’s responsibility. If tolerance is defined as $\hat{\sigma}(p) = \sqrt{.25} = .5$, then

$$E/T = \hat{\sigma}(\Delta)/\hat{\sigma}(p) = .316/.5 = .632,$$

which means that $\hat{\sigma}(\Delta)$ is a little over 60% as large as $\hat{\sigma}(p)$. A corresponding reliability-like coefficient is

$$\hat{\Phi} = \frac{\hat{\sigma}^2(p)}{\hat{\sigma}^2(p) + \hat{\sigma}^2(\Delta)} = .714.$$

Relative error variance, $\sigma^2(\delta)$, is closely associated with CTT. For Mary’s UG and D study,

$$\hat{\sigma}^2(\delta) = \hat{\sigma}^2(pT) + \hat{\sigma}^2(pR:T) = .05 + .027 = .077,$$

and the relative-error SEM is $\hat{\sigma}(\delta) = .277$. Assuming tolerance is defined as $\hat{\sigma}(p)$, the error/tolerance ratio is

$$E/T = \hat{\sigma}(\delta)/\hat{\sigma}(p) = .277/.5 = .555,$$

which means that $\hat{\sigma}(\delta)$ is about 56% as large as $\hat{\sigma}(p)$. The corresponding generalizability coefficient is

$$E\hat{\rho}^2 = \frac{\hat{\sigma}^2(p)}{\hat{\sigma}^2(p) + \hat{\sigma}^2(\delta)} = .765.$$
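
The two error variances and the statistics derived from them can be collected in one small routine. The sketch below is illustrative only: `error_summary` and its keys are hypothetical names, tolerance is taken to be the universe score standard deviation as in the example, and the inputs are the rounded D study components from the text.

```python
from math import sqrt


def error_summary(d):
    """Summarize absolute and relative error for a p x (R:T) D study.

    d maps effect labels to D study estimated variance components.
    """
    universe = d["p"]                                   # universe score variance
    abs_var = d["T"] + d["pT"] + d["R:T"] + d["pR:T"]   # sigma^2(Delta): everything except p
    rel_var = d["pT"] + d["pR:T"]                       # sigma^2(delta): effects involving p
    tolerance = sqrt(universe)                          # tolerance chosen as sigma(p)
    return {
        "abs_var": abs_var,
        "abs_sem": sqrt(abs_var),
        "abs_e_over_t": sqrt(abs_var) / tolerance,
        "phi": universe / (universe + abs_var),         # index of dependability
        "rel_var": rel_var,
        "rel_sem": sqrt(rel_var),
        "rel_e_over_t": sqrt(rel_var) / tolerance,
        "gen_coef": universe / (universe + rel_var),    # generalizability coefficient
    }


# Rounded D study components for Mary's p x (R:T) design, as given in the text.
mary = {"p": 0.25, "T": 0.02, "pT": 0.05, "R:T": 0.003, "pR:T": 0.027}
summary = error_summary(mary)

for name, value in summary.items():
    print(f"{name:>12}: {value:.3f}")
```

To three decimals, the printed values agree with those reported above. Because absolute error variance contains every term in relative error variance plus the prompt and rater effects, it can never be the smaller of the two, so the index of dependability can never exceed the generalizability coefficient for the same design.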


Multivariate G Theory

The basic issues discussed above for UGT apply to MGT, as well. An important difference, however, is that variance components in UGT are replaced by variance-covariance matrices in MGT. Suppose, for example, that the UAO consists of two fixed types of essay prompts [say, narrative ($a$) and persuasive ($b$)] and two corresponding sets of raters. Suppose, as well, that the same population of persons responds to both types of prompts. A possible G study might be $p^{\bullet} \times (r^{\circ}:t^{\circ})$, which would give estimates of variance components for $p$, $t$, $r:t$, $pt$, and $pr:t$, for both types of prompts. In addition, this design would give an estimate of the covariance component $\sigma_{ab}(p)$ for persons responding to both types of prompts (the fixed facet). In the notation for this design [$p^{\bullet} \times (r^{\circ}:t^{\circ})$], the bullet superscript ($p^{\bullet}$) indicates that the person matrix is full (i.e., contains the covariance component). The open-circle superscripts indicate that $t$ and $r$ are specific to prompt types, without associated covariance components. The corresponding D study design would be $p^{\bullet} \times (R^{\circ}:T^{\circ})$.

It is very rare for any application of G theory to make use of all of the distinctions incorporated in the theory. Indeed, the so-called “protean quality” of G theory (Cronbach, 1976, p. 199) tends to make any single application take on a life of its own.

Precursors to G Theory

In discussing the genesis of G theory, Cronbach (1991) states:

In 1957 I obtained funds from the National Institute of Mental Health to produce, with Gleser’s collaboration, a kind of handbook of measurement theory…. Since reliability has been studied thoroughly and is now understood, I suggested to the team, “let us devote our first few weeks to outlining that section of the handbook, to get a feel for the undertaking.” We learned humility the hard way, the enterprise never got past that topic. Not until 1972 did the book appear … that exhausted our findings on reliability reinterpreted as generalizability. Even then, we did not exhaust the topic. When we tried initially to summarize prominent, seemingly transparent, convincingly argued papers on test reliability, the messages conflicted. (pp. 391–392)

To resolve these conflicts, Cronbach and his colleagues devised a rich conceptual framework and married it to analysis of random effects variance components. The net effect is “a tapestry that interweaves ideas from at least two dozen authors” (Cronbach, 1991, p. 394) who published their work primarily in the 1940s and 1950s.


The statistical machinery employed in G theory has its genesis in Fisher’s (1925) work on ANOVA. However, the estimation of random effects variance components was not researched intently until the late 1940s (see, for example, Crump, 1946, and Eisenhart, 1947). This research was brought to Cronbach’s attention by a graduate student, Milton Meux, about 1957 (L. J. Cronbach, personal communication, April 18, 1997) at approximately the same time that Cornfield and Tukey (1956) published their rules for expressing expected mean square equations in terms of variance components. In this sense, the primary statistical machinery for G theory was in place (at least somewhat) before the conceptual framework was fully specified.

By 1950 there was a rich literature on reliability from the perspective of CTT. Most of this literature had been superbly summarized by Gulliksen (1950), which included chapters on experimental methods for estimating reliability, as well as reliability estimated by item homogeneity (what came to be called internal consistency estimates). Such estimates included Hoyt’s (1941) ANOVA version of Kuder and Richardson’s (1937) KR20 index. Hoyt, however, was not the first researcher to apply ANOVA to measurement problems. An earlier contribution was made by Burt (1936) in his treatment of the analysis of examination marks.

Gulliksen’s (1950) book was published before Cronbach’s widely cited 1951 paper that introduced Coefficient $\alpha$. For the next several years a great deal of research on reliability formed the backdrop for G theory. Finlayson’s (1951) study of grades assigned to essays was probably the first treatment of reliability in terms of variance components. Shortly thereafter Pilliner (1952) provided theoretical relations between intraclass correlations and ANOVA (see also Haggard, 1958).

Long before G theory was formally conceived, Cronbach (1947) had expressed the concern that some type of multifacet analysis was needed to resolve inconsistencies in some estimates of reliability. In the 1950s various researchers began to exploit the fact that ANOVA could handle multiple facets simultaneously. Particular examples include Loveland’s (1952) doctoral dissertation, work by Medley, Mitzel, and Doi (1956) on classroom observations, and Burt’s (1955) treatment of test reliability estimated by ANOVA. Importantly, Lindquist (1953, chap. 16) laid out an extensive exposition of multifacet theory that focused on the estimation of variance components in reliability studies. Lindquist demonstrated that multifacet analyses lead to alternative definitions of error and reliability coefficients. Lindquist’s chapter clearly foreshadowed important parts of G theory.

Validity Issues Motivate G Theory

Cronbach was on the faculty at the University of Chicago from 1946 to 1948. He recalls that:

Five minutes with Joseph Schwab had a profound influence…. In some context Schwab remarked that biologists have to decide what to count as a species…. Schwab was acute enough to catch my flicker of surprise and force home the idea of scientist as construer rather than as discoverer of categories the Creator had in mind. That conversation … resonates in my thinking to this day. (Cronbach, 1989, p. 72, italics added)

Given this perspective, it is not surprising that G theory requires that investigators define the conditions of measurement of interest to them. The theory emphatically disavows any notion of there being a “correct” set of conditions of measurement. Also, from the perspective of G theory, the particular tasks or items represented in an a priori dataset are not a sufficient specification of a measurement procedure. These notions are central to the conceptual framework of G theory, but they are not entirely novel. Guttman, for example, once made the provocative remark that a test belongs to several sets, and therefore has several reliabilities. “List as many 4-letter words that begin with t as you can.” That word-fluency task fits into at least three families: 4-letter words beginning with a specified letter, t words of a specified length, and 4-letter words with t in a specified position. The investigator’s theory, rather than an abstract concept of truth and error, determines which family contains tests that “measure the same variable” (Cronbach, 1991, p. 394).

Absolute Error and Relative Error

In retrospect, most of the reliability literature in the 1940s and 1950s was rather confusing (and sometimes seemingly contradictory) for at least three reasons. First, it focused on estimates of reliability without always clearly specifying what constitutes a replication of the measurement procedure (see Brennan, 2001a). Second, the literature focused more on correlationally based coefficients than on the more fundamental matter of error variance. Third, the CTT model, with its single error term, was incapable of directly distinguishing between different types of error that arise depending upon how error is defined and designs are specified.

In 1951 Ebel published a paper on the reliability of ratings in which he essentially considered two types of error variance: one that included, and another that excluded, rater main effects. It wasn’t until G theory was fully formulated that the issues Ebel grappled with were truly clarified in the distinction between relative ($\delta$) and absolute ($\Delta$) error for various designs.

Very much the same problems were considered by Lord (1955, 1957, 1959) in a classic series of papers about conditional standard errors of measurement (CSEMs) and reliability under the assumptions of the binomial error model (Lord, 1962). The issues Lord was grappling with had a clear influence on the development of G theory. According to Cronbach (personal communication, 1996), about 1957 Lord visited the Cronbach team in Urbana. Their discussions suggested that the error in Lord’s formulation of the binomial error model for an individual person could not be the same error as that in CTT for a crossed design. This very important insight was incorporated in G theory through the distinction between relative and absolute error. Strictly speaking, in G theory absolute error $\Delta$ is the most direct notion of error in that it is the difference between observed and universe scores. By contrast, relative error $\delta$ (the difference between two deviation scores) provides a kind of bridge between G theory and CTT.

Coefficient Alpha

Cronbach et al. (1972) make very little explicit reference to the Coefficient $\alpha$ paper (Cronbach, 1951). That paper, however, seems to have anticipated at least some aspects of G theory. First, there are indications that Cronbach anticipated writing subsequent papers involving different coefficients (presumably to be named $\beta$, $\gamma$, etc.). The development of G theory rendered such papers superfluous.

Second, Cronbach (1951) takes issue with the conventional wisdom (formalized subsequently by Lord & Novick, 1968) that $\alpha$ is a lower limit to reliability. From the perspective of G theory, Coefficient $\alpha$ can be an upper limit to reliability if, for example, the intended UG is “larger” than the actual design used for data collection. Clearly, Cronbach’s conceptual framework for $\alpha$ foreshadowed the distinction between a data collection design and an intended universe, an important feature of G theory.

Third, commenting on random sampling in G theory and Coefficient $\alpha$, Cronbach (2004) states:

Only one event in the early 1950s influenced my thinking: Frederick Lord’s (1955) article in which he introduced the concept of randomly parallel tests (p. 393)…. It was not until Lord’s (1955) explicit formulation of the idea of randomly parallel tests that we began to write generally about the sampling, not only of persons, but of items (p. 401)…. My 1951 article embodied the randomly parallel-test concept …, but only in indefinite language. (p. 402)

Fourth, the conceptual framework of Cronbach (1951) challenged the ubiquitous assumption of unidimensionality. Specifically, the paper “cleared the air by getting rid of the assumption that the items of a test were unidimensional” (Cronbach, 2004, p. 397). Importantly, in his last published paper Cronbach (2004) concluded:

I no longer regard the alpha formula as the most appropriate way to examine most data. Over the years, my associates and I developed the complex generalizability (G) theory. (p. 403)


In short, the fact that Coefficient $\alpha$ is historically important, well known, and widely used does not necessarily mean that it is a defensible estimate of reliability in all (or even many) contexts.

The 1960s and 1970s

The genius of Cronbach and his colleagues was their creation of a conceptual framework and use of a methodology (variance components analysis) that integrated the contributions of numerous researchers, even when some contributions seemed to conflict with each other. The essential features of univariate G theory were largely completed with technical reports in 1960–61 that were revised into three journal articles, each with a different first author (Cronbach, Rajaratnam, & Gleser, 1963; Gleser, Cronbach, & Rajaratnam, 1965; and Rajaratnam, Cronbach, & Gleser, 1965).

In 1964 Cronbach moved to Stanford. Shortly thereafter Harinder Nanda’s studies on interbattery reliability provided part of the motivation for the development of MGT. At about the same time, Cronbach, Schönemann, and McKie (1965) published their paper on stratified alpha, which was, in effect, a multivariate generalizability coefficient. Extending UGT to MGT was a huge undertaking that was uniquely attributable to the Cronbach team. This is surely one reason it took more than 10 years after the 1960–61 reports for Cronbach et al. (1972) to appear in print.

My Entrée into G Theory

Many investigators who use G theory in their research do so only after concluding that more conventional approaches seem inadequate. That was indeed the motivation that led me to G theory. In the late 1960s and early 1970s I served as one of many consultants on evaluations of various federal programs including the National Day Care Study. A distinguishing common characteristic of these studies was that the treatments were applied to whole entities (e.g., classrooms) and evaluated using certain measurement procedures. A very natural question to ask was, “How shall we estimate the reliability of classroom mean scores for these measurement procedures?” A number of discussions convinced virtually all of us that the problem was not getting an estimate; rather the problem was that we had too many estimates (specifically three), and no principled way to choose among them! With brash confidence, I set aside the summer of 1972 to resolve this paradox. It did not take that long, primarily because my luck exceeded my confidence! The library at SUNY at Stony Brook where I was a beginning assistant professor had a copy of the newly published Cronbach et al. (1972) book. After studying it night and day for a week, the answer was obvious. The three different estimates were related to different universes of generalization when class means were the objects of measurement, specifically (1) persons random and items fixed, (2) persons fixed and items random, and (3) both persons and items random. This “insight” (which was actually “baked into” G theory) eventually led to my first publication on G theory (Brennan, 1975). Shortly thereafter Michael Kane joined the faculty of education at Stony Brook, and I discovered that he and some of his former colleagues at the University of Illinois had been working on exactly the same problem in the context of student evaluations of teaching (see, for example, Kane, Gillmore, & Crooks, 1976, as well as Gillmore, Kane, & Naccarato, 1978). Our common interest in this problem led to a joint paper (Kane & Brennan, 1977). The lessons I learned in 1972 concerned the central importance of 1) clearly identifying the facets of interest and 2) specifying which facets are random and which are fixed. These lessons profoundly influenced virtually all of my subsequent thinking about measurement.
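The three estimates for class means can be illustrated with a minimal sketch. It assumes a balanced design in which persons are nested within classes and crossed with items, written (p:c) × i, and it takes the variance components as already estimated. The function name, the dictionary keys, and the numerical values are hypothetical and serve only as an illustration; they are not taken from the studies described above.

```python
def class_mean_coefficients(vc, n_p, n_i):
    """Generalizability coefficients for class means in a (p:c) x i design.

    vc  : dict of variance components with keys 'c', 'p:c', 'ci', 'pi:c'
          (hypothetical key names chosen for this sketch)
    n_p : number of persons per class
    n_i : number of items
    """
    persons = vc["p:c"] / n_p             # persons-within-class contribution
    items = vc["ci"] / n_i                # class-by-item contribution
    residual = vc["pi:c"] / (n_p * n_i)   # residual contribution

    # A fixed facet contributes to universe score variance;
    # a random facet contributes to relative error variance.
    cases = {
        "persons random, items fixed": (vc["c"] + items, persons + residual),
        "persons fixed, items random": (vc["c"] + persons, items + residual),
        "persons random, items random": (vc["c"], persons + items + residual),
    }
    return {
        name: universe / (universe + error)
        for name, (universe, error) in cases.items()
    }


# Illustrative (made-up) variance components.
example_vc = {"c": 0.04, "p:c": 0.20, "ci": 0.02, "pi:c": 0.60}
coefficients = class_mean_coefficients(example_vc, n_p=20, n_i=10)
for name, value in coefficients.items():
    print(f"{name}: {value:.3f}")
```

In this sketch the sum of universe score variance and relative error variance is identical in all three cases. What changes is which components are assigned to each part, so the coefficient is smallest when both persons and items are treated as random.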

Domain-referenced Testing

At the same time that Kane and I were working on our class means paper we were intrigued with the idea of using G theory to address issues surrounding the reliability of domain-referenced (originally called criterion-referenced or mastery) scores. This was a very hot topic in the 1970s, and remains relevant to this day. Our basic idea (Brennan & Kane, 1977a,b) was based on using $\Delta$ rather than $\delta$ in defining error variances and coefficients (or indices). Specifically, G coefficients ($E\rho^2$) incorporate relative error variance, $\sigma^2(\delta)$, and have a correlational interpretation. By contrast, our indices of dependability ($\Phi$) incorporate absolute error variance, $\sigma^2(\Delta)$, and do not usually have a correlational interpretation. (This work was later summarized and somewhat extended by Brennan, 1984.) The research that Kane and I did on domain-referenced scores and class means was so clearly co-equal that we flipped a coin to decide on first authorship. To follow blindly the alphabetize-by-last-name convention would have grossly misrepresented our relative contributions.
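The distinction between the two kinds of error variance can be made concrete with a short sketch for the simplest case, a persons-by-items (p × i) design with n_i items. It assumes the standard results for that design: relative error variance is the person-by-item component divided by n_i, and absolute error variance additionally includes the item component divided by n_i. The function names and the numerical values are hypothetical.

```python
def error_variances(var_i, var_pi, n_i):
    """Return (relative, absolute) error variance for a p x i design."""
    relative = var_pi / n_i               # only the interaction affects rank order
    absolute = (var_i + var_pi) / n_i     # item difficulty also affects score level
    return relative, absolute


def g_and_phi(var_p, var_i, var_pi, n_i):
    """Return (generalizability coefficient, index of dependability)."""
    relative, absolute = error_variances(var_i, var_pi, n_i)
    g_coefficient = var_p / (var_p + relative)
    phi = var_p / (var_p + absolute)
    return g_coefficient, phi


# Illustrative (made-up) variance components for a 30-item test.
g_coefficient, phi = g_and_phi(var_p=0.03, var_i=0.02, var_pi=0.15, n_i=30)
print(f"G coefficient: {g_coefficient:.3f}, index of dependability: {phi:.3f}")
```

Because absolute error variance can never be smaller than relative error variance in this design, the index of dependability can never exceed the G coefficient computed from the same components.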

Other Research

In retrospect, the 1970s was a decade in which researchers and practitioners were primarily learning about G theory by studying Cronbach et al. (1972), and beginning to apply it, primarily in the field of education and, to an extent, psychology. Some of this research extended into subsequent decades. Cronbach (1976) commented on using G theory in the design of educational measures. Joe and Woodward (1976) considered somewhat controversial approaches to maximizing MGT coefficients. Sirotnik and Wellington (1977) and Boodoo (1982) considered using incidence sampling as a methodology to estimate variance components. Brennan and Lockwood (1980) considered the use of G theory to examine the dependability of certain standard setting procedures, a topic that continues to be studied to this day. Another enduring topic is the evaluation of classroom teaching (e.g., Erlich & Borich, 1979; Erlich & Shavelson, 1976). Cronbach et al. (1972) noted that estimated variance components are themselves subject to sampling error, and they suggested using the jackknife as a procedure to address the problem. Smith (1978, 1982) also considered the variability of estimated variance components, but, with minor exceptions, this challenging topic was largely ignored until the turn of the century.

The 1980s and 1990s

The last two decades of the 20th century are particularly noteworthy for efforts to popularize G theory and to extend some aspects of the theory. In addition, the 1980s and 1990s witnessed many applications of G theory in numerous contexts.

Efforts to make G Theory more Accessible

Cronbach et al. (1972) recognized that their G theory book was “complexly organized and by no means simple to follow” (p. 3). By the early 1980s I decided to try to teach a simplified version of G theory for graduate students and measurement practitioners. With the assistance of several colleagues, I began the first of nearly 20 every-other-year G theory training sessions for the AERA and NCME Annual Conventions. My first effort at writing a simpler treatment of G theory (Brennan, 1977) was a paper that was rejected by a major journal; the editor described the paper as being “too propaedeutic.” Shortly thereafter Jay Millman, who was then President of NCME, asked me to write a monograph on G theory for publication by NCME. I agreed, but when I completed the monograph almost three years later, NCME was no longer interested in publishing Elements of Generalizability Theory. ACT, however, did publish it (Brennan, 1983, 1992a). I had long felt that a simpler treatment of G theory was not enough to get the theory used more widely. Researchers and practitioners needed a computer program. So, I designed a program called GENOVA (Crick & Brennan, 1983) that was coordinated with Elements of Generalizability Theory. At that time, however, my computer skills were not adequate for programming GENOVA. That task was undertaken by Joe Crick, a colleague of mine from graduate school at Harvard. GENOVA was written in Fortran and coded in such a way that it could quickly process virtually unlimited amounts of data using very little core, which was necessary given the limitations of mainframe computers available at that time.

Shavelson and Webb (1981) provided a review of G theory for the years 1973–1980. Actually, their paper is much more than a literature review; it also provides a summary of G theory that is still highly relevant. Webb and Shavelson (1981), as well as Webb, Shavelson, and Maddahian (1983), provided overviews of MGT in somewhat different contexts. Crocker and Algina (1986) devoted a chapter to an overview of G theory. Other overviews were provided by Algina (1989), Allal (1988, 1990), and Shavelson and Webb (1992). Shavelson, Webb, and Rowley (1989) provided a particularly readable journal article that summarizes G theory. In the same year, in the third edition of Educational Measurement, Feldt and Brennan (1989) devoted about one-third of their chapter on reliability to G theory. (See Haertel, 2006, for an updated and revised version.) Shortly thereafter Shavelson and Webb (1991) published Generalizability theory: A primer, which is still an excellent, accessible introduction to G theory. A year later, Brennan (1992b) provided a brief introduction to G theory in an NCME ITEMS module.

Theoretical Contributions

Cronbach et al. (1972) noted that G theory is part of validity. Stated differently, G theory “blurs” distinctions between reliability and validity. Still, validity aspects of G theory were largely ignored until Kane (1982) published a lengthy paper entitled A Sampling model for validity. That paper makes numerous, important contributions to the literature on both G theory and validity, including an elegant resolution of the reliability-validity paradox. In several respects, Kane (1982) took a big step forward in demystifying the blurring of reliability and validity. As such, Kane (1982) is still the primary treatment of validity from a G theory perspective, although Kane, Crooks, and Cohen (1999) is a noteworthy extension in the context of performance assessments.

One of the most enduring aspects of the measurement literature (including much of the G theory literature) is the focus given to reliability-like coefficients in the sense of coefficients that have a correlational interpretation, or something like it. Kane (1996) made a compelling argument that we need more meaningful and easily interpretable indices of measurement precision. His basic argument is that an SEM is more meaningfully evaluated relative to some tolerance for error, and he proposed using E/T indices. As illustrated previously, certain E/T indices can be transformed into reliability-like coefficients. In this author’s opinion, however, coefficients complicate interpretations and obfuscate the central importance of SEMs for sound decisions about measurement precision.

The development of educational assessments is grounded in the tradition of using the same table of content specifications (TOCS) to develop forms of an assessment. From the perspective of G theory, in the simplest case: 1) the cells in a TOCS are fixed categories in the sense that every form will have the same cells, and 2) the items in each cell are random in the sense that different forms will have different sets of items for each cell. The resulting (potential) set of forms is perhaps the simplest example of a multivariate UG, with the fixed multivariate facet being the cells in the TOCS. In effect, stratified alpha (Cronbach, Schönemann, & McKie, 1965) is a generalizability coefficient for this multivariate UG, although the Cronbach et al. (1965) derivation does not directly reflect the MGT conventions in Cronbach et al. (1972). By contrast, Jarjoura and Brennan (1982, 1983) and, to an extent, Kolen and Jarjoura (1984) provided a “full” MGT development of coefficients and error variances for the simple TOCS model. Their work provided a framework for extensions to more complicated TOCS models in later decades.

Not long after Cronbach et al. (1972) was published, Cardinet, Tourneur, and Allal (1976) suggested that G theory could be extended through the introduction of “face of differentiation” and “face of generalization.” (Doing so introduced some inconsistencies with other G theory literature.) Cardinet, Tourneur, and Allal (1981) and Cardinet and Tourneur (1985) illustrated and expanded upon these notions. Much later, Cardinet, Johnson, and Pini (2010) provided a monograph treatment that is coordinated with the computer program EduG.
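For readers who want to see the simple TOCS case in concrete terms, the following is a minimal sketch of stratified alpha. It uses the usual textbook form of the coefficient: one minus the sum over strata of stratum score variance times (1 minus that stratum's coefficient alpha), divided by total score variance. It assumes that each cell of the TOCS is a stratum containing at least two items and that the strata partition the items. The function names and the tiny data set are hypothetical, and the sketch does not follow the MGT notation of Cronbach et al. (1972).

```python
from statistics import variance


def coefficient_alpha(item_scores):
    """Coefficient alpha for a persons-by-items list of lists."""
    n_items = len(item_scores[0])
    item_vars = [variance([row[j] for row in item_scores]) for j in range(n_items)]
    total_var = variance([sum(row) for row in item_scores])
    return n_items / (n_items - 1) * (1 - sum(item_vars) / total_var)


def stratified_alpha(item_scores, strata):
    """Stratified alpha; strata is a list of lists of item (column) indices,
    one list per cell of the table of content specifications."""
    total_var = variance([sum(row) for row in item_scores])
    error_part = 0.0
    for columns in strata:
        stratum_scores = [[row[j] for j in columns] for row in item_scores]
        stratum_var = variance([sum(row) for row in stratum_scores])
        error_part += stratum_var * (1 - coefficient_alpha(stratum_scores))
    return 1 - error_part / total_var


# Made-up dichotomous data: 6 persons by 4 items, two content cells of 2 items.
scores = [
    [1, 1, 1, 0],
    [1, 0, 1, 1],
    [0, 0, 1, 0],
    [1, 1, 0, 1],
    [0, 0, 0, 0],
    [1, 1, 1, 1],
]
print(stratified_alpha(scores, strata=[[0, 1], [2, 3]]))
```

When all items are placed in a single stratum, the expression reduces algebraically to ordinary coefficient alpha, which is a convenient check on an implementation.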

Applied Contributions

The 1980s witnessed a mini-boom in G theory analyses and publicity. In particular, practitioners realized that understanding the results of a performance test necessitated grappling with two or more facets simultaneously, especially tasks and raters. Nußbaum (1984) is a particularly good, early example of using MGT in performance testing. Cronbach et al. (1972) illustrated the applicability of G theory largely by reanalyzing already published data in the psychology and education literature; they did not collect their own data. This after-the-fact examination of available data is at least partly inconsistent with the “best practices” G theory principle of defining a UAO before collecting data. This is an enduring problem with much of the applied research in G theory, even to this day. One exception is Brennan, Gao, and Colton (1995), who used MGT to assist in the design of an assessment program in listening and writing. Brennan and Johnson (1995) and Brennan (1996) addressed theoretical and applied issues in performance testing. The relevance of G theory in performance testing is especially well illustrated by Shavelson and his colleagues in numerous presentations and papers involving science and mathematics (see, for example, Shavelson, Baxter, & Pine, 1991, 1992; Shavelson, Baxter, & Gao, 1993; Ruiz-Primo, Baxter, & Shavelson, 1993; and Gao, Brennan, & Shavelson, 1994). Lane, Liu, Ankenmann, and Stone (1996) treated performance assessment in mathematics; Kreiter, Brennan, and Lee (1998), as well as Clauser, Clyman, and Swanson (1999), treated performance assessments in medical education; Bachman, Lynch, and Mason (1994) considered testing of speaking in a foreign language; and Ruiz-Primo, Baxter, and Shavelson (1993) considered the stability of scores for performance assessments. Performance assessments motivated Cronbach, Linn, Brennan, and Haertel (1995) to state:

Assessments depart from traditional measurements in ways that require extensions and modifications of generalizability analysis…. Assessments pose problems that reach beyond available psychometric theory. (p. 1)

The Cronbach et al. (1995) report and a published revised version (Cronbach, Linn, Brennan, & Haertel, 1997) suggested a number of problems that needed to be researched and proposed some recommended solutions. These papers emphasized the importance of estimates of absolute standard errors of measurement for many of the types of decisions that are typically made with performance assessments. Also, these papers urged that an analysis of error for group means explicitly recognize that pupils are nested in classes and schools. Whether to treat pupils as fixed or random in such analyses is discussed as well (see, also, Brennan, 1995a).

The 1980s and 1990s witnessed a dramatic increase in the range of applications addressed using G theory including: program evaluation (e.g., Gillmore, 1983), counseling and development (Webb, Rowley, & Shavelson, 1988), setting performance standards (Brennan, 1995b), job performance (Webb, Shavelson, Kim, & Chen, 1989), and aspects of physiology (e.g., Llabre, Ironson, Spitzer, Gellman, Weidler, & Schneiderman, 1988). Also, Marcoulides and Goldstein (1990, 1992) considered optimizing G theory results given resource constraints. Since the 1970s, and especially in the 1980s and 1990s, Shavelson and Webb made numerous contributions to the applied G theory literature, probably more so than any other two researchers. Although only some of their publications are cited in this chapter, their extensive contributions are especially noteworthy.

The 21st Century

The first two decades of the 21st century witnessed some theoretical developments in G theory, but mainly this century has seen summaries of G theory, occasional extensive treatments of the theory, more tools for G theory analyses, and more applications of G theory in a range of different disciplines beyond just education and psychology.

G Theory Book and GENOVA Suite

In the early 1990s Springer invited me to write a G theory book. I accepted that invitation with the understanding that Springer first publish a book on equating that Michael Kolen and I were writing at that time. It wasn’t until 2001 that the G theory book, Generalizability theory (Brennan, 2001b), appeared. It took four years to write because my intent was ambitious. I wanted the book to be as up to date as possible, theoretically sound, and well-integrated (in both concepts and notation). This required research into certain topics (especially statistical matters) for which literature was lacking or obscure. Also, since MGT is rightly viewed as the “whole” of G theory, I somewhat expanded the conceptual and notational framework for MGT in Cronbach et al. (1972). I wanted the book to serve as a resource for both students and practitioners. To facilitate this goal, I spent months writing two new computer programs, urGENOVA (Brennan, 2001d) for unbalanced univariate designs and mGENOVA (Brennan, 2001c) for multivariate designs. These two programs along with GENOVA constitute the GENOVA suite. It was used for all examples in the book and is still widely used for generalizability analyses. Block and Norman (2017) provide a Windows wrapper for urGENOVA.

In writing Generalizability theory, I wanted to help correct some misconceptions about G theory (Brennan, 2000a) and foster a better understanding of reliability-like coefficients (see Brennan, 2001a) such as generalizability coefficients. I much preferred Kane’s (1996) error/tolerance ratios to such coefficients. I even considered deleting generalizability coefficients from the book, but I feared that doing so would “turn off” practitioners who could benefit from using G theory. My compromise was to give less emphasis to coefficients and more emphasis to relative and absolute error variance, especially the latter and particularly its square root, the standard error of measurement (SEM) for “absolute” decisions, $\sigma(\Delta)$. These perspectives are similar to those stated somewhat later by Cronbach (2004):

Coefficients are a crude device that do not bring to the surface many subtleties implied by variance components. In particular, the interpretations being made in current assessments are best evaluated through use of a standard error of measurement. (p. 394)

Later chapters of Brennan (2001b) provide fairly simple equations for conditional absolute SEMs, denoted $\sigma(\Delta_p)$. (See, also, Brennan, 1998.) Assuming persons are the objects of measurement, the basic idea is that $\sigma(\Delta_p)$ is the within-person SEM. The simplest example is Lord’s (1955, 1957, 1959) SEM that motivated the Cronbach team to draw the important distinction between relative and absolute error over a half century ago. I have come to believe that often estimates of $\sigma(\Delta_p)$ are the single most important G theory statistics. (Jarjoura, 1986, treats the much more difficult task of estimating conditional relative SEMs.)
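The idea of a within-person SEM can be sketched for the simplest case, a single person's scores on n_i items in a p × i design. The sketch assumes that the conditional absolute SEM in the mean-score metric is the standard error of the person's mean over items, and it compares that quantity with the binomial-error form of Lord's SEM re-expressed in the mean-score metric for dichotomous items. The function names and the response vector are hypothetical.

```python
import math


def conditional_absolute_sem(item_scores):
    """Within-person absolute SEM (mean-score metric) for one person's item scores."""
    n_items = len(item_scores)
    person_mean = sum(item_scores) / n_items
    sum_of_squares = sum((x - person_mean) ** 2 for x in item_scores)
    # Variance of the item scores divided by n_items gives the error variance of the mean.
    return math.sqrt(sum_of_squares / (n_items * (n_items - 1)))


def lord_sem_mean_metric(num_correct, n_items):
    """Lord's binomial-error SEM for dichotomous items, in the mean-score metric."""
    proportion = num_correct / n_items
    return math.sqrt(proportion * (1 - proportion) / (n_items - 1))


# Made-up responses of one examinee to 10 dichotomous items (7 correct).
responses = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]
print(conditional_absolute_sem(responses))
print(lord_sem_mean_metric(sum(responses), len(responses)))
```

For dichotomous items the two functions agree, which is the sense in which Lord's SEM is the simplest example of a conditional absolute SEM.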

Variability of Estimated Variance Components

From a statistical perspective, variance components are at the heart of G theory. As noted earlier, however, estimated variance components (EVCs) are subject to sampling error, which some authors view as the “Achilles heel” of G theory. Cronbach et al. (1972) recognized that sampling errors were an issue and considered procedures for estimating standard errors (SEs) of EVCs and associated confidence intervals (see, also, Bell, 1986). The Cronbach et al. (1972) discussion was limited, though, largely because the statistical literature on the topic was sparse at that time. That literature expanded considerably in the next two decades (see Searle, Casella, & McCulloch, 1992). Since then, various authors have addressed SEs of EVCs and/or associated confidence intervals, including Betebenner (1998) and Gao and Brennan (2001). Brennan (2001b, chap. 6) devoted an entire chapter to variability of EVCs, without coming close to exhausting it.

For G theory, a significant problem with most of the literature was that statistical procedures that assume normality seem problematic for typical testing data (particularly dichotomous data). An obvious non-parametric alternative is the jackknife (discussed in Cronbach et al., 1972; see, also, Brennan, 2001b, pp. 182–185), but it becomes prohibitive with even rather small datasets and/or complex designs. The bootstrap would seem to be an obvious alternative to the jackknife. A straightforward application of the bootstrap in G theory presents two challenges, however. First, for any particular design, there are as many possible bootstrap procedures as there are facets (including the objects of measurement “facet”), as well as their interactions. Second, the seemingly obvious ways to perform bootstrapping give different sets of biased results. For the $p \times i$ design, Wiley (2000) provided bootstrap-based results that are unbiased no matter which facet(s) is/are bootstrapped (see, also, Brennan, 2001b). Brennan (2007) extended Wiley’s approach to any balanced design, which Tong and Brennan (2007) illustrated using simulations for the $p \times i \times h$ and $p \times (i:h)$ designs.
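To make the bootstrap discussion concrete, the sketch below estimates variance components for a balanced p × i design from the usual ANOVA mean squares and then resamples persons with replacement to approximate standard errors. This is only the naive "resample persons" procedure, one of the seemingly obvious approaches that the text describes as giving biased results; it does not implement the adjustments of Wiley (2000) or Brennan (2007). The function names, the number of replications, the seed, and the data are hypothetical.

```python
import random
from statistics import stdev


def estimate_variance_components(data):
    """ANOVA estimates of the p, i, and pi variance components (balanced p x i design)."""
    n_p, n_i = len(data), len(data[0])
    grand = sum(sum(row) for row in data) / (n_p * n_i)
    p_means = [sum(row) / n_i for row in data]
    i_means = [sum(row[j] for row in data) / n_p for j in range(n_i)]
    ms_p = n_i * sum((m - grand) ** 2 for m in p_means) / (n_p - 1)
    ms_i = n_p * sum((m - grand) ** 2 for m in i_means) / (n_i - 1)
    ss_pi = sum(
        (data[p][j] - p_means[p] - i_means[j] + grand) ** 2
        for p in range(n_p)
        for j in range(n_i)
    )
    ms_pi = ss_pi / ((n_p - 1) * (n_i - 1))
    return {"p": (ms_p - ms_pi) / n_i, "i": (ms_i - ms_pi) / n_p, "pi": ms_pi}


def bootstrap_persons_se(data, n_boot=200, seed=0):
    """Naive bootstrap over persons only; returns the SD of each replicated estimate."""
    rng = random.Random(seed)
    n_p = len(data)
    replicates = []
    for _ in range(n_boot):
        resample = [data[rng.randrange(n_p)] for _ in range(n_p)]
        replicates.append(estimate_variance_components(resample))
    return {key: stdev([rep[key] for rep in replicates]) for key in replicates[0]}


# Made-up dichotomous data: 6 persons by 4 items.
scores = [
    [1, 1, 1, 0],
    [1, 0, 1, 1],
    [0, 0, 1, 0],
    [1, 1, 0, 1],
    [0, 0, 0, 0],
    [1, 1, 1, 1],
]
print(estimate_variance_components(scores))
print(bootstrap_persons_se(scores))
```

Resampling items instead of persons, or both at once, defines a different bootstrap procedure, which illustrates the first challenge noted above.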

Applied Contributions

G theory is uniquely suited to addressing the relative contributions of multiple facets to measurement precision, which is of considerable importance in numerous areas, particularly performance testing (see Brennan, 2000b). That is likely the reason for the widespread application of G theory with performance assessments since Cronbach et al. (1972) was published. Examples in the 21st century include Clauser, Harik, and Clyman (2000) who studied computer-automated scoring; Raymond, Harik, and Clauser (2011) who examined adjustments for rater effects; Gadbury-Amyot, McCracken, Woldt, and Brennan (2012) who considered portfolio assessments in dental school; Gao, Brennan, and Guo (2015) who studied large-scale writing assessments; and Vispoel, Morris, and Kilinc (2018) who considered designing and evaluating psychological assessments. The use of G theory to study standard setting procedures also has been an enduring topic. See, for example, Clauser, Margolis, and Clauser (2014) as well as Clauser, Kane, and Clauser (2020).

In the 20th century, most applications of G theory were univariate (UGT). In the 21st century, however, there has been an increasing use of MGT. Twenty-first century applications employing MGT, or at least aspects of its framework, include: Li and Brennan (2007) who studied a large-scale reading comprehension test; Clauser, Harik, and Margolis (2006) who examined a performance assessment of physicians’ clinical skills; Raymond and Jiang (2020) who considered indices of subscore utility; Clauser, Swanson, and Harik (2002), as well as Wu and Tzou (2015), who studied standard-setting procedures; Powers and Brennan (2009), as well as Kim, Lee, and Brennan (2016), who examined mixed format tests; and Yin (2005) who studied a licensure examination for lawyers.

Concluding Comments

Predicting the future is a risky proposition, but the trend line of the last two decades certainly suggests that G theory will be used in an increasing number of contexts and disciplines. Also, since MGT is essentially the “whole” of G theory, I believe that both applications and theoretical extensions of MGT are likely to be noteworthy, particularly with respect to modelling tables of content specifications for tests. Although G theory is a very broadly defined psychometric model, certain aspects of G theory deserve further research and clarification. Five such areas are the following.

• Almost all of the G theory literature assumes that the scores of interest are raw scores (or weighted raw scores). In many contexts, however, primary interest is on scale scores that are non-linear transformations of raw scores. This issue is of considerable concern for the estimation of CSEMs (see, for example, Kolen, Hanson, & Brennan, 1992).
• The random sampling assumption in G theory is almost never strictly true, which challenges interpretations of variance components, their estimates, and functions of them. Kane (2002) considers several perspectives on this matter.
• Applications of G theory sometimes have one or more facets that is/are “hidden” in the sense that they are unacknowledged and/or there is only one condition of each of them in a G study (e.g., using a single rater to evaluate examinee responses to a prompt). In such cases, interpreting results is quite challenging. Brennan (2017) called this “The problem of one.”
• Cronbach et al. (1972) acknowledged that in G theory the conditions of a random facet are unordered. Clearly, this restriction is limiting and can lead to challenging problems, especially if occasion is a facet (see, for example, Rogosa & Ghandour, 1991).
• There have been some attempts at integrating G theory and aspects of other psychometric models, particularly item response theory (e.g., Bock, Brennan, & Muraki, 2002; Brennan, 2006; Brennan, 2007; Briggs & Wilson, 2007; and Kolen & Harris, 1987), but much more research is needed.

Whatever the future holds for G theory, it seems clear to this author that the basic features of G theory will survive. In a very real sense, G theory is embodied in the scientific method wherever and however it is practiced. Terminology within and across disciplines may differ, but most of the basic principles in G theory are pervasive, even when they are unrecognized or ignored. Recently, for example, Shavelson and Webb (2019) used terms and concepts from G theory to frame a discussion of a book entitled Generalizing from educational research.

Note

1 The author is grateful to Richard Shavelson and Michael Kane, who provided very helpful comments on a draft of this chapter.

References Algina, J. (1989). Elements of classical reliability theory and generalizability theory. Advances in Social Science Methodology, 1, 137–169. Allal, L. (1988). Generalizability theory. In J. P. Keeves (Ed.), Educational research, methodology, and measurement (pp. 272–277). New York: Pergamon. Allal, L. (1990). Generalizability theory. In H. J. Walberg, & G. D. Haertel (Eds.), The international encyclopedia of educational evaluation (pp. 274–279). Oxford, UK: Pergamon. Bachman, L. F., Lynch, B. K., & Mason, M. (1994). Investigating variability in tasks and rater judgements in a performance test of foreign language speaking. Language Testing, 12, 239–257. Bell, J. F. (1986). Simultaneous confidence intervals for the linear functions of expected mean squares used in generalizability theory. Journal of Educational Statistics, 11, 197–205. Betebenner, D. W. (1998, April). Improved confidence interval estimation for variance components and error variances in generalizability theory. Paper presented at the Annual Meeting of the American Educational Research Association, San Diego, CA. Block, R., & Norman, G. (2017). G_String: A Windows Wrapper for urGENOVA. [Computer software and manual.] McMaster University, Hamilton ON, Canada. (Retrieved from http://fhsperd.mcmaster.ca/g_string/index.html) Bock, R. D., Brennan, R. L., & Muraki, E. (2002). The information in multiple ratings. Applied Psychological Measurement, 26, 364–375. Boodoo, G. M. (1982). On describing an incidence sample. Journal of Educational Statistics, 7(4), 311–331. Brennan, R. L. (1975). The calculation of reliability from a split-plot factorial design. Educational and Psychological Measurement, 35, 779–788. Brennan, R. L. (1977). Generalizability analyses: Principles and procedures. (ACT Technical Bulletin No. 26). Iowa City, IA: ACT, Inc. (Revised August 1978). Brennan, R. L. (1983). Elements of generalizability theory. Iowa City, IA: ACT, Inc. Brennan, R. L. (1984). 
Estimating the dependability of the scores. In R. A. Berk (Ed.), A guide to criterion-referenced test construction (pp. 292–334). Baltimore, MD: The Johns Hopkins University Press. Brennan, R. L. (1992a). Elements of generalizability theory (rev. ed.). Iowa City, IA: ACT, Inc. Brennan, R. L. (1992b). Generalizability theory. Educational Measurement: Issues and Practice, 11(4), 27–34.

Generalizability Theory 225

Brennan, R. L. (1995a). The conventional wisdom about group mean scores. Journal of Educational Measurement, 14, 385–396. Brennan, R. L. (1995b). Standard setting from the perspective of generalizability theory. In Proceedings of the joint conference on standard setting for large-scale assessments (Volume II). Washington, DC: National Center for Education Statistics and National Assessment Governing Board. Brennan, R. L. (1996). Generalizability of performance assessments. In G. W. Phillips (Ed.), Technical issues in performance assessments. Washington, DC: National Center for Education Statistics. Brennan, R. L. (1997). A perspective on the history of generalizability theory. Educational Measurement: Issues and Practice, 16(4), 14–20. Brennan, R. L. (1998). Raw-score conditional standard errors of measurement in generalizability theory. Applied Psychological Measurement, 22, 307–331. Brennan, R. L. (2000a). (Mis)conceptions about generalizability theory. Educational Measurement: Issues and Practice, 19(1), 5–10. Brennan, R. L. (2000b) Performance assessments from the perspective of generalizability theory. Applied Psychological Measurement, 24, 339–353. Brennan, R. L. (2001a). An essay on the history and future of reliability from the perspective of replications. Journal of Educational Measurement, 38, 295–317. Brennan, R. L. (2001b). Generalizability theory. New York: Springer-Verlag. Brennan, R. L. (2001c). Manual for mGENOVA. Iowa City, IA: Iowa Testing Programs, University of Iowa. [Computer software and manual.] (Retrieved from https://educa tion.uiowa.edu/casma) Brennan, R. L. (2001d). Manual for urGENOVA. Iowa City, IA: Iowa Testing Programs, University of Iowa. [Computer software and manual.] (Retrieved from https://educa tion.uiowa.edu/casma) Brennan, R. L. (2006). Perspectives on the evolution and future of educational measurement. In R. L. Brennan (Ed.), Educational measurement (4th ed., pp. 1–16). Westport, CT: American Council on Education/Praeger. 
Brennan, R. L. (2007). Integration of models. In C. Rao and S. Sinharey (Eds.), Handbook of statistics: Psychometrics (Vol. 26) (pp. 1095–1098). Amsterdam: Elsevier. Brennan, R. L. (2010a). Generalizability theory. In P. Peterson, E. Baker, & B. McGaw (Eds.), International Encyclopedia of Education (3rd ed.), vol. 4, 61–68. Brennan, R. L. (2010b). Generalizability theory and classical test theory. Applied Measurement in Education, 24, 1–21. Brennan, R. L. (January, 2017). Using G Theory to Examine Confounded Effects: “The Problem of One” (CASMA Research Report No. 51). Iowa City, IA: Center for Advanced Studies in Measurement and Assessment, The University of Iowa. (Retrieved from https:// education.uiowa.edu/casma) Brennan, R. L. (January, 2020). Generalizability Theory References: The First Sixty Years. (CASMA Research Report No. 53). Iowa City, IA: Center for Advanced Studies in Measurement and Assessment, The University of Iowa. (Retrieved from https://educa tion.uiowa.edu/casma) Brennan, R. L., Gao, X., & Colton, D. A. (1995). Generalizability analyses of Work Keys listening and writing tests. Educational and Psychological Measurement, 55, 157– 176. Brennan, R. L., & Johnson, E. G. (1995). Generalizability of performance assessments. Educational Measurement: Issues and Practice, 14(4), 9–12.

226 Robert L. Brennan

Brennan, R. L., & Kane, M. T. (1977a). An index of dependability for mastery tests. Journal of Educational Measurement, 14, 277–289. Brennan, R. L., & Kane, M. T. (1977b). Signal/noise ratios for domain-referenced tests. Psychometrika, 42, 609–625. Brennan, R. L., & Lockwood, R. E. (1980). A comparison of the Nedelsky and Angoff cutting score procedures using generalizability theory. Applied Psychological Measurement, 4, 219–240. Briggs, D. C., & Wilson, M. (2007). Generalizability in item response modeling. Journal of Educational Measurement, 44, 131–155. Burt, C. (1936). The analysis of examination marks. In P. Hartog & E. C. Rhodes (Eds.), The marks of examiners. London: Macmillan. Burt, C. (1955). Test reliability estimated by analysis of variance. British Journal of Statistical Psychology, 8, 103–118. Cardinet, J., Johnson. S., & Pini, G. (2010). Applying generalizability theory using EduG. New York, Routledge. Cardinet, J., & Tourneur, Y. (1985). Assurer la measure. New York: Peter Lang. Cardinet, J., Tourneur, Y., & Allal, L. (1976). The symmetry of generalizability theory: Applications to educational measurement. Journal of Educational Measurement, 13, 119–135. Cardinet, J., Tourneur, Y., & Allal, L. (1981). Extension of generalizability theory and its applications in educational measurement. Journal of Educational Measurement, 18, 183–204. Clauser, B. E., Clyman, S. G., & Swanson, D. B. (1999). Components of rater error in a complex performance assessment. Journal of Educational Measurement, 36, 29–45. Clauser, B. E., Harik, P., & Clyman, S. G. (2000). The generalizability of scores for a perfomance assessment scored with a computer-automated scoring system. Journal of Educational Measurement, 37, 245–261. Clauser, B. E., Harik, P., & Margolis, M. J. (2006). A multivariate generalizability analysis of data from a performance assessment of physicians’ clinical skills. Journal of Educational Measurement, 43, 173–191. Clauser, B. E., Kane, M. 
T., & Clauser, J. C. (2020). Examining the precision of cut scores within a generalizability theory framework: A closer look at the item effect. Journal of Educational Measurement, 57, 159–184. Clauser, J. C., Margolis, M. J., & Clauser, B. E. (2014). An examination of the replicability of Angoff standard setting results within a generalizability theory framework. Journal of Educational Measurement, 51, 127–140. Clauser, B. E., Swanson, D. B., & Harik, P. (2002). Multivariate generalizability analysis of the impact of training and examinee performance information on judgments made in an Angoff-style standard-setting procedure. Journal of Educational Measurement, 39, 269–290. Cornfield, J., & Tukey, J. W. (1956). Average values of mean squares in factorials. Annals of Mathematical Statistics, 27, 907–949. Crick, J. E., & Brennan, R. L. (1983). Manual for GENOVA: A generalized analysis of variance system (American College Testing Technical Bulletin No. 43). Iowa City, IA: ACT, Inc. Crocker, L., & Algina, J. (1986). Introduction to classical and modern test theory. New York: Holt. Cronbach, L. J. (1947). Test “reliability” Its meaning and determination. Psychometrika, 12, 1–16. Cronbach, L. J. (1951). Coefficient alpha and the internal structure of tests. Psychometrika, 16, 292–334.

Generalizability Theory 227

Cronbach, L. J. (1976). On the design of educational measures. In D. N. M. de Gruijter & L. J. T. van der Kamp (Eds.), Advances in psychological and educational measurement (pp. 199–208). New York: Wiley.
Cronbach, L. J. (1989). Lee J. Cronbach. In G. Lindzey (Ed.), A history of psychology in autobiography (Vol. VIII). Stanford, CA: Stanford University Press.
Cronbach, L. J. (1991). Methodological studies – A personal retrospective. In R. E. Snow & D. E. Wiley (Eds.), Improving inquiry in social science: A volume in honor of Lee J. Cronbach (pp. 385–400). Hillsdale, NJ: Erlbaum.
Cronbach, L. J. (2004). My current thoughts on coefficient alpha and successor procedures. (Editorial assistance provided by R. Shavelson.) Educational and Psychological Measurement, 64, 391–418.
Cronbach, L. J., Gleser, G. C., Nanda, H., & Rajaratnam, N. (1972). The dependability of behavioral measurements: Theory of generalizability for scores and profiles. New York: Wiley.
Cronbach, L. J., Linn, R. L., Brennan, R. L., & Haertel, E. (1995, Summer). Generalizability analysis for educational assessments (Evaluation Comment). Los Angeles: University of California, Center for Research on Evaluation, Standards, and Student Testing.
Cronbach, L. J., Linn, R. L., Brennan, R. L., & Haertel, E. (1997). Generalizability analysis for performance assessments of student achievement or school effectiveness. Educational and Psychological Measurement, 57, 373–399.
Cronbach, L. J., Rajaratnam, N., & Gleser, G. C. (1963). Theory of generalizability: A liberalization of reliability theory. British Journal of Statistical Psychology, 16, 137–163.
Cronbach, L. J., Schönemann, P., & McKie, T. D. (1965). Alpha coefficients for stratified-parallel tests. Educational and Psychological Measurement, 25, 291–312.
Crump, S. L. (1946). The estimation of variance components in analysis of variance. Biometrics Bulletin, 2, 7–11.
Ebel, R. L. (1951). Estimation of the reliability of ratings. Psychometrika, 16, 407–424.
Eisenhart, C. (1947). The assumptions underlying analysis of variance. Biometrics, 3, 1–21.
Erlich, O., & Borich, C. (1979). Occurrence and generalizability of scores on a classroom interaction instrument. Journal of Educational Measurement, 16, 11–18.
Erlich, O., & Shavelson, R. J. (1976). Application of generalizability theory to the study of teaching (Technical Report No. 76-79-1). Beginning Teacher Evaluation Study, Far West Laboratory, San Francisco.
Feldt, L. S., & Brennan, R. L. (1989). Reliability. In R. L. Linn (Ed.), Educational Measurement (3rd ed., pp. 105–146). New York: American Council on Education and Macmillan.
Finlayson, D. S. (1951). The reliability of marking essays. British Journal of Educational Psychology, 35, 143–162.
Fisher, R. A. (1925). Statistical methods for research workers. Edinburgh: Oliver & Boyd.
Gadbury-Amyot, C. C., McCracken, M. S., Woldt, J. L., & Brennan, R. (2012). Implementation of portfolio assessment of student competence in two dental school populations. Journal of Dental Education, 76, 1559–1571.
Gao, X., & Brennan, R. L. (2001). Variability of estimated variance components and related statistics in a performance assessment. Applied Measurement in Education, 14, 191–203.
Gao, X., Brennan, R. L., & Guo, F. (2015, August). Modeling measurement facets and assessing generalizability in a large-scale writing assessment. GMAC Research Reports, RR-15-01. Graduate Management Admission Council, Reston, Virginia.
Gao, X., Brennan, R. L., & Shavelson, R. J. (1994, April). Estimating generalizability of matrix-sampled science performance assessments. Paper presented at the Annual Meeting of the American Educational Research Association, New Orleans.

228 Robert L. Brennan

Gillmore, G. M. (1983). Generalizability theory: Applications to program evaluation. In L. J. Fyans (Ed.), New directions for testing and measurement: Generalizability theory: Inferences and practical applications (No. 18, pp. 3–16). San Francisco, CA: Jossey-Bass.
Gillmore, G. M., Kane, M. T., & Naccarato, R. W. (1978). The generalizability of student ratings of instruction: Estimation of the teacher and course components. Journal of Educational Measurement, 15, 1–14.
Gleser, G. C., Cronbach, L. J., & Rajaratnam, N. (1965). Generalizability of scores influenced by multiple sources of variance. Psychometrika, 30, 395–418.
Gulliksen, H. (1950). Theory of mental tests. New York: Wiley. [Reprinted by Lawrence Erlbaum Associates, Hillsdale, NJ, 1987.]
Haertel, E. H. (2006). Reliability. In R. L. Brennan (Ed.), Educational measurement (4th ed., pp. 65–110). Westport, CT: American Council on Education/Praeger.
Haggard, E. A. (1958). Intraclass correlation and the analysis of variance. New York: Dryden.
Hoyt, C. J. (1941). Test reliability estimated by analysis of variance. Psychometrika, 6, 153–160.
Jarjoura, D. (1986). An estimator of examinee-level measurement error variance that considers test form difficulty adjustments. Applied Psychological Measurement, 10, 175–186.
Jarjoura, D., & Brennan, R. L. (1982). A variance components model for measurement procedures associated with a table of specifications. Applied Psychological Measurement, 6, 161–171.
Jarjoura, D., & Brennan, R. L. (1983). Multivariate generalizability models for tests developed according to a table of specifications. In L. J. Fyans (Ed.), New directions for testing and measurement: Generalizability theory: Inferences and practical applications (No. 18, pp. 83–101). San Francisco, CA: Jossey-Bass.
Joe, G. W., & Woodward, J. A. (1976). Some developments in multivariate generalizability. Psychometrika, 41(2), 205–217.
Kane, M. T. (1982). A sampling model for validity. Applied Psychological Measurement, 6, 125–160.
Kane, M. T. (1996). The precision of measurements. Applied Measurement in Education, 9, 355–379.
Kane, M. T. (2002). Inferences about variance components and reliability-generalizability coefficients in the absence of random sampling. Journal of Educational Measurement, 39, 165–181.
Kane, M. T. (2011). The errors in our ways. Journal of Educational Measurement, 48, 12–30.
Kane, M. T., & Brennan, R. L. (1977). The generalizability of class means. Review of Educational Research, 47, 267–292.
Kane, M. T., Crooks, T. J., & Cohen, A. (1999). Validating measures of performance. Educational Measurement: Issues and Practice, 18(2), 5–17.
Kane, M. T., Gillmore, G. M., & Crooks, T. J. (1976). Student evaluations of teaching: The generalizability of class means. Journal of Educational Measurement, 13, 171–183.
Kim, S. Y., Lee, W., & Brennan, R. L. (2016, December). Reliability of mixed-format composite scores involving raters: A multivariate generalizability theory approach. In M. J. Kolen & W. Lee (Eds.), Mixed-Format Tests: Psychometric Properties with a Primary Focus on Equating (CASMA Monograph 2.4). Iowa City, IA: Center for Advanced Studies in Measurement and Assessment, The University of Iowa. (Retrieved from https://education.uiowa.edu/casma)
Kolen, M. J., Hanson, B. A., & Brennan, R. L. (1992). Conditional standard errors of measurement for scale scores. Journal of Educational Measurement, 29, 285–307.
Kolen, M. J., & Harris, D. J. (1987, April). A multivariate test theory model based on item response theory and generalizability theory. Paper presented at the Annual Meeting of the American Educational Research Association, Washington, DC.


Kolen, M. J., & Jarjoura, D. (1984). Item profile analysis for tests developed according to a table of specifications. Applied Psychological Measurement, 8, 219–230.
Kreiter, C. D., Brennan, R. L., & Lee, W. (1998). A generalizability study of a new standardized rating form used to evaluate students’ clinical clerkship performance. Academic Medicine, 73, 1294–1298.
Kuder, G. F., & Richardson, M. W. (1937). The theory of estimation of test reliability. Psychometrika, 2, 151–160.
Lane, S., Liu, M., Ankenmann, R. D., & Stone, C. A. (1996). Generalizability and validity of a mathematics performance assessment. Journal of Educational Measurement, 33, 71–92.
Li, D., & Brennan, R. L. (2007, August). A multi-group generalizability analysis of a large-scale reading comprehension test. (CASMA Research Report No. 25). Iowa City, IA: Center for Advanced Studies in Measurement and Assessment, The University of Iowa. (Retrieved from https://education.uiowa.edu/casma)
Lindquist, E. F. (1953). Design and analysis of experiments in psychology and education. Boston, MA: Houghton-Mifflin.
Llabre, M. M., Ironson, G. H., Spitzer, S. B., Gellman, M. D., Weidler, D. J., & Schneiderman, N. (1988). How many blood pressure measurements are enough?: An application of generalizability theory to the study of blood pressure reliability. Psychophysiology, 25, 97–106.
Lord, F. M. (1955). Estimating test reliability. Educational and Psychological Measurement, 15, 325–336.
Lord, F. M. (1957). Do tests of the same length have the same standard error of measurement? Educational and Psychological Measurement, 17, 510–521.
Lord, F. M. (1959). Tests of the same length do have the same standard error of measurement. Educational and Psychological Measurement, 19, 233–239.
Lord, F. M. (1962). Test reliability: A correction. Educational and Psychological Measurement, 22, 511–512.
Lord, F. M., & Novick, M. R. (1968). Statistical theories of mental test scores. Reading, MA: Addison-Wesley.
Loveland, E. H. (1952). Measurement of factors affecting test-retest reliability. Unpublished doctoral dissertation, University of Tennessee.
Marcoulides, G. A., & Goldstein, Z. (1990). The optimization of generalizability studies with resource constraints. Educational and Psychological Measurement, 50, 761–768.
Marcoulides, G. A., & Goldstein, Z. (1992). The optimization of multivariate generalizability studies with budget constraints. Educational and Psychological Measurement, 52, 301–308.
Medley, D. M., Mitzel, H. E., & Doi, A. N. (1956). Analysis of variance models and their use in a three-way design without replication. Journal of Experimental Education, 24, 221–229.
Nußbaum, A. (1984). Multivariate generalizability theory in educational measurement: An empirical study. Applied Psychological Measurement, 8, 219–230.
Pilliner, A. E. G. (1952). The application of analysis of variance to problems of correlation. British Journal of Psychology, Statistical Section, 5, 31–38.
Powers, S., & Brennan, R. L. (2009, September). Multivariate generalizability analyses of mixed-format exams. (CASMA Research Report No. 29). Iowa City, IA: Center for Advanced Studies in Measurement and Assessment, The University of Iowa. (Retrieved from https://education.uiowa.edu/casma)


Rajaratnam, N., Cronbach, L. J., & Gleser, G. C. (1965). Generalizability of stratified-parallel tests. Psychometrika, 30, 39–56.
Raymond, M. R., Harik, P., & Clauser, B. E. (2011). The impact of statistically adjusting for rater effects on conditional standard errors of performance ratings. Applied Psychological Measurement, 35(3), 235–246.
Raymond, M. R., & Jiang, Z. (2020). Indices of subscore utility for individuals and subgroups based on multivariate generalizability theory. Educational and Psychological Measurement, 80(1), 67–90.
Rogosa, D., & Ghandour, G. (1991). Statistical models for behavioral observations. Journal of Educational Statistics, 3, 157–252.
Ruiz-Primo, M. A., Baxter, G. P., & Shavelson, R. J. (1993). On the stability of performance assessments. Journal of Educational Measurement, 30, 41–53.
Searle, S. R., Casella, G., & McCulloch, C. E. (1992). Variance components. New York: Wiley.
Shavelson, R. J., Baxter, G. P., & Gao, X. (1993). Sampling variability of performance assessments. Journal of Educational Measurement, 30, 215–232.
Shavelson, R. J., Baxter, G. P., & Pine, J. (1991). Performance assessments in science. Applied Measurement in Education, 4, 347–362.
Shavelson, R. J., Baxter, G. P., & Pine, J. (1992). Performance assessments: The rhetoric and reality. Educational Researcher, 21(4), 22–27.
Shavelson, R. J., & Webb, N. M. (1981). Generalizability theory: 1973–1980. British Journal of Mathematical and Statistical Psychology, 34, 133–166.
Shavelson, R. J., & Webb, N. M. (1991). Generalizability theory: A primer. Newbury Park, CA: Sage.
Shavelson, R. J., & Webb, N. M. (1992). Generalizability theory. In M. C. Alkin (Ed.), Encyclopedia of Educational Research (Vol. 2, pp. 538–543). New York: Macmillan.
Shavelson, R. J., & Webb, N. (2019). Generalizability theory and its contribution to the discussion of the generalizability of research findings. In K. Ercikan & W. Roth (Eds.), Generalizing from educational research: Beyond qualitative and quantitative polarization (pp. 13–32). New York: Routledge.
Shavelson, R. J., Webb, N. M., & Rowley, G. L. (1989). Generalizability theory. American Psychologist, 6, 922–932.
Sirotnik, K., & Wellington, R. (1977). Incidence sampling: An integrated theory for “matrix sampling.” Journal of Educational Measurement, 14, 343–399.
Smith, P. L. (1978). Sampling errors of variance components in small sample generalizability studies. Journal of Educational Statistics, 3, 319–346.
Smith, P. L. (1982). A confidence interval approach for variance component estimates in the context of generalizability theory. Educational and Psychological Measurement, 42, 459–466.
Tong, Y., & Brennan, R. L. (2007). Bootstrap estimates of standard errors in generalizability theory. Educational and Psychological Measurement, 67(5), 804–817.
Vispoel, W. P., Morris, C. A., & Kilinc, M. (2018). Practical applications of generalizability theory for designing, evaluating, and improving psychological assessments. Journal of Personality Assessment, 100, 53–67.
Webb, N. M., Rowley, G. L., & Shavelson, R. J. (1988). Using generalizability theory in counseling and development. Measurement and Evaluation in Counseling and Development, 21, 81–90.
Webb, N. M., Shavelson, R. J., & Maddahian, E. (1983). Multivariate generalizability theory. In L. J. Fyans (Ed.), New directions for testing and measurement: Generalizability theory: Inferences and practical applications (No. 18, pp. 67–82). San Francisco, CA: Jossey-Bass.


Webb, N. M., Shavelson, R. J., Kim, K. S., & Chen, Z. (1989). Reliability (generalizability) of job performance measurements: Navy machinist mates. Military Psychology, 1, 91–110.
Wiley, E. W. (2000). Bootstrap strategies for variance component estimation: Theoretical and empirical results. Unpublished doctoral dissertation, Stanford.
Wu, Y. F., & Tzou, H. (2015). A multivariate generalizability theory approach to standard setting. Applied Psychological Measurement, 39, 507–524.
Yin, P. (2005). A multivariate generalizability analysis of the Multistate Bar Examination. Educational and Psychological Measurement, 65, 668–686.

11 ITEM RESPONSE THEORY
A Historical Perspective and Brief Introduction to Applications
Richard M. Luecht and Ronald K. Hambleton¹

It is an understatement to say that item response theory (IRT) has changed the landscape of modern measurement. At the same time, IRT did not just magically appear in the 1980s as a comprehensive set of models and parametric estimators, software tools, and applications. Rather, it has evolved and matured from an initial set of theoretical statistical modeling concepts and estimation procedures to now include a wide array of statistical models for different types of response data used across a range of testing applications. On the one hand, increasingly more sophisticated IRT models and improved estimation techniques remain a focus for many investigators involved in psychometric research. At the same time, IRT has significantly impacted operational assessment practice, allowing organizations to design and calibrate item banks that greatly facilitate test form production and score processing. IRT has also profoundly shaped how test developers design and assemble tests. Moreover, modern computerized adaptive and multistage test designs would not be feasible without IRT. This chapter presents some key research dating from the 1920s; it also includes a discussion of practical applications for operational assessment programs highlighting the advantages of IRT. We present our story of IRT in three parts, each representing a particular theme: (a) IRT models and modeling; (b) parameter estimation and software; and (c) practical IRT applications in operational practice. The full story of IRT is more complicated and intriguing than we are able to present here. Nonetheless, we hope that our story ties together in a coherent manner some of the important research and developments that made “modern IRT” the preeminent measurement theory throughout the world.


IRT Models and Modeling
As Bock (1997) noted, IRT has a history that spans nearly a century (also see Hambleton & Swaminathan, 1985; Hambleton, Swaminathan & Rogers, 1991; van der Linden, 2016; Faulkner-Bond & Wells, 2016; Thissen & Steinberg, 2020). The underpinnings of IRT can certainly be found in two aspects of Thurstone’s (1925) seminal work on scaling. First, Thurstone provided a way to statistically “calibrate” test questions; that is, to statistically locate test questions relative to an underlying scale by using the observed patterns of examinee correct and incorrect responses. Second, he was able to visually demonstrate the empirical probabilities of correct responses as a function of that underlying scale.

Normal Ogive Modeling
Ogive models, especially normal ogive models, had become a relatively standard way of demonstrating the relationship between an observed ability score and the proportion of success by examinees on tasks by the end of World War II (Thissen & Steinberg, 2020). Lawley (1943) proposed using the ogive function in both item analysis and test construction. Tucker (1946) explicitly characterized the relationship between an observed response and an underlying “true score” using a product-moment correlation, based on the normal-ogive model.

Much of the earliest theoretical psychometric modeling research that led to what initially was called latent trait theory² was concerned with two basic problems. The first involved statistically modeling a response function for binary-scored test items, what Lazarsfeld (1950) called trace lines and Tucker (1946) called item characteristic curves (ICCs). In fact, Frederic Lord continued to use the phrase item characteristic curve theory until his Journal of Educational Measurement article “Practical Applications of Item Characteristic Curve Theory” was published (Lord, 1977). It was in that article that Lord appears to have first used the phrase item response theory.

However, specific to item response modeling, it was Lord who pulled together much of the more compelling published and unpublished research through the end of the 1940s in two seminal articles and a monograph: the first article in Educational and Psychological Measurement (Lord, 1953a) and the second in Psychometrika (Lord, 1953b). Both papers emerged from Lord’s monograph that served as his doctoral thesis (Lord, 1952) prepared under the direction of Harold Gulliksen at Princeton University. The monograph is often referred to as the birth of psychometric latent trait theory.

The Educational and Psychological Measurement article (Lord, 1953a) carefully laid out the theoretical foundations of three important aspects of the model. First, this article explicitly made use of an ogive-shaped item characteristic function, that is, $P_i(\theta) \equiv P_i$, conditional on a latent ability. He further clarified and illustrated the role of item discrimination and item difficulty in simultaneously locating and shaping the item characteristic function relative to the unknown θ. Second, he conceptualized the item characteristic function as additive in the test, yielding a test characteristic function (i.e., a true score, $\tau = \sum_i P_i$). Third, Lord demonstrated the connection between an observed-score frequency distribution and the item characteristic function using the generalized binomial function. As already noted, Lord was clearly influenced by Lawley’s (1943) and Tucker’s (1946) earlier modeling work. But he was also able to directly integrate Lazarsfeld’s (1950) developments of the likelihood function, now comprised of independent, conditional item characteristic functions (i.e., item response probabilities relative to a unidimensional latent ability that was distinct from a true score).

The Psychometrika article (Lord, 1953b) explicitly introduced the two-parameter normal-ogive (2PN) model to mathematically represent the item characteristic function (adapted³ from Lord, 1953b, pp. 58–59):

$$P(u_i = 1 \mid \theta) \equiv P_i = \int_{-\infty}^{a_i(\theta - b_i)} \frac{1}{\sqrt{2\pi}}\, e^{-.5z^2}\, dz \tag{1}$$

where the underlying trait, θ, is normally distributed with a mean of zero and variance of one and the item parameters are $a_i$, a weight coefficient estimated from the nonlinear regression of θ, and $b_i$, an item location on the θ metric. Working from Finney’s (1947) derivations of probit model estimators, Lord also derived the maximum likelihood estimators for the slope, $a_i$, the location parameter, $b_i$, and the ability parameters (see Equations 6 to 12). Following Tucker’s (1946) initiative to determine the correlation between a normal-ogive response function and a “true” score, Lord defined the slope (item discrimination) to be proportional to the correlation between the underlying latent trait, θ, and a normally distributed deviate underlying the normal-ogive function, $R_i$; that is,

$$a_i = R_i \Big/ \sqrt{1 - R_i^2} \tag{2}$$

Figure 11.1 demonstrates the utility of the normal-ogive function to characterize the probabilities at six values of θ where $\gamma_i = a_i(\theta - b_i)$ for a single ICC (top image) using the cumulative normal function with a constant value (e.g., $a_i = a = 1.0$ as shown in the middle images). The bottom image shows six corresponding ICCs where differing $b_i$ parameters yield values of γ = (–1.25, –.75, –.25, .25, .75, 1.25). Despite its theoretical utility to represent the ICCs, practical applications of the 2PN remained extremely limited by the cumbersome numerical computing steps needed to estimate the model parameters for even modest-length tests, a consideration that Lord (1953b) clearly recognized (p. 62).

Lord (1953b) also introduced a pseudo-guessing-adjusted three-parameter normal-ogive model as $P_i^{*} = P_i + (1 - P_i)/q$ as a type of formula-scoring mechanism for q alternatives in a binary-scored item, where $P_i$ is the 2PN model (Equation 1). Setting $c_i = 1/q$, the model could also be written as $P_i^{*} = c_i + (1 - c_i)P_i$, more closely aligning with Birnbaum’s subsequent logistic parameterization with a [random] pseudo-guessing adjustment of the response function.

FIGURE 11.1 Normal-ogive probabilities at γ = (–1.25, –.75, –.25, .25, .75, 1.25) with corresponding ICCs [panels plot $P_i$, the normal density, and P(γ) against a horizontal axis running from –3 to 3]
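The normal-ogive quantities discussed above can be sketched numerically. The following is a minimal illustration, not Lord's estimation procedure: it evaluates Equation 1 through the error function, converts a correlation into a slope as in Equation 2, and applies the guessing adjustment for q alternatives. The function names and the example values are assumptions made for this sketch.

```python
import math

def normal_cdf(z):
    """Cumulative standard normal, written with the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def icc_2pn(theta, a, b):
    """Two-parameter normal-ogive ICC (Equation 1): Phi(a * (theta - b))."""
    return normal_cdf(a * (theta - b))

def slope_from_correlation(r):
    """Slope implied by the correlation R_i (Equation 2): R / sqrt(1 - R^2)."""
    return r / math.sqrt(1.0 - r * r)

def icc_guessing(theta, a, b, q):
    """Guessing-adjusted ogive for q alternatives: P + (1 - P) / q."""
    p = icc_2pn(theta, a, b)
    return p + (1.0 - p) / q

# Hypothetical illustration: the six gamma values listed for Figure 11.1.
GAMMAS = (-1.25, -0.75, -0.25, 0.25, 0.75, 1.25)
ogive_probabilities = [normal_cdf(g) for g in GAMMAS]
```

Because the cumulative normal is symmetric, the probabilities at γ and –γ sum to one, which is a quick sanity check on any implementation of Equation 1.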

The Initial Rasch Family of Models
Georg Rasch was working in Denmark during the 1940s and 1950s on the same types of measurement problems that Lord was exploring in the United States. Rasch’s earliest work introduced a Poisson model for calibrating educational data. That led to the formulation of a model for the probability of success on binary scored items given a person’s ability, ξ, and item difficulty, δ:

$$P = \xi / (\xi + \delta). \tag{3}$$

Following the publication of his book, Probabilistic Models for Some Intelligence and Attainment Tests (Rasch, 1960), Rasch was invited to conduct a series of lectures at the University of Chicago where he began a long-term collaborative relationship with Benjamin Wright. It was not a coincidence that Rasch’s book was republished by MESA Press at the University of Chicago with the Foreword and Afterword both written by Wright (Rasch, 1960, 1980). Rasch and Wright’s strongest motivations were to create a family of models that satisfied the conditions of objective measurement where the latent scores do not depend on the particular collection of items used and where the item characteristics do not depend on the population of examinees used. As a mathematical statistician, Rasch had pondered this when demonstrating how, using a Poisson model, the conditional probability could be derived without reference to an underlying ability (see Rasch, 1960, 1980, pp. xxvii–xxviii).

Their collaborative relationship was certainly instrumental to the development and proliferation of the Rasch family of models because it was actually Wright and his students and colleagues who subsequently created many of the extensions to the Rasch model for handling polytomous response data (e.g., Andrich, 1978; Masters, 1982). Wright also cultivated and promoted Rasch’s conceptualizations of invariance and specific objectivity (Rasch, 1960, 1961), suggesting that these were hallmarks of true measurement consistent with Thurstone’s original scaling motivations (Wright & Stone, 1979; Engelhard, 1984, 2008, 2013). As Wright stated in the Foreword to the republished edition of Rasch’s book, “Objective measurement, that is measurement that transcends the measuring instruments, not only requires measuring instruments which can function independently of the objects measured, but also a response model for calibrating their functioning, which can separate instrument and object effects” (p. ix).

Outside of Denmark, the first demonstrated application of the logistic Rasch model for item analysis involving binary response data was not published until almost a decade later when Wright and Panchapakesan (1969) presented the familiar one-parameter Rasch logistic model (1PL):

$$P(u_i = 1 \mid \theta) \equiv P_i = \xi / (\xi + \delta_i) = [1 + \exp(b_i - \theta)]^{-1} \tag{4}$$

by introducing two simple identities, $\xi = \exp(\theta)$ and $\delta_i = \exp(b_i)$, to Rasch’s original model. Wright and Panchapakesan’s one-parameter Rasch model therefore characterized the item response probability function as the simple difference between the latent trait, θ, and an item location parameter (i.e., a difficulty parameter, $b_i$, in the context of most tests).
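The step from Rasch's ratio form to the logistic form can be checked numerically. The sketch below assumes the exponential link between the two parameterizations (person parameter exp(θ), item parameter exp(b)), which is what makes the ratio and the logistic expression in Equation 4 agree; the function names and test values are hypothetical.

```python
import math

def rasch_ratio(xi, delta):
    """Rasch's original form (Equation 3): xi / (xi + delta)."""
    return xi / (xi + delta)

def rasch_logistic(theta, b):
    """Logistic 1PL form (Equation 4): 1 / (1 + exp(b - theta))."""
    return 1.0 / (1.0 + math.exp(b - theta))

def log_odds(theta, b):
    """Log-odds of success under the 1PL; algebraically equal to theta - b."""
    p = rasch_logistic(theta, b)
    return math.log(p / (1.0 - p))

# Assumed example values on the logit scale.
theta, b = 0.7, -0.3
same_probability = abs(
    rasch_ratio(math.exp(theta), math.exp(b)) - rasch_logistic(theta, b)
) < 1e-12
```

The `log_odds` helper illustrates the "simple difference" reading of the model: comparing two items for the same person yields a log-odds gap that depends only on the two item locations, not on θ.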

Emergence of the 2PL and 3PL Models
As Bock (1997) noted, it was Alan Birnbaum who formalized and integrated many of the key theoretical elements of IRT as presented in his chapters in Statistical Theories of Mental Test Scores (Lord & Novick, 1968) and earlier in a set of Air Force Reports (Birnbaum, 1958a, 1958b). Birnbaum credited Berkson (1953, 1957) with creating the necessary theoretical justification and estimators for logistic response functions conditional on a single latent trait, θ. Using Berkson’s estimators, Birnbaum substituted these more computationally convenient logistic functions for the more numerically cumbersome normal-ogive model originally used by Lord. Birnbaum (1967, 1968) further noted that an advantage of the logistic function was that it did not require assumptions about the probability distribution of the latent trait, θ. Computationally, the use of a logistic function simplified the partial derivative terms essential for numerically estimating the maximum likelihood estimates of the item and person parameters. Additionally, Birnbaum proposed using the full response pattern to estimate the latent scores and he developed the important concept of conditional measurement information and conditional error variances for maximum likelihood estimates of θ. Finally, he introduced the more general three-parameter logistic (3PL) model,

$$P(u_i = 1 \mid \theta) \equiv P_i = c_i + (1 - c_i)\{1 + \exp[-a_i(\theta - b_i)]\}^{-1} \tag{5}$$

where $a_i$ is an item scoring weight often referred to as the discrimination parameter, $b_i$ is an item location representing the inflection point on the θ metric where the response probability is $(1 + c_i)/2$, and $c_i$ is a pseudo-guessing parameter that represents a left-adjusted shift in the response function to account for random noise (guessing) when θ is low-valued.

The D Constant
Slight variations on the 2PL and 3PL models also began to appear in the IRT literature during the 1970s (e.g., Hambleton & Cook, 1977). The model was by then typically re-expressed as

$$P(u_i = 1 \mid \theta) \equiv P_i = c_i + (1 - c_i)\{1 + \exp[-D a_i(\theta - b_i)]\}^{-1} \tag{6}$$

where the constant D in the exponent was set to 1.702 (or rounded to D = 1.7). As Camilli (1994) noted, this constant serves a rather simple purpose: to allow the 2PL and 3PL models to approximate the earlier normal-ogive models as shown in Figure 11.2. Haley (1952) had already derived the constant as a means of connecting the logistic function with the normal-ogive function. However, the logical justification for including the D-constant in the 3PL was surely to place the item and person statistics from the 2PL and 3PL model on a common reporting scale. In fact, the only practical utility of this scaling transformation was to be able to interchangeably use large-scale approximations of the normal-ogive item-parameter estimates (e.g., Urry, 1977) and the logistic model parameter estimates in a common item bank. Other than convention, there seems to be little need for the D-constant in modern IRT calibration and scoring.
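The role of the D constant is easy to verify numerically. The sketch below, offered as an illustration rather than as anyone's published code, compares the cumulative normal with the logistic function over a grid, with and without D = 1.702, and also evaluates the 3PL of Equation 6. The grid range, step count, and function names are assumptions.

```python
import math

D = 1.702  # scaling constant discussed in the text

def normal_ogive(z):
    """Cumulative standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def logistic(z, d=1.0):
    """Logistic function with an optional scaling constant d."""
    return 1.0 / (1.0 + math.exp(-d * z))

def icc_3pl(theta, a, b, c, d=D):
    """Three-parameter logistic ICC with the D constant (Equation 6)."""
    return c + (1.0 - c) * logistic(a * (theta - b), d)

def max_gap(d, lo=-4.0, hi=4.0, steps=801):
    """Largest absolute gap between the ogive and the scaled logistic on a grid."""
    grid = [lo + (hi - lo) * k / (steps - 1) for k in range(steps)]
    return max(abs(normal_ogive(z) - logistic(z, d)) for z in grid)

gap_scaled = max_gap(D)      # logistic with D = 1.702
gap_unscaled = max_gap(1.0)  # logistic with D = 1.0
```

On this grid the scaled logistic stays within about one percentage point of the normal ogive, whereas the unscaled logistic differs by more than ten percentage points near |z| of 1.5, which is the visual message of Figure 11.2.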

IRT Modeling Extensions IRT modeling in the 1960s and 1970s expanded along two distinct paths. One path involved some rather creative extensions of the 2PN and 2PL models for polytomous data (i.e., ordered response data such as partial-credit constructedresponse scores or Likert attitudinal response scales) and unordered responsechoice response. The second path also involved the development of models for polytomous data but stayed within the Rasch model family—requiring sufficient statistics. By the mid-1960s, Fumiko Samejima (1969, 1972) had independently initiated research on alternative models for educational performance data scored in ordered categories such as Likert-type response formats and for multiple selections. She developed both normal-ogive and logistic versions of what she called the graded response model (GRM) and generalized GRM. The GRM characterizes the 1.0 Normal cdf(γ) Logistic, D = 1.0 Logistic, D = 1.702

0.8

Pi (θ)

0.6 0.4 0.2 0.0 –3

FIGURE 11.2

–2

–1

0 θ

1

2

3

Comparisons of the three-parameter normal-ogive and logistic functions with and without D = 1.702

Item Response Theory 239

probability of particular scores or higher from ordered item observed scores (e.g., xi ∈ [0 to mi] on an essay test item, i, or partial-credit scoring for a constructedresponse item). The logistic version of the GRM can be written as P ðXi xik j Þ  Pik ¼ f1 þ exp½ai ð  bik Þ g1

ð7Þ

where ai is a slope parameter and bik is a location parameter for what Samejima termed the boundary probabilities. For example, an item with ordered response scores of x = (0,1,2,3) would have three boundary response functions, P ðXi 1j Þ ¼ Pi1 , and P ðXi 2j Þ ¼ Pi2 , P ðXi 3j Þ ¼ Pi3 , where P ðXi 0j Þ = 1. Probabilities for specific categorical responses can easily be obtained by subtracting adjacent boundary probabilities; e.g., P ðXi ¼ 2j Þ ¼ Pi1  Pi2 and P ðXi ¼ 0j Þ ¼ 1  Pi1 , simplifying to the 2PL for binary scores. Shortly thereafter Darryl Bock (1972) proposed a nominal response model (NRM) as an IRT-specific extension of the more general multinomial logit model. The NRM can be expressed as   P i P ðXi ¼ xik j Þ  Pik ¼ expðaik þ cik Þ= m ð8Þ j¼1 exp aij þ cij where aig is a category-specific slope parameter and cig is an intercept term for mi categorical choices associated with item i. Using a relatively simple reparameteriza0 0 tion of the model, where ½ai ci ¼ ½i i Ti, where Ti is a ðmi  1Þ  m matrix of linear constraints, Bock went on to elegantly present a highly general multinomial logit model,   P i ð9Þ P ðXi ¼ xik j Þ  Pik ¼ expðzi Þ= m g¼1 exp zig where

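$$\mathbf{z}_i' = [1 \;\; \theta] \begin{bmatrix} \boldsymbol{\gamma}_i' \\ \boldsymbol{\alpha}_i' \end{bmatrix} \mathbf{T}_i \tag{10}$$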

Bock’s model was conceptually similar to Samejima’s heterogeneous GRM, but more general. Unfortunately, neither model has seen widespread use except as an item-analysis tool for analyzing multiple-choice distractor patterns and for estimating the parameters of Wainer et al.’s (2007) testlet model. However, their initial collaboration at the University of North Carolina-Chapel Hill proved to be instrumental in harnessing and applying Bock’s considerable expertise in deriving creative Bayesian estimation solutions for complicated IRT estimation problems. Some of those solutions are presented in the next section of this chapter.
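As a rough illustration of Equations 7 and 8, the following Python sketch computes category probabilities under the logistic GRM (by differencing adjacent boundary probabilities) and under the NRM (as a multinomial logit). It is not taken from Samejima or Bock; the function names and all parameter values are hypothetical and chosen only for demonstration.

```python
import math


def grm_category_probs(theta, a, b):
    """Category probabilities for the logistic graded response model.

    theta: latent score; a: item slope; b: increasing boundary locations
    for scores 1..m. Returns [P(X=0), ..., P(X=m)].
    """
    # Boundary probabilities P(X >= k | theta): 1 for k = 0, Equation 7 for
    # k = 1..m, and 0 beyond the highest score.
    boundaries = [1.0]
    boundaries += [1.0 / (1.0 + math.exp(-a * (theta - bk))) for bk in b]
    boundaries.append(0.0)
    # Subtracting adjacent boundary probabilities gives each specific score.
    return [boundaries[k] - boundaries[k + 1] for k in range(len(b) + 1)]


def nrm_category_probs(theta, a, c):
    """Category probabilities for the nominal response model (Equation 8)."""
    z = [ak * theta + ck for ak, ck in zip(a, c)]
    z_max = max(z)  # subtract the maximum for numerical stability
    exp_z = [math.exp(zk - z_max) for zk in z]
    total = sum(exp_z)
    return [e / total for e in exp_z]


# Hypothetical four-category item under each model.
grm = grm_category_probs(theta=0.5, a=1.2, b=[-1.0, 0.0, 1.5])
nrm = nrm_category_probs(theta=0.5, a=[-1.0, 0.0, 0.4, 0.6], c=[0.2, 0.0, -0.3, 0.1])
print([round(p, 3) for p in grm])
print([round(p, 3) for p in nrm])
```

With a single boundary location the GRM function returns the 2PL probabilities, which mirrors the simplification to binary scores noted above.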


Other extensions of IRT models were tied to Rasch modeling traditions and resulted in relatively simple extensions of the one-parameter logistic model, often drawing on Rasch’s earlier generalizations of his logistic model to polytomous data. For example, Andrich (1978, 2016) developed the well-known rating-scale model (RSM):

$$P_{ix} = \frac{\exp \sum_{k=0}^{x} [\theta - (b_i - d_k)]}{\sum_{k=0}^{m} \exp \sum_{j=0}^{k} [\theta - (b_i - d_j)]} \tag{11}$$

where $b_i$ is an item parameter that locates the average response function relative to θ and $d_k$ is a threshold parameter that equally locates a common set of rating category response functions relative to $b_i$. A simple reparameterization of the model creates category-specific step parameters for each item, $d_{ik} = b_i - d_k$. Masters (1982) later explicitly included item-specific score points and associated threshold parameters, $d_{ij}$, for j = 1 to $m_i$ ordered categories, in his partial-credit model (PCM). That is,

$$P_{ix} = \frac{\exp \sum_{k=0}^{x} [\theta - (b_i - d_{ik})]}{\sum_{k=0}^{m} \exp \sum_{g=0}^{k} [\theta - (b_i - d_{ig})]} = \frac{\exp \sum_{k=0}^{x} (\theta - d_{ik})}{\sum_{k=0}^{m_i - 1} \exp \sum_{j=0}^{k} (\theta - d_{ij})}. \tag{12}$$

The threshold parameters are distinct from the item difficulty in most applications. An important feature of the upper parameterization of the PCM in Equation 12 is that the threshold parameters can also be constrained to be equal across items. Those constraints make Andrich’s (1978) rating scale model (RSM) a special (constrained) case of the PCM.

Development of new models continued over the decades following Samejima’s initial efforts. For example, David Thissen and Lynne Steinberg (1984) introduced the multiple-choice model (MCM) as an extension of Bock’s (1972) NRM to extract information from distractor patterns. Thissen and Steinberg (1986) continued their collaboration, developing a taxonomy for most of the popular IRT models that had been introduced up to that time. Still later, Muraki (1992, 1993) extended Masters’ PCM by adding an item discrimination parameter, $a_i$. Muraki’s generalized partial-credit model (GPCM) can be written as

$$P_{ix} = \frac{\exp \sum_{k=0}^{x} a_i[\theta - (b_i - d_{ik})]}{\sum_{k=0}^{m} \exp \sum_{g=0}^{k} a_i[\theta - (b_i - d_{ig})]} = \frac{\exp \sum_{k=0}^{x} a_i(\theta - d_{ik})}{\sum_{k=0}^{m_i - 1} \exp \sum_{j=0}^{k} a_i(\theta - d_{ij})}. \tag{13}$$

Like Masters (1982) and Masters and Wright (1984), Muraki’s model allows constraints to be placed on the item-specific parameters.
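The nesting of the RSM, PCM and GPCM can be sketched in a few lines of Python. This is only an illustrative sketch, not any package's implementation. It assumes the second (step-parameter) form of Equations 12 and 13, treats the k = 0 term as a constant that cancels, and assumes the RSM sign convention θ − (b_i − d_k) from Equation 11. The function names and parameter values are hypothetical.

```python
import math


def gpcm_probs(theta, steps, a=1.0):
    """Score-category probabilities for the generalized partial-credit model.

    steps: item-specific step parameters for scores 1..m (assumed to
    correspond to d_i1..d_im in the second form of Equation 13).
    a: item discrimination; a = 1.0 gives Masters' PCM.
    """
    # Cumulative sums of a * (theta - d_ik). The k = 0 term is shared by
    # every category and cancels, so it is fixed at zero here.
    cumulative = [0.0]
    for d in steps:
        cumulative.append(cumulative[-1] + a * (theta - d))
    top = max(cumulative)  # guard against overflow in exp()
    numerators = [math.exp(v - top) for v in cumulative]
    total = sum(numerators)
    return [v / total for v in numerators]


def rsm_probs(theta, b, thresholds):
    """Rating-scale model as a constrained PCM with thresholds shared across items."""
    # Assumed convention: each step term is theta - (b - d_k).
    return gpcm_probs(theta, [b - d for d in thresholds], a=1.0)


pcm = gpcm_probs(theta=0.4, steps=[-0.8, 0.1, 1.0])
gpcm = gpcm_probs(theta=0.4, steps=[-0.8, 0.1, 1.0], a=1.7)
rsm = rsm_probs(theta=0.4, b=0.2, thresholds=[1.0, 0.0, -1.0])
print([round(p, 3) for p in pcm])
print([round(p, 3) for p in gpcm])
print([round(p, 3) for p in rsm])
```

Writing the RSM as a call to the PCM routine, and the PCM as the GPCM with a = 1, makes the constraint relationships described above explicit in code.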


By the mid-1990’s, the proliferation of new and compelling parameterizations for the response functions in IRT modeling was surpassed only by the lack of estimators, computer software, and practical applications of all of the models being proposed by researchers in psychology and education. For example, van der Linden and Hambleton’s (1997) Handbook of Modern Item Response Theory presented twenty-seven IRT models ranging from multidimensional extensions of the logistic models to response time models. But consistent and robust estimators had only been developed for a small number of those models. And, the number of IRT models supported by accessible software calibration and scoring packages in the 1990’s was even smaller.

IRT Parameter Estimation and Software

The theoretical groundwork for IRT had been laid by the end of the 1970s despite the continued emergence of new models during the 1980s and 1990s. But another technical problem needed to be solved: model-parameter estimation. Most of the initial estimation work focused on Fisher’s (1925) introduction of maximum likelihood estimation (MLE) and Fisher and Yates’s (1938) development of the logistic item response function and the Newton-Raphson estimator. As noted earlier, Lord (1953b) made direct use of Finney’s MLE approach for the probit model to demonstrate the partial derivatives of the log likelihood function for the 2PN taken with respect to ai, bi, and θ. By the end of the 1970s joint maximum likelihood estimation (JMLE) began to emerge as the pre-eminent method for parameter estimation. However, Bayesian methods initially proposed in Lord and Novick (1968) and formalized in Bock and Lieberman (1970) and Bock and Aiken (1981) also began to emerge. The Bayesian estimation methods required assumptions about plausible prior distributions for parameters of a particular IRT model. Some, steeped in the MLE tradition, may have been uncomfortable with those assumptions. However, the development of efficient Bayesian estimators, especially Bock and Aiken’s (1981) implementation of marginal maximum likelihood using an adaptation of Dempster, Laird, and Rubin’s (1977) expectation-maximization (EM) algorithm, and the subsequent introduction of BILOG (Mislevy & Bock, 1983) quickly moved IRT along the path toward widespread use. However, that moves us too far ahead in the story of IRT parameter estimation. It is helpful to first understand why Bayesian influence was even necessary.

JMLE and CMLE for the Rasch Family of Models

Estimation of the parameters under the Rasch family of models was not a particular problem because Rasch (1960, 1961) had already demonstrated that consistent estimates of θ and bi could be obtained through conditional MLE (also see Andersen, 1973). In the unconditional case, the sufficiency of the raw scores for estimating the 1PL parameters has often been presented by setting the first partial derivative of the log likelihood function taken with respect to θ to zero such that

$$f' \equiv \partial \ln(L)/\partial\theta = x - \sum_{i=1}^{n} P_i \tag{14}$$

and $x = \sum_{i=1}^{n} P_i$ (e.g., Wright & Panchapakesan, 1969; Wright & Stone, 1979; Lord, 1980). A similar estimator exists for the item-difficulty parameters. As noted above, Andersen (1973) also formally derived conditional MLEs that would yield consistent estimates of the person and item parameters under the 1PL Rasch model. However, the computing power needed to implement conditional MLE proved to be rather impractical for most applications.

Wright and Panchapakesan (1969) had already derived the unconditional JMLE for the 1PL. Their approach would soon become the standard way to estimate the item parameters for most models within the Rasch family. Under the Rasch 1PL model the second-derivative term needed to apply Fisher’s Newton-Raphson algorithm likewise simplifies to $f'' \equiv -\partial^2 \ln(L)/\partial\theta^2 = \sum_{i=1}^{n} P_i(1 - P_i)$ so that the estimator updates the parameter estimates within each iterative cycle,

$$\hat\theta_r = \hat\theta_{r-1} + f'/f'', \tag{15}$$

until $f'$ is suitably small for all of the estimated θ scores (e.g., Wright & Panchapakesan, 1969; Wright & Stone, 1979). A similar iterative algorithm is applied for the item difficulty parameters. Wright & Masters (1982) also extended the unconditional joint maximum likelihood solutions to the RSM and PCM.

However, the unconditional JMLE method introduced by Wright and Panchapakesan (1969) is not really joint estimation. In practice, this two-step version of JMLE first computes the estimates of θ for all persons using fixed, initial approximations of the b-parameters and then uses those score estimates to compute the item-difficulty estimates. The process iterates until both sets of estimates become adequately stable. As noted, this type of iterative JMLE procedure also readily extends to the RSM and PCM. Today, JMLE remains the most popular estimation method implemented in Rasch-model software.
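The person-parameter half of that two-step cycle can be sketched as follows. This minimal Python example applies the update in Equation 15 to one examinee with the item difficulties held fixed. It is not the algorithm as coded in any Rasch package; the function names, starting value, tolerance and item difficulties are all assumptions made for illustration.

```python
import math


def rasch_p(theta, b):
    """1PL (Rasch) probability of a correct response."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))


def estimate_theta(responses, difficulties, theta=0.0, tol=1e-8, max_iter=50):
    """Newton-Raphson MLE of theta with the item difficulties treated as known."""
    x = sum(responses)  # raw score, the sufficient statistic
    n = len(responses)
    if x == 0 or x == n:
        raise ValueError("the MLE is not finite for zero or perfect raw scores")
    for _ in range(max_iter):
        probs = [rasch_p(theta, b) for b in difficulties]
        f1 = x - sum(probs)                      # Equation 14
        f2 = sum(p * (1.0 - p) for p in probs)   # negative second derivative
        theta += f1 / f2                         # Equation 15
        if abs(f1) < tol:
            break
    return theta


difficulties = [-1.5, -0.5, 0.0, 0.5, 1.5]  # hypothetical fixed b-parameters
theta_a = estimate_theta([1, 1, 1, 0, 0], difficulties)
theta_b = estimate_theta([0, 0, 1, 1, 1], difficulties)
print(round(theta_a, 4), round(theta_b, 4))
```

Both response patterns have a raw score of 3 and therefore receive the same estimate, which illustrates the sufficiency property discussed above. Zero and perfect scores are rejected because their likelihood has no finite maximum.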

JMLE for the 2PN, 2PL and 3PL Models

The parameter-estimation problem became more complex for the 2PN, 2PL and 3PL models, and would certainly have proved to be intractable for more complicated models like the GRM and the NRM. As previously noted, Lord (1953b) had extended Tucker’s (1946) derivations of the log-likelihood equations and derivatives under the 2PN with a latent variable, θ, but was not able to demonstrate a successful implementation of the estimators until much later.


The fundamental problem for JMLE estimation with more complicated IRT models was a lack of demonstrable statistical consistency; that is, proof of convergence of the estimates to the true parameters over infinitely large samples (Fisher, 1925; Neyman & Scott, 1948). Birnbaum (1968) succinctly summarized what only later was recognized as a potentially serious estimation problem (also see Lord, 1980) by proposing a maximum likelihood estimator of θ for the 2PL model. Setting the first partial derivative to zero, $\partial \ln(L)/\partial\theta = 0$, under the 2PL gives the result

$$\sum_{i=1}^{n} a_i u_i = \sum_{i=1}^{n} a_i \{1 + \exp[-a_i(\theta - b_i)]\}^{-1}. \tag{16}$$

In other words, the weighted raw score is a sufficient statistic if the weight (item discrimination) parameters are known. Without a sufficient statistic, there is no basis to justify treating the estimated θ parameters as incidental and simply summing over them. The JMLE problem was even more complex for the 3PL model where there are four unknown parameters to be estimated (i.e., ai, bi, ci for i = 1,…,n items and θj, j = 1,…,N for all uniquely observed response patterns). The log-likelihood derivative terms become

$$\sum_{h} \left[ \frac{u_h - P_h}{P_h Q_h} \right] \frac{\partial P_h}{\partial v} = 0 \tag{17}$$

where the summation is over items for the N response patterns and over response patterns for the n items. This general formulation only requires that derivatives be taken with respect to the model parameters of interest, v (adapted from Lord, 1980, pp. 179-180). However, as already noted, JMLE is typically implemented using an iterative, two-step estimation procedure of first estimating the θ values and then fixing those to estimate the item parameters. That is not joint estimation!

Warm (1989) proposed an information-weighting strategy that used an empirical Bayes rationale to correct for the potential bias and large-sample inconsistency in maximum likelihood estimates. Interestingly, his weighting correction, termed weighted maximum likelihood or WMLE, was later implemented in the WINSTEPS software package (Linacre, 2019) for Rasch model applications. However, his approach was never implemented for the 2PL or 3PL models, nor for the GRM or GPCM.

Marginal Maximum Likelihood Estimation (MMLE)

A creative solution to the JMLE problem was provided via a collaboration between Bock and Aiken. Combining Samejima’s (1969) rigorous proof of the conditions for the existence of the maximum of the posterior likelihood for any pattern of dichotomous or polytomous response using Bayesian statistics, Bock & Lieberman’s (1970) marginal maximum likelihood (MMLE) Bayesian solution over a quadrature grid to approximate the θ distribution, and Dempster, Laird, and Rubin’s (1977) expectation-maximization (EM) algorithm, Bock and Aiken (1981) were able to develop an EM algorithm to implement MMLE.

Under Bock & Aiken’s implementation of the EM algorithm for MMLE, the estimates of individual values of θ are replaced by the expected posterior distribution for the latent variable, $P(\theta \mid U)$. The associated integration needed to sum over the posterior is also replaced by a discretized Gauss-Hermite quadrature grid of points, θk, k = 1,…,q. That is, the posterior distribution for each unique response pattern, r, can be expressed as

$$P(\theta_k \mid U_r) = \prod_{i=1}^{n} P_{ik}^{u_i} Q_{ik}^{1-u_i} \, p_k \Big/ \tilde{P} \tag{18}$$

where $P_{ik} \equiv P_i(\theta_k)$, $Q_{ik} = 1 - P_{ik}$ and $\tilde{P}$ is a normalizing sum needed to form a proper probability density function. Bock & Aiken further demonstrated that these posterior distribution terms can be amassed as pseudo-counts by simply apportioning the frequency of each observed response pattern, U, at each of the q points. Merging Equations 17 and 18, we arrive at what are essentially expected derivative terms that can be maximized in the “M” step of the EM algorithm by using Newton-Raphson iterations until

$$\sum_{k}^{q} \sum_{i \in r} \left[ \frac{u_i - P_{ik}}{P_{ik} Q_{ik}} \right] \frac{\partial P_{ik}}{\partial \xi_i} \, P(\theta_k \mid U_r) = 0 \tag{19}$$

is satisfied to an acceptable tolerance level. The significance of this EM contribution to IRT estimation is often missed— perhaps because of the technical sophistication of the derivations (also see Swaminathan & Gifford, 1982, 1985a, 1985b; Thissen, 1982). The significance is two-fold. First, MMLE specifically separates item calibration—that is, estimating the item parameters—from the estimation of θ. That separation makes it feasible to employ efficient sampling strategies to optimize the calibration of one or more test forms or an item bank independent of examinee scoring and scaling procedures. Second, as long as we can specify a log-likelihood function and associated derivative terms for the structural parameters, this same basic EM algorithm extends to all of the models presented in this chapter. In fact, it can even be applied to multidimensional IRT estimation (e.g., Bock, Gibbons, & Muraki, 1988). Although Markov-chain Monte Carlo methods (MCMC) have emerged in the last two decades for multidimensional IRT and related factor analytic applications (e.g., Cai, 2010), MMLE remains the preferred method still used for operational item calibration with the 2PL, 3PL, GRM and GPCM unidimensional models.
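The "E" step described above can be illustrated with a short Python sketch that evaluates Equation 18 on a grid and apportions response-pattern frequencies into pseudo-counts. Several simplifications are assumed. An equally spaced grid with normalized standard-normal weights stands in for true Gauss-Hermite nodes, p_k is taken to be the prior weight at θk, the items follow the 2PL, and the "M" step is omitted. All names, item parameters and frequencies are hypothetical.

```python
import math


def two_pl(theta, a, b):
    """2PL probability of a correct response."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def make_grid(q=21, lo=-4.0, hi=4.0):
    """Equally spaced points with normalized standard-normal weights p_k."""
    points = [lo + k * (hi - lo) / (q - 1) for k in range(q)]
    density = [math.exp(-0.5 * t * t) for t in points]
    total = sum(density)
    return points, [d / total for d in density]


def posterior(pattern, items, points, weights):
    """Equation 18: posterior weight at each grid point for one response pattern."""
    joint = []
    for theta_k, p_k in zip(points, weights):
        likelihood = 1.0
        for u, (a, b) in zip(pattern, items):
            p = two_pl(theta_k, a, b)
            likelihood *= p if u == 1 else 1.0 - p
        joint.append(likelihood * p_k)
    norm = sum(joint)  # the normalizing sum
    return [j / norm for j in joint]


def expected_counts(patterns, freqs, items, points, weights):
    """Pseudo-counts: expected examinees and expected correct responses per grid point."""
    q = len(points)
    n_bar = [0.0] * q
    r_bar = [[0.0] * q for _ in items]
    for pattern, f in zip(patterns, freqs):
        post = posterior(pattern, items, points, weights)
        for k in range(q):
            n_bar[k] += f * post[k]
            for i, u in enumerate(pattern):
                r_bar[i][k] += f * u * post[k]
    return n_bar, r_bar


items = [(1.0, -1.0), (1.2, 0.0), (0.8, 1.0)]  # hypothetical (a, b) pairs
points, weights = make_grid()
n_bar, r_bar = expected_counts([(1, 1, 0), (0, 0, 0)], [30, 10], items, points, weights)
eap = sum(t * w for t, w in zip(points, posterior((1, 1, 0), items, points, weights)))
print(round(sum(n_bar), 6), round(eap, 3))
```

Because the pseudo-counts depend on the examinees only through the posterior weights, item calibration can proceed without individual θ estimates. The same posterior also yields an EAP score as a by-product.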


IRT Software

Practical applications of IRT were extremely limited until at least the mid-1980s (Hambleton & Cook, 1977; Bock, 1997; Thissen & Steinberg, 2020), largely due to the lack of accessible software.⁴ Researchers at ETS had access to Wood, Wingersky and Lord’s (1986) LOGIST software and researchers elsewhere had access to BICAL (Wright & Mead, 1978). But convenient packages for large-scale research and operational use simply did not exist until five or six years later.

A number of relatively low-cost commercial software packages emerged in the 1980s and 1990s, largely stemming from two organizations: MESA at the University of Chicago, headed by Benjamin Wright, and Scientific Software International (SSI), co-founded by Darryl Bock (independent of the University of Chicago). Wright & Linacre (1983) released MSCALE, a microcomputer program that used JMLE to estimate the parameters of the 1PL Rasch model, the RSM, and the PCM. A new version, BIGSTEPS (Wright & Linacre, 1991), was later released primarily for institutions having Unix computers. WINSTEPS (Wright & Linacre, 1998; Linacre, 2020) incorporated all of the features of MSCALE and BIGSTEPS, but also provided a graphical user interface.

SSI distributed three IRT calibration packages that dealt with non-Rasch applications. MULTILOG (Thissen, 1983, 1991) used MMLE to estimate the item characteristics for multi-category data under the 2PL model as well as the GRM, NRM and MCM. From the software’s perspective, the latter three models were implemented as special cases of the multinomial logit model with constraints on a design matrix (T in Equation 10). SSI also distributed BILOG (Mislevy & Bock, 1983) and its multigroup successor with an enhanced graphical interface, BILOG-MG (du Toit, 2003). BILOG and BILOG-MG implemented MMLE to estimate the item characteristics and further offered MLE scoring, Bayes expected a posteriori (EAP) score estimates (Bock & Mislevy, 1982), or maximum a posteriori (MAP) estimates (Mislevy, 1986).

Although there are certainly numerous other IRT packages available today, including a plethora of IRT application packages now available via the R programming language (R Core Team, 2020), the simple facts are that, together, MESA/WinSteps.com and SSI (later VPG) have brought IRT applications to fruition throughout the world since the mid-1980s by providing accessible software to end users.

Invariance, Robust Estimation and Data-Model Fit

The challenges for IRT were far from solved during the 1970s and 1980s merely by the development of improved estimation and software. In fact, three new related discrepancy-analytic issues arose during the late 1980s and continued into the 1990s and beyond: (i) detection and treatment of data-model misfit; (ii) parameter invariance and estimation robustness; and (iii) idiosyncratic versus intentional multidimensionality. These are complex issues; it is well beyond the scope of this chapter to delve accurately into the history of any of them. However, it would also be negligent on our part not to briefly mention them.

All three issues involve some type of residual, that is, a discrepancy involving three quantities (usually considered pair-wise): (1) the observed item response scores, $u_{ij}$ (indexed for item i and person j); (2) the theoretically true model as an expected response function (ERF), $E(u_{ij} \mid \theta_j, \xi_i) \equiv P_{ij}$, with model parameters, θj and ξi; and (3) the estimated ERF, $E(u_{ij} \mid \hat\theta_j, \hat\xi_i) = \hat{P}_{ij}$. Discrepancies can involve deviations or functions of the deviations between any of these three quantities. Optionally, we can consider discrepancies for observed response data collected under different conditions (e.g., paper-and-pencil versus computer-based testing). Or, we can consider discrepancies between different models using the estimated ERFs or for different groups.

Data-model fit looks for discrepancies between the observed data and the estimated ERFs. Large discrepancies imply that the item parameter estimates and/or scores may be misrepresented. For example, Yen’s (1984) Q1 and Q3 statistics considered the magnitude, variance and covariance of a type of residual under the 3PL model. Wright & Stone (1979) and Masters and Wright (1984) developed several item- and person-fit measures, based on residuals aggregated over examinees by item or across items by examinee, respectively, that continue to be used for the Rasch family of models. Hambleton, et al. (1991) and Wells and Hambleton (2016) summarize and provide examples of some useful graphical techniques for evaluating data-model fit. Glas (2016) provides a summary of many useful statistical residual misfit indicators. Finally, Sinharay (2016) summarizes Bayesian data-model fit and associated residual analyses.

Parameter invariance studies are most commonly concerned with differences in the estimated ERFs for two or more examinee population subgroups or different conditions of measurement. A large class of discrepancy-analytic differential item functioning (DIF) methods have been developed over the past three decades (see, for example, Holland & Wainer, 1993; Camilli, 2006; Penfield & Camilli, 2006). Engelhard (2013) and others have extended the parameter invariance concept to scoring processes and raters.

Multidimensionality detection is typically analyzed in one of three ways. The first involves analyzing patterns of residuals that may violate the local independence assumption required of unidimensional IRT models due to idiosyncratic or nuisance factors (McDonald, 1981, 1999; Yen, 1984, 1993). The second approach tends to use some variant of principal components or factor analysis to fit additional components or factors to the residualized correlation or covariance matrix (e.g., Ackerman, 1992, 2006). The third approach uses likelihood-based methods to more globally assess fit at the test level (e.g., Orlando & Thissen, 2000; also see Glas, 2016, Cohen & Cho, 2016, and Sinharay, 2016).


Practical IRT Applications

Frederic Lord was clearly one of the first to see the broad number of applications that could be developed via IRT. That is not to imply that Lord developed those applications. However, he did have the vision to see beyond technological limitations of the time (e.g., limited computing power, storage limitations, networking and software). In fact, although his earliest work focused on the theoretical underpinnings of what was to become IRT, Lord’s most effective contributions can be attributed to the four practical applications proposed in his 1977a article and later expanded in his 1980 book, Applications of Item Response Theory to Practical Testing Problems.

First, Lord demonstrated how to characterize the observed-score distribution conditional on θ from the item characteristic functions, using the generalized binomial function.⁵ His approach created a direct connection between classical test theory and IRT since the mean and variance could respectively be estimated as

$$\hat\mu_X = N^{-1} \sum_{j=1}^{N} \sum_{i=1}^{n} \hat{P}_{ij} \tag{20}$$

where $\hat{P}_{ij} = P_i(\hat\theta_j)$ is the estimate-based item characteristic function and

$$\hat\sigma^2_X = N^{-1} \sum_{j=1}^{N} \left( \hat\mu_X - \sum_{i=1}^{n} \hat{P}_{ij} \right)^2. \tag{21}$$

Second, Lord further showed that the characteristic functions could also be used to estimate the average error variance and estimate the classical test reliability coefficient,

$$\hat\rho_{XX'} = 1 - \hat\sigma_X^{-2} \left[ N^{-1} \sum_{j=1}^{N} \sum_{i=1}^{n} \hat{P}_{ij} \left( 1 - \hat{P}_{ij} \right) \right]. \tag{22}$$

Third, Lord devoted many of his publications in the 1970s and 1980s to the application of Birnbaum’s (1968) extension of Fisher’s scoring information function for test design. For example, his 1977 article explicitly used the phrase target test information function, a phrase that would later permeate much of the research literature on automated test assembly (ATA; see van der Linden, 2005; Luecht, 2014). The idea is straightforward with IRT. Put the measurement precision or peak test information where it is most needed with respect to specific values of θ or within specified regions/intervals along the latent scale. Fourth, Lord promoted using conditional scoring information for test assembly: for conventional, fixed-item test forms, for item-level adaptive tests, and even for multistage tests (also see Samejima, 1977). Finally, Lord made extensive contributions to exploiting IRT for test-score equating. Many of these applications were not feasibly implemented until much later, when computer technology and computational algorithms caught up. Lord clearly anticipated the value of statistically decomposing test-level functions into additive item functions, conditional on an unobserved latent variable.

Here, we expand on Lord’s contributions to IRT by summarizing some of the operational applications that have emerged since the early 1990s. We limit our story regarding this theme to three broad classes of applications: (1) item banking and scale maintenance; (2) test design and assembly; and (3) computerized adaptive testing (CAT) including item-level and multistage testing (MST). More complete descriptions of IRT applications are available in numerous books, journal papers and technical reports (e.g., see van der Linden, 2016; Wells & Faulkner Bond, 2016; Drasgow, 2016; van der Linden & Glas, 2010).
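Lord’s first application, the observed-score distribution conditional on θ, can be sketched with a simple recursion over items. The approach below is the kind of recursion usually credited to Lord and Wingersky (1984). That attribution and the choice of algorithm are assumptions on our part, since the text above names only the generalized binomial function. The function names and 3PL item parameters are hypothetical.

```python
import math


def three_pl(theta, a, b, c):
    """3PL probability of a correct response (no D scaling constant)."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))


def score_distribution(probs):
    """P(X = x | theta) for raw scores x = 0..n, given item success probabilities."""
    dist = [1.0]  # before any item is added, a raw score of 0 is certain
    for p in probs:
        new = [0.0] * (len(dist) + 1)
        for x, mass in enumerate(dist):
            new[x] += mass * (1.0 - p)   # item answered incorrectly
            new[x + 1] += mass * p       # item answered correctly
        dist = new
    return dist


items = [(1.0, -1.0, 0.2), (1.3, 0.0, 0.15), (0.9, 0.5, 0.25), (1.1, 1.2, 0.2)]
probs = [three_pl(0.5, a, b, c) for a, b, c in items]
dist = score_distribution(probs)
cond_mean = sum(x * p for x, p in enumerate(dist))
cond_var = sum((x - cond_mean) ** 2 * p for x, p in enumerate(dist))
print([round(p, 4) for p in dist])
print(round(cond_mean, 4), round(cond_var, 4))
```

The conditional mean equals the sum of the item probabilities and the conditional variance equals the sum of P(1 − P). These are the inner sums in Equations 20 and 22, which shows how test-level quantities decompose into additive item functions.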

Item Banking and Scale Maintenance

Lord (1977, 1980) envisioned using IRT-calibrated items to populate future test forms, eliminating the need for subsequent form-to-form test-score equating.⁶ He referred to this as pre-equating. Pre-equating changes the focus to establishing and maintaining the latent proficiency scale, θ, for an entire item bank so that any items selected from the bank can be used in test design and assembly as well as scoring. Pre-equating is also a prerequisite for adaptive testing. However, pre-equating implies that all items have been calibrated to the same item-proficiency scale, θ. As new items are developed, they need to be calibrated and then linked to the common θ metric. There are a number of effective IRT equating methods that have been developed for this purpose (see, for example, Wright, 1977; Wright & Stone, 1979; Lord, 1980; Stocking & Lord, 1984; Wingersky, Cook, & Eignor, 1987; Hanson & Béguin, 2002; Kolen & Brennan, 2014).

In general, there are two approaches to linking new items to an existing proficiency scale, both of which rely on including some number of linking items on all new test forms.⁷ One approach involves three basic steps. The first step is to separately calibrate all new test forms. This is called a local calibration. The second step uses items already calibrated on the θ scale, called linking items, to estimate an appropriate statistical linking function between the locally calibrated item parameter estimates and their item-bank estimates. Finally, the linking function is applied to the item parameter estimates for all new (previously uncalibrated) items. That is,

$$\hat\xi_{i(\theta_B)} = \mathrm{link}_{A \to B}\left[ \hat\xi_{i(\theta_A)} \right] \tag{23}$$

where $\mathrm{link}_{A \to B}[\cdot]$ denotes a [linear] linking function from the local scale, θA, to the bank scale, θB, estimated via the common items (Kolen & Brennan, 2014). The linked item parameters for all new items, $\hat\xi_{i(\theta_B)}$, are then added to the item bank database.
Once on the bank scale, θB, the items can be used along with any other items in the bank to build new test forms and for scoring, without necessarily needing to re-estimate their statistical psychometric characteristics.
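The three-step local calibration-and-link approach can be illustrated with one simple choice of linear linking function. The sketch below assumes the mean/sigma method, which sets the slope and intercept from the means and standard deviations of the linking items' difficulty estimates. The text does not say which linking method is intended, so this is only one of several options covered in the equating literature. All function names and parameter values are hypothetical.

```python
import statistics


def mean_sigma_link(b_local, b_bank):
    """Slope A and intercept B of theta_B = A * theta_A + B from linking items."""
    slope = statistics.pstdev(b_bank) / statistics.pstdev(b_local)
    intercept = statistics.mean(b_bank) - slope * statistics.mean(b_local)
    return slope, intercept


def to_bank_scale(a, b, slope, intercept):
    """Place one item's locally calibrated (a, b) estimates on the bank scale."""
    # Discriminations shrink by the slope and difficulties follow the line,
    # so a * (theta - b) is unchanged; a lower asymptote would not change.
    return a / slope, slope * b + intercept


# Hypothetical linking items: local difficulty estimates and bank values.
b_local = [-1.0, 0.0, 0.5, 1.5]
b_bank = [-0.9, 0.3, 0.9, 2.1]
slope, intercept = mean_sigma_link(b_local, b_bank)

# A new, previously uncalibrated item from the local calibration.
a_new, b_new = to_bank_scale(1.1, 0.5, slope, intercept)
print(round(slope, 3), round(intercept, 3), round(a_new, 3), round(b_new, 3))
```

In this made-up example the bank values are an exact linear function of the local values (slope 1.2, intercept 0.3), so the linking constants are recovered exactly. With real estimates the linking items carry error, which is one reason the purification of the linking set matters.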


The second approach is called anchored calibration. The sampling design can be identical to the local calibration-and-link design or highly sophisticated. The primary difference rests on how the calibrations are performed. For the anchor calibration, the linking items are treated as known and therefore fixed (i.e., constrained to their existing item bank parameter estimates) during the calibration. Any new items are simultaneously and freely estimated (i.e., unconstrained during the calibration). An anchor calibration results in all new items being automatically calibrated to the underlying item bank θ scale in a single calibration. Iterative procedures are sometimes also employed to purify the linking set of items, for example, by removing items that exhibit statistical parameter estimation drift or that otherwise exhibit large residuals between the observed data and model-based estimated response functions. When pre-equating works, it makes possible an integrated system for item pretesting, calibration, linking, test form assembly, and scoring. Figure 11.3 depicts this type of system at a fairly high level of abstraction. Efficient and effective sampling strategies control how examinees enter the system for calibration purposes. Pretesting, calibration and linking ensure that all items in the item bank are on a common scale. The calibrated item parameter estimates can be extracted from the item bank for scoring, that is, for computing individual IRT estimates of θ or to generate number-correct look-up tables for future use. The calibrated item statistics can also be used to assemble new test forms. That latter capability is addressed next as an enormously powerful application of IRT.
This type of calibration, test-form assembly, and scoring system is important because it implies that, if the usual IRT dimensionality and probabilistic independence assumptions hold, we can: (a) build as many test forms as needed— subject to exhausting the test item content requirements and staying within acceptable item-exposure/reuse levels; and (b) score any examinee on any test form provided that the items have been calibrated to the bank scale. This opens up the possibility of building many different test forms from the item bank and using calibrated item statistics to estimate examinees’ scores on the base score scale —what Lord (1977, 1980) envisioned as pre-equating. However, it also demands mention that IRT is only successful if the item bank contains stably calibrated statistics and if the scale properties are monitored over time. Ultimately, if we can improve the quality of those IRT-calibrated item statistics relative to the item-bank scale we may be improving the quality of the score scale, itself.

FIGURE 11.3 IRT as part of a comprehensive item-bank calibration and scoring enterprise (elements shown: Examinee Population; Sampling Mechanisms, Data Collection and Raw Scoring; Raw Response Data, U; IRT Item Calibration and Linking Procedures; Bank of Calibrated Item Parameter Estimates, ξ; Bank Proficiency Scale, θ; Scaling Procedures; Reported Proficiency Scale, y; Test-Form Assembly Specifications; Test-Form Assembly Mechanisms and Form Publication/Compilation; Live Test Forms)

Information Functions for Test Design and Assembly

The concept of conditional measurement information was first introduced by Birnbaum (1968). However, it was Lord (1977, 1980) who realized the enormous potential of having a conditional item-level statistic directly related to measurement error. His intuitions were even apparent in the 1950s, when he demonstrated the contribution of test items to conditional measurement errors under the normal ogive model (Lord, 1953b). IRT information for a single binary-scored item can be expressed as

$$I_i(\theta) = \left( \partial P_i / \partial\theta \right)^2 \big/ P_i Q_i \tag{24}$$

where $P_i$ is the 1PL, 2PL or 3PL model response function (Birnbaum, 1968). Birnbaum further demonstrated that the conditional measurement error variance of the estimates of the latent score, θ, is inversely proportional to the test information function (TIF), denoted I(θ). That is,

$$\mathrm{var}(\hat\theta \mid \theta) = I(\theta)^{-1} = \frac{1}{\sum_{i}^{n} I_i(\theta)} \tag{25}$$

where $I_i(\theta)$ is the item information function. Since the conditional item information functions are additive, we can define a target TIF to denote where and how much measurement precision we want along the θ-scale. Figure 11.4 displays the item information and TIF functions for a 10-item test calibrated using the 3PL model. The more peaked an information curve is, the more precision the item or test contributes within that peaked region of the score scale. Correspondingly, higher information implies smaller errors of measurement within the same region of the scale. The left-hand image in Figure 11.4 shows that the information available in the ten items is spread across a reasonably wide range of the θ scale. The right-hand image shows that, when the item information functions are added at each θ value, the TIF shows overall peak information roughly for 0 ≤ θ ≤ 2. If we had the highest concentration of examinees in that region of the scale or if we needed to make a critical classification decision such as issuing a license or valued professional certificate near θ = 1, this test would be optimally designed.

FIGURE 11.4 Item and test information functions for a 10-item test (3PL-calibrated) (left panel: information functions for Items 1 to 10; right panel: test information function; vertical axis: Item and Test Information; horizontal axes: θ)

Lord (1977) was also one of the first to suggest selecting items to either meet a target TIF (also see Samejima, 1977) or to maximize the item information amassed at a particular examinee’s apparent score estimate, a premonition of item-level computerized adaptive testing and multistage testing. Two related test-development applications emerged from Lord’s suggestions. Both have been greatly enhanced over the past three decades. The first involves using item and test information functions for test design (also see Luecht, 2015). The second involves using linear optimization algorithms to select items to either maximize information or to meet a designed target TIF.

An additional use of IRT for test development involves test assembly. A test assembly can be represented by a linear model denoting an item selection process that chooses n items while simultaneously satisfying content requirements and some psychometric goal such as meeting a target TIF or maximizing the conditional precision of a test at some point along the θ scale (van der Linden, 2005). When the linear model is solved by optimization software, it becomes automated test assembly (ATA). Letting I(θ) denote a target TIF and following van der Linden (2005), we can write the optimizing ATA item-selection model as

$$\text{minimize } y \tag{26}$$

$$\text{subject to: } \sum_{i=1}^{I} x_i I_i(\theta_k) \le I(\theta_k) + y \quad \text{for } k = 1, \ldots, q \tag{27}$$

$$\sum_{i=1}^{I} x_i I_i(\theta_k) \ge I(\theta_k) - y \quad \text{for } k = 1, \ldots, q \tag{28}$$

$$\sum_{i=1}^{I} x_i = n \tag{29}$$

$$x_i \in \{0, 1\}, \tag{30}$$

where n is the test length and x_i is a binary decision variable (i.e., x_i = 1 if item i is selected and x_i = 0 otherwise). Additional content and other constraints can be incorporated into the ATA item-selection model. ATA solvers range from very fast, approximately optimal algorithms that can handle extremely large models and many simultaneous test forms to linear and nonlinear optimization solvers that provide a globally optimal (that is, best-possible) solution (see van der Linden, 2005; Luo, 2020). The item-selection part of adaptive testing and top-down multistage test assembly can make use of a similar ATA model by replacing Equations 26 to 28 with a single objective function,

\[ \text{maximize } \sum_{i=1}^{I} x_i I_i(\theta) \tag{31} \]

at a specified score point, θ, or provisional estimate of θ (for real-time item selection and scoring). These two basic ATA models can be generalized to almost any type of computer-based test delivery model (van der Linden, 2005; Luecht, 2013) merely by varying the implementation of that item-selection process. Figure 11.5 demonstrates the relationship between a calibrated item bank, ATA (test form content specifications, TIF targets, and an optimization solver), and the test forms. In this depiction, three different target TIFs are used. Units E1 to E3 are the easier test forms (units), units M1 to M3 are the moderately difficult test forms, and units D1 to D3 are the most difficult test forms.

FIGURE 11.5 Test assembly of multiple forms for three different target TIFs. Diagram elements: calibrated item bank; content specifications (Content Areas A, B, and C); item selection and test assembly; units E1 to E3, M1 to M3, and D1 to D3 arranged along the proficiency scale from low (−2.0) to high (+2.0); route information functions (M1+E2, M1+M2, M1+H2) plotted against proficiency scores, θ.
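As a rough illustration of the two ATA models above, the following Python sketch computes 3PL item information, sums it into a TIF, and then selects items either to track a target TIF (the minimax model of Equations 26 to 30) or to maximize information at a single θ (Equation 31). Several things here are assumptions rather than anything taken from the chapter: the function names, the toy item bank, the target values, and the scaling constant D = 1.7. The minimax model is solved by exhaustive enumeration, which is workable only for a tiny bank and merely stands in for the mixed-integer solvers cited in the text; content constraints are omitted.

```python
import math
from itertools import combinations

D = 1.7  # assumed logistic scaling constant; use 1.0 for the pure logistic metric


def p3pl(theta, a, b, c):
    """Probability of a correct response under the 3PL model."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))


def item_info(theta, a, b, c):
    """3PL item information at theta."""
    p = p3pl(theta, a, b, c)
    return (D * a) ** 2 * ((p - c) / (1.0 - c)) ** 2 * (1.0 - p) / p


def tif(theta, items):
    """Test information: item information functions are additive."""
    return sum(item_info(theta, a, b, c) for a, b, c in items)


def assemble_minimax(bank, n, theta_points, target):
    """Pick n items minimizing y = max_k |TIF(theta_k) - target_k|.

    Brute force over all n-item subsets (a stand-in for a MIP solver).
    """
    info = [[item_info(t, a, b, c) for t in theta_points] for a, b, c in bank]
    best_form, best_y = None, float("inf")
    for form in combinations(range(len(bank)), n):
        y = max(abs(sum(info[i][k] for i in form) - target[k])
                for k in range(len(theta_points)))
        if y < best_y:
            best_form, best_y = form, y
    return best_form, best_y


def assemble_maxinfo(bank, n, theta):
    """Pick the n items with the most information at a single theta."""
    ranked = sorted(range(len(bank)),
                    key=lambda i: item_info(theta, *bank[i]), reverse=True)
    return sorted(ranked[:n])


if __name__ == "__main__":
    # Hypothetical (a, b, c) parameters, for illustration only.
    bank = [(0.8, -1.5, 0.20), (1.2, -0.8, 0.15), (1.0, -0.2, 0.20),
            (1.4, 0.1, 0.25), (0.9, 0.5, 0.20), (1.3, 0.9, 0.15),
            (1.1, 1.2, 0.20), (1.5, 1.6, 0.20), (0.7, 2.0, 0.25),
            (1.0, 2.4, 0.20)]
    points = [0.0, 1.0, 2.0]
    target = [1.5, 2.0, 1.5]  # made-up target TIF values at the theta points
    form, y = assemble_minimax(bank, 5, points, target)
    print("minimax form:", form, "max deviation:", round(y, 3))
    print("max-info form at theta = 1:", assemble_maxinfo(bank, 5, 1.0))
```

Bounding the absolute deviation by y is equivalent to imposing the upper and lower constraints of Equations 27 and 28 together. With the test length fixed and no other constraints, the maximum-information model reduces to ranking items by their information at θ, which is why no search is needed in that case.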

Applications of IRT for Adaptive and Multistage Testing

Adaptive testing has been around since the 1980s (see Lord, 1980; Weiss & Davidson, 1981; Kingsbury & Weiss, 1983). But it became more feasible as the first wide-area networks and internet applications emerged in the mid-1990s (e.g., Eignor, Stocking, Way, & Steffen, 1993; Mills & Stocking, 1996; Zara, 1994; Sands, Waters, & McBride, 1997). An item-level adaptive test is really the combination of two separate processes: (a) a scoring procedure that estimates the latent scores in real time and (b) an item-selection process that chooses one or more items to maximize the TIF at an examinee’s individual provisional score. There are a number of variations of “CAT”, but the basic test delivery model starts the test with a small number of items to initialize scoring. The CAT then proceeds until a particular stopping rule such as fixed test length, minimized error variance requirement, or decision utility criterion is met (see Sands et al., 1997; van der Linden & Glas, 2010). Figure 11.6 shows 50-item simulated CATs for five examinees with true abilities of θ = (−2, −0.75, 0, 1.25, 2.5). The left-hand image shows the ability estimates of θ, which eventually converge to their true values. The error bars denote the standard errors of estimate (SE = square root of Equation 25). As more items are administered, the SEs get smaller. The right-hand image shows the location of maximum information for the 50 selected items per examinee from the 600-item bank.

FIGURE 11.6 50-Item CATs for five examinees (Item Bank: I = 600 3PL-Calibrated Items). Both panels plot the proficiency scale, θ, against the CAT item sequence (0 to 50) for θ1 = −2, θ2 = −0.75, θ3 = 0, θ4 = 1.25, and θ5 = 2.5.

Unfortunately, MLEs proved to be far less stable when the 2PL or 3PL models were used in CAT, especially during the early part of the CAT item-selection sequence when the number of items administered was too small to produce a well-conditioned likelihood function. Two solutions were proposed in the 1980s, both involving Bayesian estimates. Bock and Mislevy (1982) initially proposed using EAPs, that is, the expected value or mean of the posterior distribution (Equation 18). Mislevy (1986) later introduced a MAP score estimator (i.e., the mode of the posterior distribution).

Another variation on item-level CAT was demonstrated to be feasible by the mid- to late 1990s. Lord (1980) and others had already suggested the concept of multistage testing (MST). But a feasible way of creating and deploying auto-adaptive panels composed of pre-assembled modules did not emerge until the next decade (see, for example, Yan & von Davier, 2019, for a more complete historical perspective). Figure 11.7 shows five different MST panel configurations. Each panel is a pre-assembled adaptive test that can route an examinee to easier or harder modules at each stage. The upper row in Figure 11.7 shows two 2-stage panel configurations. The lower row presents three 3-stage configurations. Modules can be small to moderate-sized units composed of discrete items, item sets, or even complex computerized performance exercises. The pre-assembly aspect of MST improves test-form quality control for real-time adaptive testing and can also simplify scoring and routing mechanisms (Luecht & Nungester, 1998; Luecht, 2014, 2016). Using several different ATA modeling approaches, Luecht and Nungester (1998) and Luecht, Brumfield, and Breithaupt (2006) were able to demonstrate successful implementations of the first large-scale MST panel assembly methods and operational feasibility.
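Returning to the item-level CAT described earlier in this section, the two processes (real-time scoring and maximum-information item selection) can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the simulation behind Figure 11.6: the item parameters are randomly generated and hypothetical, D = 1.7 is assumed, scoring uses an EAP estimate computed on a fixed quadrature grid with a standard normal prior, the reported SE is the posterior standard deviation, the stopping rule is a fixed test length, and the test starts from the prior mean rather than from an initial block of items. All function names are invented for the example.

```python
import math
import random

D = 1.7  # assumed logistic scaling constant
GRID = [-4.0 + 0.1 * q for q in range(81)]        # quadrature points on [-4, 4]
PRIOR = [math.exp(-0.5 * t * t) for t in GRID]    # unnormalized N(0, 1) prior


def p3pl(theta, a, b, c):
    """Probability of a correct response under the 3PL model."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))


def item_info(theta, a, b, c):
    """3PL item information at theta."""
    p = p3pl(theta, a, b, c)
    return (D * a) ** 2 * ((p - c) / (1.0 - c)) ** 2 * (1.0 - p) / p


def eap(items, responses):
    """Posterior mean and SD of theta given scored responses (1 or 0)."""
    post = list(PRIOR)
    for (a, b, c), u in zip(items, responses):
        for q, t in enumerate(GRID):
            p = p3pl(t, a, b, c)
            post[q] *= p if u == 1 else 1.0 - p
    total = sum(post)
    mean = sum(t * w for t, w in zip(GRID, post)) / total
    var = sum((t - mean) ** 2 * w for t, w in zip(GRID, post)) / total
    return mean, math.sqrt(var)


def run_cat(bank, true_theta, test_length, rng):
    """Fixed-length CAT: select the most informative unused item at the
    provisional estimate, simulate a response, then rescore."""
    used, items, responses = [], [], []
    theta_hat, se = 0.0, 1.0  # start at the prior mean and prior SD
    for _ in range(test_length):
        candidates = [i for i in range(len(bank)) if i not in used]
        pick = max(candidates, key=lambda i: item_info(theta_hat, *bank[i]))
        u = 1 if rng.random() < p3pl(true_theta, *bank[pick]) else 0
        used.append(pick)
        items.append(bank[pick])
        responses.append(u)
        theta_hat, se = eap(items, responses)
    return theta_hat, se, used


if __name__ == "__main__":
    rng = random.Random(2021)
    # Hypothetical 600-item bank with randomly drawn (a, b, c) parameters.
    bank = [(rng.uniform(0.6, 1.8), rng.uniform(-3.0, 3.0), 0.2)
            for _ in range(600)]
    for true_theta in (-2.0, -0.75, 0.0, 1.25, 2.5):
        est, se, _ = run_cat(bank, true_theta, 50, rng)
        print(f"true = {true_theta:5.2f}  EAP = {est:5.2f}  SE = {se:4.2f}")
```

Using EAP rather than maximum likelihood keeps the provisional estimate finite even after a single response or an all-correct response pattern, which is the early-test instability that the Bayesian estimators discussed above were meant to address.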

FIGURE 11.7 Examples of five MST panel configurations. Upper row: two 2-stage panels; lower row: three 3-stage panels. Modules are labeled by difficulty and stage (e.g., M1, E2, H2, ME3, VH3), and multiple panels can be assembled for each configuration.

Conclusions

Our three-part story is not complete. First, it is simply not possible to tell the complete and fascinating history of IRT research, models, and applications in a single book chapter or even in a single book. Some topics like data-model fit and parameter invariance could have been given a more expansive treatment, especially regarding their historical development. Second, IRT applications are now developing so quickly and in such variety that they may be outpacing the research, development of operational software resources, and our capabilities to verify the utility of those new applications.

That said, there are four broad classes of IRT applications that are likely to flourish in the future. The first is the application of multidimensional and constrained latent class models to non-cognitive multidimensional constructs in psychology and personality assessment as well as health care and quality of life. Much of the IRT research literature has been dominated by educational tests using dichotomously or polytomously scored data. However, there are psychological, personality, and diagnostic assessment contexts that use novel item response formats and that are inherently multidimensional, requiring a profile of scores.

A second class of applications will undoubtedly involve new developments in computerized adaptive and multistage testing. For example, ongoing, principled item-design research is making it possible to focus item calibration on an entire family of items, rather than on individual items. In an adaptive context, that implies routing to different item families and selecting from within the family. IRT calibration designs and scoring can also benefit from the hierarchical structure of the item families (e.g., Geerlings et al., 2011; Sinharay & Johnson, 2012). Auto-adaptive items are also a possibility, in which item-design features are calibrated and manipulated to optimally alter each item for individual examinees, possibly in different settings.

A third class will certainly involve new types of response and process data available from structured and unstructured assessment-related tasks, workflows, or computer log files. The extent of new modeling and estimation procedures required to contend with these different types of data is both exciting and a bit daunting.

Finally, more efficient and effective data visualization methods are becoming available for score reporting, for fit and exploration of parameter invariance, and for calibration and scoring quality assurance applications. There is little doubt that many aspects of data science and data visualization will find their way into psychometrics.

So, although this historical story is indeed incomplete, it is important to realize that the future will someday become part of that same history. It is intriguing, challenging, and thrilling to contemplate those future epochs in IRT theory, research, and applied practice.

Notes

1 The authors would like to thank George Engelhard for sharing his insights about the history of Rasch measurement and the many significant psychometric contributions by faculty and former students at the University of Chicago. They would also like to thank Brian Clauser, Jerome Clauser, and Peter Baldwin for their assistance in editing the final version of this chapter.
2 Note that Lord later began to refer to latent trait theory as item characteristic theory and, still later, as item response theory (Lord, 1977, 1980).
3 Lord’s notation is changed here to match the more conventional IRT notation in use today.
4 Accessible implies that (a) the software can be used on a non-proprietary basis on a local mainframe, mini-, or microcomputer (possibly with reasonable software licensing fees), and (b) there is a users’ guide with examples and a coherent interface (graphical and/or command-driven).
5 Although hinted at in his 1977 paper, Lord subsequently used a very powerful recursive algorithm for estimating the unconditional distribution of observed scores (Lord & Wingersky, 1984).
6 See Kolen and Brennan (2014) for a very complete treatment of equating designs, methods, and applications, including IRT equating methods.
7 We are pre-supposing non-equivalent groups taking the test form. Within the same test administration window, test forms can be randomly assigned to examinees (or otherwise constructed in real time with randomization included in the item selection mechanism) as part of a randomly equivalent groups linking design (e.g., Kolen & Brennan, 2014).

References

Ackerman, T. A. (1992). A didactic explanation of item bias, item impact, and item validity from a multidimensional perspective. Journal of Educational Measurement, 29, 67–91.
Ackerman, T. A. (2005). Multidimensional item response theory modeling. In A. Maydeu-Olivares & J. J. McArdle (Eds.), Contemporary psychometrics (pp. 1–26). Mahwah, NJ: Lawrence Erlbaum Associates.

Andersen, E. B. (1973). A goodness of fit test for the Rasch model. Psychometrika, 38, 123–140.
Andrich, D. (1978). A rating formulation for ordered response categories. Psychometrika, 43, 561–573. doi:10.1007/BF02293814.
Andrich, D. (2016). Rasch rating-scale model. In W. J. van der Linden (Ed.), Handbook of item response theory, volume one: Models (pp. 75–94). New York: Chapman & Hall/CRC.
Berkson, J. (1953). A statistically precise and relatively simple method of estimating the bio-assay with quantal response, based on the logistic function. Journal of the American Statistical Association, 48, 565–599.
Berkson, J. (1957). Tables for the maximum likelihood estimate of the logistic function. Biometrics, 13, 28–34.
Birnbaum, A. (1958a). On the estimation of mental ability (Series Rep. No. 15, Project No. 7755–7723). Randolph Air Force Base, TX: USAF School of Aviation Medicine.
Birnbaum, A. (1958b). Further considerations of efficiency in tests of a mental ability (Tech. Rep. No. 17, Project No. 7755–7723). Randolph Air Force Base, TX: USAF School of Aviation Medicine.
Birnbaum, A. (1967). Statistical theory for logistic mental test models with a prior distribution of ability (Research Bulletin No. 67-12). Princeton, NJ: Educational Testing Service.
Birnbaum, A. (1968). Some latent trait models and their use in inferring an examinee’s ability. In F. M. Lord & M. R. Novick (Eds.), Statistical theories of mental test scores (pp. 397–479). Reading, MA: Addison-Wesley.
Bock, R. D. (1972). Estimating item parameters and latent ability when responses are scored in two or more nominal categories. Psychometrika, 37, 29–51.
Bock, R. D. (1997). A brief history of item response theory. Educational Measurement: Issues and Practice, 16(4), 21–33.
Bock, R. D., & Aitkin, M. (1981). Marginal maximum likelihood estimation of item parameters: Application of an EM algorithm. Psychometrika, 46, 443–459.
Bock, R. D., Gibbons, R., & Muraki, E. (1988). Full-information item factor analysis. Applied Psychological Measurement, 12, 261–280.
Bock, R. D., & Lieberman, M. (1970). Fitting a response model for n dichotomously scored items. Psychometrika, 35, 179–197.
Bock, R. D., & Mislevy, R. J. (1982). Adaptive EAP estimation of ability in a microcomputer environment. Applied Psychological Measurement, 6, 431–444.
Bock, R. D., & Zimowski, M. F. (1995). Multiple group IRT. In W. van der Linden & R. K. Hambleton (Eds.), Handbook of item response theory (pp. 433–448). New York: Springer-Verlag.
Cai, L. (2010). High-dimensional exploratory item factor analysis by a Metropolis–Hastings Robbins–Monro algorithm. Psychometrika, 75, 33–57.
Cai, L. (2013a). flexMIRT Version 2.0: Flexible multilevel item analysis and test scoring (Computer software). Chapel Hill, NC: Vector Psychometric Group LLC.
Cai, L. (2013b). Three cheers for the asymptotically distribution free theory of estimation and inference: Some recent applications in linear and nonlinear latent variable modeling. In R. C. MacCallum & M. C. Edwards (Eds.), Current topics in the theory and application of latent variable models. New York: Taylor & Francis.
Cai, L. (2015). IRTPRO: Flexible, multidimensional, multiple categorical IRT modeling. In W. J. van der Linden (Ed.), Handbook of item response theory: Vol. 3. Boca Raton, FL: Chapman & Hall/CRC.
Cai, L., & Thissen, D. (2014). Modern approaches to parameter estimation in item response theory. In S. P. Reise & D. Revicki (Eds.), Handbook of item response theory modeling: Applications to typical performance assessment. New York: Taylor & Francis.

Cai, L., Thissen, D., & du Toit, S. H. C. (2011). IRTPRO: Flexible, multidimensional, multiple categorical IRT modeling (Computer software). Chicago, IL: Scientific Software International.
Carlson, J. E., & von Davier, M. (2017). Item response theory. In R. E. Bennett & M. von Davier (Eds.), Advancing human assessment: The methodological, psychological, and policy contributions of ETS (pp. 133–178). New York: Springer Open Book. https://doi.org/10.1007/978-3-319-58689-2_5.
Camilli, G. (1994). Origin of the scaling constant d=1.7 in item response theory. Journal of Educational and Behavioral Statistics, 19, 293–295.
Camilli, G. (2006). Test fairness. In R. L. Brennan (Ed.), Educational measurement, 4th edition (pp. 220–256). Westport, CT: American Council on Education.
Cohen, A. S., & Cho, S-J. (2016). Information criteria. In W. J. van der Linden (Ed.), Handbook of item response theory, volume two: Statistical tools (pp. 363–378). New York: CRC Press/Chapman Hall.
Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm (with Discussion). Journal of the Royal Statistical Society, Series B, 39, 1–38.
Drasgow, F. (Ed.) (2016). Technology and testing. New York: Routledge.
du Toit, M. (Ed.) (2003). IRT from SSI. [Computer programs]
Eignor, D. R., Stocking, M. L., Way, W. D., & Steffen, M. (1993). Case studies in computer adaptive test design through simulation (RR-93-56). Princeton, NJ: Educational Testing Service.
Engelhard, G. (1984). Thorndike, Thurstone and Rasch: A comparison of their methods of scaling psychological and educational tests. Applied Psychological Measurement, 8, 21–38.
Engelhard, G. (2008). Historical perspectives on invariant measurement: Guttman, Rasch, and Mokken. Measurement, 6, 155–189. doi:10.1080/15366360802197792.
Engelhard, G. (2013). Invariant measurement: Using Rasch models in the social, behavioral, and health sciences. New York: Routledge.
Faulkner-Bond, M., & Wells, C. S. (2016). A brief history of and introduction to item response theory. In C. S. Wells & M. Faulkner-Bond (Eds.), Educational measurement: From foundations to future (pp. 107–125). New York: The Guilford Press.
Finney, D. J. (1947). Probit analysis. Cambridge, UK: Cambridge University Press.
Fisher, R. A. (1925). Theory of statistical estimation. Proceedings of the Cambridge Philosophical Society, 22, 699–725.
Fisher, R. A., & Yates, F. (1938). Statistical tables for biological, agricultural and medical research. Oxford, UK: Oliver & Boyd.
Geerlings, H., van der Linden, W. J., & Glas, C. A. W. (2011). Modeling rule-based item generation. Psychometrika, 76, 337–359.
Glas, C. A. W. (2016). Frequentist model-fit tests. In W. J. van der Linden (Ed.), Handbook of item response theory, volume two: Statistical tools (pp. 343–362). New York: CRC Press/Chapman Hall.
Haley, D. C. (1952). Estimation of the dosage mortality relationship when the dose is subject to error. Stanford: Applied Mathematics and Statistics Laboratory, Stanford University, Technical Report 15.
Hambleton, R. K., & Cook, L. L. (1977). Latent trait models and their use in the analysis of educational test data. Journal of Educational Measurement, 14(2), 75–96.
Hambleton, R. K., & Swaminathan, H. (1985). Item response theory: Principles and applications. Boston: Kluwer-Nijhoff Publishing.

Hambleton, R. K., Swaminathan, H., & Rogers, H. J. (1991). Fundamentals of item response theory. Thousand Oaks, CA: Sage Publications.
Hanson, B., & Béguin, A. (2002). Obtaining a common scale for item response theory item parameters using separate versus concurrent estimation in the common-item equating design. Applied Psychological Measurement, 26, 3–24.
Holland, P. W., & Wainer, H. (Eds.). (1993). Differential item functioning. Hillsdale, NJ: Lawrence Erlbaum.
Kingsbury, G. G., & Weiss, D. J. (1983). A comparison of IRT-based adaptive mastery testing and a sequential mastery testing procedure. In D. J. Weiss (Ed.), New horizons in testing (pp. 257–286). New York: Academic Press.
Kolen, M. J., & Brennan, R. L. (2014). Test equating, scaling and linking: Methods and practices (3rd edition). New York: Springer.
Lawley, D. N. (1943). On problems connected with item selection and test construction. Proceedings of the Royal Society of Edinburgh, 61, 273–287.
Lazarsfeld, P. F. (1950). The logical and mathematical foundation of latent structure analysis. In S. A. Stouffer, et al. (Eds.), Measurement and prediction, Vol. 4 of Studies in Social Psychology in World War II, Chapter 10. Princeton, NJ: Princeton University Press.
Linacre, M. (2019). Winsteps Rasch measurement. [Computer program]. WinSteps.com
Lord, F. M. (1952). A theory of test scores. Psychometric Monograph, No. 7.
Lord, F. M. (1953a). The relation of test score to the trait underlying the test. Educational and Psychological Measurement, 13(4), 517–549.
Lord, F. M. (1953b). An application of confidence intervals and maximum likelihood to the estimation of an examinee’s ability. Psychometrika, 18, 57–75.
Lord, F. M. (1977). Practical applications of item characteristic curve theory. Journal of Educational Measurement, 14(2), 117–138.
Lord, F. M. (1980). Applications of item response theory to practical testing problems. Mahwah, NJ: Lawrence Erlbaum and Associates.
Lord, F. M., & Novick, M. R. (1968). Statistical theories of mental test scores. Reading, MA: Addison-Wesley.
Lord, F. M., & Wingersky, M. S. (1984). Comparison of IRT true-score and equipercentile observed-score “equatings”. Applied Psychological Measurement, 8, 453–461.
Luecht, R. M. (2014). Computerized adaptive multistage design considerations and operational issues. In D. Yan, A. A. von Davier & C. Lewis (Eds.), Computerized multistage testing: Theory and applications (pp. 69–83). London, UK: CRC Press/Taylor & Francis Group.
Luecht, R. M. (2015). Applications of item response theory: Item and test information functions for designing and building mastery tests. In S. Lane, M. Raymond & T. Haladyna (Eds.), Handbook of test development, 2nd edition (pp. 485–506). New York: Routledge.
Luecht, R. M. (2016). Computer-based test delivery models, data, and operational implementation issues. In F. Drasgow (Ed.), Technology and testing (pp. 179–205). New York: Routledge.
Luecht, R. M., Brumfield, T., & Breithaupt, K. (2006). A testlet assembly design for the uniform CPA Examination. Applied Measurement in Education, 19, 189–202.
Luecht, R. M., & Nungester, R. J. (1998). Some practical applications of computerized adaptive sequential testing. Journal of Educational Measurement, 35, 229–249.
Luo, X. (2020). Automated test assembly with mixed-integer programming: The effects of modeling approaches and solvers. Journal of Educational Measurement, 57, 547–565.

Masters, G. N. (1982). A Rasch model for partial credit scoring. Psychometrika, 47, 149–174.
Masters, G. N., & Wright, B. (1984). The essential process in a family of measurement models. Psychometrika, 49, 529–544.
McDonald, R. P. (1999). Test theory: A unified treatment. Mahwah, NJ: Lawrence Erlbaum Associates.
Mills, C. N., & Stocking, M. L. (1996). Practical issues in large-scale computerized adaptive testing. Applied Measurement in Education, 9, 287–304.
Mislevy, R. J. (1986). Bayes modal estimation in item response models. Psychometrika, 51, 177–195.
Mislevy, R. J., & Bock, R. D. (1983). BILOG: Analysis and scoring of binary items and one-, two-, and three-parameter logistic models. Chicago, IL: Scientific Software International.
Muraki, E. (1992). A generalized partial credit model: Application of an EM algorithm. Applied Psychological Measurement, 16, 159–176.
Neyman, J., & Scott, E. L. (1948). Consistent estimates based on partially consistent observations. Econometrica, 16, 1–22.
Orlando, M., & Thissen, D. (2000). Likelihood-based item-fit indices for dichotomous item response theory models. Applied Psychological Measurement, 24, 50–64.
Penfield, R. D., & Camilli, G. (2006). Differential item functioning and item bias. In C. R. Rao & S. Sinharay (Eds.), Handbook of statistics: Volume 26: Psychometrics (pp. 125–167). New York: Elsevier.
Petersen, N. S., Cook, L. L., & Stocking, M. L. (1983). IRT versus conventional equating methods: A comparative study of scale stability. Journal of Educational Statistics, 8, 137–156.
Rasch, G. (1960, 1980). Probabilistic models for some intelligence and attainment tests. Copenhagen: Danish Institute for Educational Research. Republished in 1980 with Foreword and Afterword by B. Wright, Chicago, IL: MESA Press/The University of Chicago.
Rasch, G. (1961). On general laws and meaning of measurement in psychology. Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability (pp. 321–333). Berkeley, CA: University of California Press.
R Core Team (2017). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. https://www.R-project.org/.
Samejima, F. (1969). Estimation of latent trait ability using a response pattern of graded scores. Psychometrika Monograph Supplement, No. 17.
Samejima, F. (1972). A general model for free-response data. Psychometrika Monograph Supplement, No. 18.
Samejima, F. (1977). Weakly parallel tests in latent trait theory with some criticisms of classical test theory. Psychometrika, 42, 193–198.
Sands, W. A., Waters, B. K., & McBride, J. R. (Eds.) (1997). Computerized adaptive testing: From inquiry to operation. Washington, DC: American Psychological Association.
Sinharay, S. (2016). Bayesian fit and model comparison. In W. J. van der Linden (Ed.), Handbook of item response theory, volume two: Statistical tools (pp. 379–394). New York: CRC Press/Chapman Hall.
Sinharay, S., & Johnson, M. S. (2012). Statistical modeling of automatically generated items. In M. J. Gierl & T. M. Haladyna (Eds.), Automatic item generation (pp. 183–195). New York: Routledge.
Stocking, M. L., & Lord, F. M. (1983). Developing a common metric in item response theory. Applied Psychological Measurement, 7, 201–210.
Swaminathan, H., & Gifford, J. A. (1982). Bayesian estimation in the Rasch model. Journal of Educational Statistics, 7, 175–191.

Swaminathan, H., & Gifford, J. A. (1985). Bayesian estimation in the two-parameter logistic model. Psychometrika, 50, 349–364.
Swaminathan, H., & Gifford, J. A. (1986). Bayesian estimation in the three-parameter logistic model. Psychometrika, 51, 589–601.
Thissen, D. (1982). Marginal maximum likelihood estimation for the one-parameter logistic model. Psychometrika, 47, 175–186.
Thissen, D. (1983, 1991). MULTILOG: Multiple category item analysis and test scoring using item response theory. Chicago, IL: Scientific Software International.
Thissen, D., & Steinberg, L. (1984). A response model for multiple-choice items. Psychometrika, 49, 501–519.
Thissen, D., & Steinberg, L. (1986). A taxonomy of item response models. Psychometrika, 51, 566–577.
Thissen, D., & Steinberg, L. (2020). An intellectual history of parametric item response theory models in the twentieth century. Chinese/English Journal of Educational Measurement and Evaluation, 1(1), Article 5.
Thurstone, L. L. (1925). A method of scaling psychological and educational tests. Journal of Educational Psychology, 16, 433–451.
Tucker, L. R. (1946). Maximum validity of a test with equivalent items. Psychometrika, 11, 1–13.
Urry, V. W. (1977). Approximations to item parameters of mental test models and their uses. Educational and Psychological Measurement, 34, 253–269.
van der Linden, W. J. (2005). Linear models for optimal test design. New York: Springer.
van der Linden, W. J. (2016). Introduction. Handbook of item response theory, volume one: Models (pp. 1–10). New York: Chapman & Hall/CRC Statistics in the Social and Behavioral Sciences Series.
van der Linden, W. J. (Ed.) (2016). Handbook of item response theory, volume three: Applications. New York: Chapman & Hall/CRC Statistics in the Social and Behavioral Sciences Series.
van der Linden, W. J., & Glas, C. A. W. (Eds.) (2010). Elements of adaptive testing. New York: Springer.
van der Linden, W. J., & Hambleton, R. K. (Eds.). (1997). Handbook of modern item response theory. New York: Springer.
Wainer, H., Bradlow, E. T., & Wang, X. (2007). Testlet response theory and its applications. New York: Cambridge University Press. doi:10.1017/CBO9780511618765.
Warm, T. A. (1978). A primer of item response theory. Technical Report 941078. (Available online from https://apps.dtic.mil/sti/pdfs/ADA063072.pdf).
Warm, T. A. (1989). Weighted likelihood estimation of ability in item response theory. Psychometrika, 54, 427–450.
Weiss, D. J., & Davidson, M. L. (1981). Test theory and methods. Annual Review of Psychology, 32, 629–658.
Wells, C. S., & Hambleton, R. K. (2016). Model fit with residual analysis. In W. J. van der Linden (Ed.), Handbook of item response theory, volume two: Statistical tools (pp. 395–424). New York: CRC Press/Chapman Hall.
Wingersky, M. S., Cook, L. L., & Eignor, D. R. (1987). Specifying the characteristics of linking items used for item response theory item calibration [Computer program]. ETS Research Report 87-24. Princeton, NJ: Educational Testing Service.
Wood, R. L., & Lord, F. M. (1976). A user’s guide to LOGIST. Research Memorandum 76-4. Princeton, NJ: Educational Testing Service.

Wood, R. L., Wingersky, M. S., & Lord, F. M. (1976). LOGIST: A computer program for estimating examinee ability and item characteristic curve parameters. Research Memorandum 76-6. Princeton, NJ: Educational Testing Service.
Wright, B. D. (1977). Solving measurement problems with the Rasch model. Journal of Educational Measurement, 14(2), 97–116.
Wright, B. D., & Linacre, M. (1983). MSCALE. [Computer program]. Chicago, IL: University of Chicago.
Wright, B. D., & Linacre, J. M. (1991). BIGSTEPS: Rasch analysis computer program. Chicago, IL: MESA Press.
Wright, B. D., & Linacre, M. (1998). WinSteps. [Computer program]. Chicago, IL: MESA Press, University of Chicago.
Wright, B. D., & Mead, R. J. (1978). BICAL: Calibrating rating scales with the Rasch model. Research Memorandum No. 23. Chicago, IL: Statistical Laboratory, Department of Education, University of Chicago.
Wright, B. D., & Panchapakesan, N. (1969). A procedure for sample-free item analysis. Educational and Psychological Measurement, 29, 23–48.
Wright, B. D., & Stone, M. H. (1979). Best test design. Chicago, IL: MESA Press.
Yen, W. M. (1984). Effects of local item dependence on the fit and equating performance of the three-parameter logistic model. Applied Psychological Measurement, 8, 125–145.
Yen, W. M. (1993). Scaling performance assessments: Strategies for managing local item dependence. Journal of Educational Measurement, 30, 187–213.
Zara, A. R. (1994, March). An overview of the NCLEX/CAT beta test. Paper presented at the meeting of the American Educational Research Association, New Orleans.
Zenisky, A. L., & Luecht, R. M. (2016). The future of computer-based testing. In C. S. Wells & M. Faulkner-Bond (Eds.), Educational measurement: From foundations to future (pp. 221–238). New York: The Guilford Press.

12 A HISTORY OF SCALING AND ITS RELATIONSHIP TO MEASUREMENT Derek C. Briggs1

Scaling is an activity with different meanings. For a mountaineer, it is synonymous with climbing; for the angler, it is synonymous with cleaning a fish. For the psychometrician, scaling would appear to be synonymous with measurement. But in what sense is this the case?

Let us define measurement, for the moment at least, in its most traditional sense, as the estimation of the ratio of a magnitude of a quantity to a unit of the same quantity. It follows from this that if something is measurable, it can be expressed as a real number. The interpretation of that number depends upon the scale on which it is expressed. If I tell you that the heights of the two closed books lying on the table next to me are 2.7 and 3.9 centimeters respectively, as long as you have some internalized referent for the length of a centimeter, you might immediately have some sense for the thickness of each of the two books as well as the difference between them. It would seem most sensible to use centimeters as the unit for the measuring scale because I am recording estimates of length by visual inspection using a simple graduated ruler. The effective range of my scale for measuring book thickness would span from some fraction of a centimeter to about 10 centimeters. If finer distinctions were desired, a different instrument and procedure would be needed, in which case it might be possible to express differences in terms of millimeters. Similar reasoning could be used to justify a choice of scale when measuring the duration of an event. Depending upon the nature of the event and the accuracy of the instrumental procedure, the scale could be expressed as a ratio of seconds, minutes, hours or days. The activity of “scaling” as I describe it here goes hand in hand with the activity of measurement in the sense that the meaning of the scale depends upon the choice of unit, and the utility of the scale depends upon the fineness of the distinctions that need to be made. 
The choice of scale may seem straightforward when measurement is practiced in the physical sciences, but this is only because so much work has been invested

264 Derek C. Briggs

behind the scenes to ensure that units for measurement have been standardized, and therefore do not depend upon the objects being measured, or the instrument being used to measure. For example, both the meter and the second are two of the seven base units in the International System of Units (SI). The second is the duration of 9,192,631,770 periods of the radiation corresponding to the transition between the two hyperfine levels of the ground state of the caesium-133 atom. In turn, the meter is defined relative to duration as the length of the path travelled by light in a vacuum, which is 1/299,792,458 of a second. The ability to realize and reproduce all the base and derived units of the SI in controlled experimental settings is the reason that all the different instruments used to measure length and duration can be calibrated to a common standard. In the physical sciences then, scaling is inextricable from measurement, and it also plays the important role of taking measurement out of the private domain of a scientific enterprise into the public domain of everyday use and interpretation. Frederic Lord opened his 1954 literature review of scaling written for the journal Review of Educational Research with the claim that even in educational and psychological contexts, scaling is “virtually indistinguishable” from measurement (Lord, 1954, 375). Was Lord right? This probably depends upon how one goes about answering the following questions: In what sense can an achievement test be conceptualized as a measurement procedure? What is the attribute of the student that is being measured? Does it exist on a continuum? At what point in the procedure does scaling take place? Does it happen before the student sits for the test or afterward? What is being scaled and in what unit? Is it the student that is to be located on the scale, the items on the test, or both? Finally, what are the properties of the scale? 
Does the scale only support comparisons between students or items with respect to their order on the attribute of interest, or does it also support comparisons in terms of distance?

In this chapter I will delve into some of the historical attempts that have been made to answer these questions in psychology and education. To this end I will introduce a framework motivated by Torgerson (1958) to distinguish between three major approaches to scaling. Historically, the oldest approach has its roots in the tradition of psychophysics and the pioneering work of Ernst Weber and Gustav Fechner in the mid-19th century. The psychophysics tradition represents a stimulus-centered approach to scaling. A second scaling approach can be traced back to Francis Galton’s late 19th century methodological contributions to the study of individual differences. The individual differences tradition represents a subject-centered approach to scaling. Finally, a third scaling approach was born from the attempt to locate both stimuli and subjects on a common numeric scale. Although the methods within this approach were only fully formalized between 1950 and 1970, the impetus for them was anticipated by the work of Louis Thurstone in the 1920s. This third approach represents a response model approach to scaling.

The sections of the chapter proceed as follows. In the next section I present a framework to facilitate an understanding of important distinctions between the

A History of Scaling 265

theory and methods of scaling. The following three sections consider the origins of the stimulus-based scaling approach with respect to Fechner’s paradigm of psychophysical measurement, the origins of the subject-based scaling approach with respect to Galton’s paradigm of relative measurement, and, briefly, the origins of the response model scaling approach of Louis Thurstone. In the last two sections of the chapter, I use examples from the more contemporary literature on the scaling of educational achievement tests to examine the extent to which measurement and scaling, in contrast to Lord’s assertion, are currently being positioned as distinct and potentially independent activities. I point out some conceptual tensions this has introduced.

A Conceptual Framework for Theory and Methods of Scaling

In 1950, the Social Science Research Council in the United States formed the Committee on Scaling Theory and Methods, composed of Harold Gulliksen (chair), Paul Horst, John Karlin, Paul Lazarsfeld, Henry Margenau, Frederick Mosteller and John Volkmann. Warren Torgerson, one of Gulliksen’s former graduate students, was invited by the committee to review and summarize material on psychological scaling. The resulting monograph became a comprehensive treatment of the topic, entitled Theory and Methods of Scaling².

A visual representation of a conceptual framework for scaling inspired by Torgerson’s book is depicted in Figure 12.1. The centerpiece of the framework features three scaling approaches that are possible in the context of the data available to a researcher when n subjects interact with m stimuli, and the results get recorded in an n by m response matrix. The goal of a stimulus-centered approach is to locate the m stimuli on a numeric scale. To accomplish this, each stimulus is judged repeatedly by the same subject, or independently by multiple subjects. The subject or subjects are selected for the purpose of discriminating amongst stimuli with respect to some targeted attribute, and as such are considered replications. In contrast, the goal of a subject-centered approach is to locate n subjects on a numeric scale. To the extent that each subject is presented with multiple stimuli, it is these stimuli that are considered replications, and the stimuli are selected for the express purpose of discriminating individual differences among subjects with respect to some targeted attribute. The third approach in Torgerson’s framework was new at the time of his book, and I describe it here as the response model approach.
In this approach, rather than attempting to control for differences in the ability of subjects to discriminate amongst stimuli, or for differences in the ability of stimuli to allow for discriminations amongst subjects, the goal is to use the full n by m set of responses to locate both subjects and stimuli on a common numeric scale. Each of these three scaling approaches can be further distinguished by different methods that have been invented to attach numbers to subjects, stimuli or both. Examples of specific methods associated with each of these three scaling approaches are provided in Figure 12.1 along with some citations to the authors known for first developing them.
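As a minimal sketch of how the same n by m response matrix is read differently under the first two approaches, the Python fragment below summarizes a small, made-up matrix of 0/1 responses by column (treating subjects as replications, as in a stimulus-centered approach) and by row (treating stimuli as replications, as in a subject-centered approach). The data, the function names, and the use of simple proportions are illustrative assumptions, not methods described by Torgerson.

```python
# Hypothetical 0/1 responses: rows are n = 3 subjects, columns are m = 4 stimuli.
responses = [
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [1, 1, 1, 0],
]


def stimulus_summaries(matrix):
    """Column proportions: subjects act as replications for each stimulus."""
    n_subjects = len(matrix)
    n_stimuli = len(matrix[0])
    return [sum(row[j] for row in matrix) / n_subjects for j in range(n_stimuli)]


def subject_summaries(matrix):
    """Row proportions: stimuli act as replications for each subject."""
    return [sum(row) / len(row) for row in matrix]


print(stimulus_summaries(responses))  # one number per stimulus
print(subject_summaries(responses))   # one number per subject
```

A response model approach would instead use every cell of the matrix at once to place rows and columns on one common scale, which is why it cannot be reduced to either set of marginal summaries.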

FIGURE 12.1 A conceptual framework for theory and methods of scaling based on Torgerson (1958)

[Figure not reproduced. Scaling theory panel: Type of Attribute (manifest data, latent categories, latent continuum); Kind of Measurement (fundamental, derived, defined by fiat); Scale Properties (order, distance, origin). Scaling methods panel: Stimulus-Centered (variability judgment methods, quantitative judgment methods); Subject-Centered (normative comparisons, criterion-referenced comparisons, linear and nonlinear score transformations); Response Model, both subject and stimulus (probabilistic, deterministic). Works cited in the figure: Fechner (1860); Thurstone (1925, 1927a, 1927b, 1927c, 1928); Thurstone & Chave (1929); Comrey (1950); Torgerson (1952, 1954); Stevens (1956, 1957, 1974); Shepard (1962); Bock & Jones (1968); Galton (1875, 1883, 1889); Pearson (1906); Hull (1922); Terman & Merrill (1937, 1960); McCall (1939); Gardner (1946, 1962); Kelley (1947); Flanagan (1951); Ebel (1962); Lindquist & Hieronymus (1964); Angoff (1971); Tucker (1948, 1952); Lord (1952, 1953); Rasch (1960); Birnbaum (1968); Andrich (1978); Guttman (1944); Coombs (1950); Suppes & Zinnes (1963); Luce & Tukey (1964); Krantz et al. (1971).]



Assumptions Motivating Different Theories of Scaling

The theoretical justification for different methods of scaling is a product of three different assumptions being made about (1) the type of attribute that is the target of measurement, (2) the kind of measurement procedure being invoked, and (3) the properties of the measurement scale that will result at the culmination of the procedure. Having a theoretical justification is important because it lends coherence to a given scaling method. It provides an answer to the question: under what conditions should approach X lead to intended outcome Y?

The type of attribute being measured and its hypothesized structure represents a first core assumption underlying any scaling approach. The terms attribute and property are often used interchangeably to refer to some characteristic of an object. Torgerson defined an attribute as a measurable property of a person; measurable in the sense that we believe that the attribute exists, and that it exists in an amount that can be gradated. He also defined the magnitude of an attribute as a specific amount of an attribute, a point along a continuum of points³. Now, when an extensive physical attribute such as length or mass is being measured, there is no distinction to be made as to whether the continuum of points is manifest or latent. This is because the instrument itself is an instance of the attribute, and the scale pertains to the defined units of length or mass. But for measurement procedures in psychology and education, the distinction between manifest data and two types of latent continua is quite important. Is the measurer interested solely in some numeric representation of manifest data (e.g., responses to test or survey items), or in some representation of a hypothetical attribute that acts as a causal agent in determining the manifest data?
If the attribute is to be located on a latent continuum, are the points along it to be conceptualized as the real numbers of a continuous quantity, or as the integers of a discrete quantity?

A second assumption underlying a given scaling approach pertains to the kind of measurement procedure under which a scale is being developed. By a measurement procedure, I mean the approach that will be taken to establish an instrument and/or scale that can then be used to produce measures, as opposed to the act of measuring once the instrument and scale have already been established. Torgerson defined measurement as the assignment of a number to an object to represent an attribute of that object, a conceptualization that falls somewhere between the standard definition of measurement I provided at the outset of this chapter, and an even more general definition popularized by Stevens as “the assignment of numerals to objects and events according to rule” (Stevens, 1946, 1951). To Torgerson, measurement was always specific to the attribute of an object, not the object itself, and hence, a requirement for measurement was the presumption of order. I will follow Torgerson in distinguishing between three kinds of measurement that could fall within his more general definition⁴. A fundamental measurement procedure is one in which the methods of the procedure are premised on a model


with falsifiable propositions about the attribute being measured. The term fundamental measurement had been originally introduced by the physicist Norman Campbell as a means by which numbers can be assigned to objects according to natural laws that do not presuppose measurement of any other variables (Campbell, 1920). Canonical examples of fundamental measurement are for the extensive attributes of length, mass and resistance. For these attributes it is possible to demonstrate that the attribute can be not only ordered, but additively deconstructed and reconstructed. Torgerson, however, used the term fundamental measurement more broadly than Campbell, such that it could apply to intensive properties in both the physical sciences (e.g., temperature and time) as well as in psychological and educational contexts (e.g., attitudes and abilities). What was necessary for this to be possible was a theory that could be established to give the attribute both a constitutive meaning (why some objects vary in the amounts of the attribute they contain) and an empirical meaning (the connection between this theorized variability and that which we can observe in the real world). The evolution in the scales and instruments used to measure temperature is often taken as a success story in this regard (see Sherry, 2011), and the early work in thermometry provides an example in which it is possible to establish the additivity of differences rather than levels. Torgerson’s more expansive use of fundamental measurement anticipated the conceptualization that would emerge from Suppes & Zinnes (1963) and Luce & Tukey (1964).

A fundamental measurement procedure can be contrasted with a derived measurement procedure, in which an attribute is measured through the establishment of laws that relate the attribute to other attributes that can be measured fundamentally (Campbell, 1920). Density, measured as the ratio of mass to volume, is a classic example.
Derived measurement procedures are rare to non-existent in the context of psychological attributes. But to the extent that a motivation for psychological measurement is to understand patterns of social behaviors, relationships and responses, there is no theoretical obstacle to posing hypotheses about social or psychological “laws” and establishing measurement procedures that are contingent on the truth of these laws. The problem is that it is much harder to separate the explanatory adequacy of the law from the adequacy of the primitives in the law and their measurement (see for example, Meehl, 1978).

The third kind of measurement is Torgerson’s notion of measurement by fiat.

This sort of measurement is likely to occur whenever we have a prescientific or common-sense concept that on a priori grounds seems to be important but which we do not know how to measure directly. Hence we measure some other variable or weighted average of other variables presumed to be related to it. (Torgerson, 1958; p. 22)

Measurement by fiat depends upon presumed relationships between observations and some concept of interest. In contrast to a fundamental measurement, to


the extent that an attribute is being measured, it may not necessarily possess a constitutive meaning. In fact, it may only be defined by its operational connection to observable data, a connection premised on a fairly weak theory that depends largely on the intuition of the measurer. Historically, mental ability tests tend to at least begin as examples of measurement by fiat, but the same can be said of early attempts to measure temperature with a thermoscope. A common justification for measurement by fiat is that it is useful in the sense that the resulting measures facilitate decisions and actions that would not be possible in their absence.

The third core assumption underlying a given scaling approach regards the intended properties of the measuring scale. That is, when numbers on the scale are attached to an object being measured, to what extent can numeric relationships amongst the objects be interpreted relative to the characteristics of order, origin and distance? A ratio scale requires all three characteristics, an interval scale requires order and distance, and an ordinal scale just order. In a ratio scale the magnitude of an attribute is expressed as some multiple of a unit of the same attribute. Numbers on a ratio scale have the most restrictive mathematical group structure. For such scales, only mathematical transformations based on a multiplicative constant are acceptable in order to maintain the same empirical meaning about the attribute. An interval scale is one for which differences in magnitudes along the scale can be shown to be equal, but for which the choice of a zero point is arbitrary. It allows for a ratio scale of differences (the difference between any two points on the scale can be compared as a ratio to a reference distance), but not a ratio scale of magnitudes. Only linear transformations to an interval scale will leave the information conveyed about differences on any two points along the scale with the same meaning.
An ordinal scale results from assigning numbers to objects to represent a common attribute among the objects that can be ordered. The numbers on an ordinal scale convey information about both equality and order, but differences in magnitudes among the numbers are not readily interpretable. Any order-preserving (i.e., monotonic) transformation applied to the numbers of the scale will retain the same information about order.

Assumptions being made about the type of attribute, kind of measurement and intended property of scale are not the only distinguishing features between scaling approaches. The approaches can also differ by the nature of the responses that are being collected from subjects (i.e., whether subjects are being asked to compare or categorize) and features of the experimental setup (what is being held constant, what is being allowed to vary). However, as we will see, it is in the three classes of assumptions just described that we can discern the most important historical distinctions between the stimulus- and subject-centered approaches to scaling. In the next two sections I will present the historical origins of these two approaches before turning to consideration of the response model approach. A key point I will make is that when the type of attribute of interest in a measurement procedure is assumed to exist on a continuum, and the desired scale property is ratio or


interval, then there is really no theoretical difference between measurement and scaling. However, the ability to evaluate assumptions about the structure of an attribute and the property of a scale depends upon the kind of measurement procedure being invoked. The combination of measurement by fiat and the intention of establishing a scale for manifest data can lead to a situation in which measurement and scaling are viewed as two distinct activities, and this perspective has exerted a strong influence on contemporary practices of test scaling.
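The permissible transformations described in this section for ratio, interval and ordinal scales can be checked numerically. The sketch below is only an illustration under stated assumptions: the scores are invented, and the three helper functions are hypothetical names chosen to stand for "ratio of magnitudes", "ratio of differences" and "order". It shows which kind of information survives multiplication by a constant, a linear transformation, and a nonlinear monotonic transformation.

```python
import math


def magnitude_ratios(values):
    """Each value as a multiple of the first value (ratio-scale information)."""
    return [v / values[0] for v in values]


def difference_ratios(values):
    """Each successive difference as a multiple of the first difference
    (interval-scale information)."""
    diffs = [b - a for a, b in zip(values, values[1:])]
    return [d / diffs[0] for d in diffs]


def rank_order(values):
    """Indices of the objects sorted from smallest to largest (ordinal information)."""
    return sorted(range(len(values)), key=lambda i: values[i])


def same(a, b):
    """True when two lists of numbers agree up to floating point error."""
    return all(math.isclose(x, y) for x, y in zip(a, b))


scores = [2.0, 4.0, 7.0, 12.0]               # hypothetical scale values
rescaled = [3.0 * s for s in scores]          # multiplicative constant only
shifted = [3.0 * s + 10.0 for s in scores]    # linear transformation
warped = [math.log(s) for s in scores]        # monotonic but nonlinear

print(same(magnitude_ratios(rescaled), magnitude_ratios(scores)))    # ratios kept
print(same(magnitude_ratios(shifted), magnitude_ratios(scores)))     # ratios lost
print(same(difference_ratios(shifted), difference_ratios(scores)))   # differences kept
print(same(difference_ratios(warped), difference_ratios(scores)))    # differences lost
print(rank_order(warped) == rank_order(scores))                      # order kept
```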

Gustav Fechner and a Stimulus-Centered Approach to Scaling

If we had to pin a date on the point in time, at least in Western history, that methods were introduced for the purpose of measuring and scaling a psychological attribute, a fairly good case can be made that this date should be 1860, the culmination of a 10-year program of research by Gustav Fechner in which he proposed the concept of measurement through psychophysics. Fechner introduced methods of experimental design and statistics that he used to link known measures of physical attributes to the “psychical” sensory responses he believed the stimuli could be shown to evoke. If the physical and the mental were two sides of the same coin, as Fechner believed, then if one was measurable, so was the other. What came to be known as “Fechner’s Law” related the level of a physical stimulus magnitude, X, to a level of sensory intensity Y with the equation Y = k log X.

To a great extent, Fechner’s Law was just a generalization of a previous empirical finding of one of his colleagues at the University of Leipzig, the physiologist Ernst Weber. In experiments Weber had conducted during the 1830s and 1840s, he had observed that when subjects are exposed to physical stimuli of two different magnitudes, the increment between magnitudes that is discernible is typically a constant fraction of the base stimulus. This finding, which today is still often described as “Weber’s Law,” helps to explain why consumers can be predicted to react to price increases very differently: relative to a base of $4, a $1 increase in the price of a latte is dramatic; relative to a base of $100, a $1 increase in the cost of a concert ticket would barely register.

More formally, let X represent some physical quantity (for example, weight) for which some finite set of magnitudes can be both observed and reproduced. Consider any pair of magnitude values, \(x_a\) and \(x_b\), which are compared and found to be just noticeably different such that \(x_b > x_a\).
The first way to express Weber’s finding is that for any two such magnitudes

\[ \frac{x_b}{x_a} = C, \]

which indicates that the ratio between two magnitudes that are just noticeably different is a numeric constant, C. If we subtract 1 from both sides of the equation, this can be re-expressed as


\[ \frac{x_b - x_a}{x_a} = C - 1. \]

A more general expression comes from defining \(\Delta x = x_b - x_a\) as the change in any value of X that is necessary before a difference in magnitude will be just noticeable by a human subject, and also by defining \(k = C - 1\) as another numeric constant that is a simple transformation of C. Weber’s finding then takes the form

\[ \mathrm{jnd}(x) = \Delta x = kx, \]

where jnd stands for a “just noticeable difference.” Weber’s interest had been in estimating and comparing values of k for sensory attributes such as touch, vision and hearing. Fechner took this one step further in proposing that these psychological attributes could be measured on a unique scale from the physical stimuli that motivated their expression.

Initially, and in general, it cannot be denied that the mental sphere is subject to quantitative considerations. After all, we can speak of a greater or lesser intensity of attention, of the vividness of images of memory and fantasy, and of clearness of consciousness in general, as well as of the intensity of separate thoughts. … Higher mental activity, therefore, no less than sensory activity, the activity of the mind as a whole no less than in detail, is subject to quantitative determination. (Fechner, 1860 [1966]; pp. 46–47)

If sensation intensity was quantitative, the scaling task was to establish measurement units for it. The goal of Fechner’s psychophysical research program was to demonstrate how this could be accomplished by establishing units for sensation scales in a variety of sensory domains with respect to jnd values. To do this, Fechner would conduct experiments in sensory discrimination, estimate the jnd values for multiple base magnitudes of physical stimuli, plot the jnds as a function of these magnitudes, and examine whether they could be fitted to a logarithmic curve. It was in the approach Fechner took to estimating jnd values that we see the first application of probability theory as a tool for psychological measurement.
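To see why a constant Weber fraction leads to a logarithmic sensation scale, the following sketch counts jnd-sized steps between two physical magnitudes and compares the count with a closed-form logarithm. It is a toy illustration, not Fechner's procedure: the Weber fraction k = 0.1, the starting magnitude x0, and the function names are all assumptions made for the example.

```python
import math


def jnd(x, k):
    """Weber's finding: the just noticeable difference at magnitude x is k * x."""
    return k * x


def count_jnd_steps(x0, x, k):
    """Count the successive jnd-sized increments needed to climb from x0 up to x."""
    steps, level = 0, x0
    while level + jnd(level, k) <= x:
        level += jnd(level, k)
        steps += 1
    return steps


def log_steps(x0, x, k):
    """Closed form: every step multiplies the magnitude by C = 1 + k,
    so the number of steps grows with the logarithm of x / x0."""
    return math.log(x / x0) / math.log(1.0 + k)


k = 0.1    # hypothetical Weber fraction
x0 = 1.0   # hypothetical starting magnitude
for x in (10.0, 100.0, 1000.0):
    print(x, count_jnd_steps(x0, x, k), round(log_steps(x0, x, k), 2))
```

Each tenfold increase in the physical magnitude adds roughly the same number of jnd steps, which is the sense in which sensation measured in jnd units is proportional to log X.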
In Fechner’s most famous experimental approach, he focused on weight discrimination through use of the “method of right and wrong cases” or the “method of constant stimulus” (Fechner, 1860; Stigler, 1986). The crux of the experiment was to present a subject with two cups that differed in weight by a multiple of some designated increment. For example, the increment might be one quarter of an ounce. One cup was always the “control” stimulus weight; the other cup was the “treatment” stimulus weight. At some point, as the increment


in weight distinguishing the control and treatment cups is increased, it would become obvious that one is heavier than the other. The challenge was to identify the precise magnitude of the weight increment before the difference was “just” noticeable. Fechner decided that the place to look for this was for weight combinations where, upon replications of the same comparison, a subject would draw different conclusions about which cup was the heavier. Why did subjects give different discriminations when presented with the same two cups? Fechner attributed this to measurement error in a very specific sense. Namely, Fechner posited that any time a person was exposed to a physical stimulus of a given magnitude, it triggered a corresponding sensation magnitude with both a fixed and random component. The random component was assumed to follow what was then known as the law of errors and is now known as a normal distribution. Therefore, any increasing sequence of physical stimulus magnitudes could be associated with a sequence of normally distributed sensation magnitudes with increasing means and constant standard deviations.

Figure 12.2 illustrates the distribution of (latent) outcomes if it were possible to replicate the same comparison between stimulus magnitudes, \(\{x_c, x_t\}\), an infinite number of times. In this figure, note that the horizontal axis does not represent the continuum of the physical stimulus, but rather the unknown scale of psychological sensation that the physical stimulus elicits. When asked to compare control and treatment physical stimuli, where in this example \(x_t > x_c\), the average location of the respective psychological sensations would be \(\mu_c\) and \(\mu_t\). According to Weber’s law, the distance between \(\mu_c\) and \(\mu_t\) is expected to be a proportion of the distance between \(x_c\) and \(x_t\). The greater the distance, the less likely we are to observe an incorrect discriminatory judgment, on average.
The spread of each distribution represents the sensitivity of an individual’s sensation to physical magnitudes; the greater the sensitivity, the narrower the spread. A person

FIGURE 12.2 Hypothetical results from a comparison of weight with magnitudes \(x_c\) and \(x_t\) over replications. The scale of the x-axis is that of sensation intensity, not physical magnitude. [Figure not reproduced; the x-axis marks the mean sensations \(\mu_c\) and \(\mu_t\).]


makes an error in judgment because sometimes, the sensation elicited by stimulus \(x_c\) will be greater in magnitude than the sensation elicited by stimulus \(x_t\). In short, the sensed difference can be cast as the difference between two normally distributed random variables. Fechner’s approach was to use the normal cumulative probability distribution function to model differences in physical stimulus magnitudes, which are observable, as a function of the mean and variance of sensation intensity differences, which are not. Using modern notation,

\[ P_{tc} = P(x_t > x_c) = \frac{1}{\sigma_{tc}\sqrt{2\pi}} \int_{0}^{\infty} \exp\left[ -\frac{1}{2} \left( \frac{y - \mu_{tc}}{\sigma_{tc}} \right)^{2} \right] dy \]

where \(\mu_{tc}\) is the mean difference in sensation intensity, and \(\sigma_{tc}\) is the standard deviation of sensation differences. In his experiments, Fechner would take, for N replications of any pairing \(\{x_c, x_t\}\), the proportion of times \(x_t\) was judged to be larger than \(x_c\) as an estimate of \(P_{tc}\), and then, assuming that all \(\sigma_{tc}\) were equal and setting them to 1, he would use the inverse of the normal cdf to compute an estimate of \(\mu_{tc}\). After doing this for different permutations of \(\{x_c, x_t\}\) pairings, Fechner could establish the relationship between observed physical differences and latent psychological ones, and then use this to get an interpolated estimate of the jnd at the location of \(\mu_{tc}\) where \(P_{tc} = .75\).

If carried out in full, Fechner’s approach defined a scale for the measurement of sensation intensity in terms of jnd units. With this scale in place, and an established logarithmic relationship between jnds and levels of the physical quantity, then for any known value of the physical quantity, an associated sensory intensity magnitude could be measured relative to the absolute threshold, that is, as the number of jnd increments from the origin, defined to be 0 (in the absence of any physical stimulus). Relative to the framework for scaling shown in Figure 12.1, it can be seen that Fechner was assuming that the attribute of sensation intensity is a continuous quantity, that it is amenable to a procedure of derived measurement vis-à-vis its logarithmic relationship with physical magnitudes, and that it can be expressed on a ratio scale comprised of jnd units.

Fechner’s specific approach was one of two major variants that differ in the way they go about establishing a measurement unit for the scale being constructed. Fechner was the originator of what Torgerson described as a “variability judgment” method of scaling, which is characterized by asking a subject or subjects to differentiate between stimuli that are presented in pairs.
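The estimation steps just described can be sketched in a few lines of Python using the standard library's `statistics.NormalDist`. This is a hedged illustration rather than a reconstruction of Fechner's own calculations: the weight increments, the judgment counts, the function names, and the use of straight-line interpolation between neighboring estimates are all assumptions made for the example.

```python
from statistics import NormalDist

STANDARD_NORMAL = NormalDist(mu=0.0, sigma=1.0)  # sigma_tc fixed at 1, as in the text


def estimate_mu(times_judged_heavier, replications):
    """Estimate mu_tc as the inverse normal cdf of the observed proportion P_tc."""
    proportion = times_judged_heavier / replications
    # inv_cdf raises StatisticsError for proportions of exactly 0 or 1.
    return STANDARD_NORMAL.inv_cdf(proportion)


def interpolate_increment(increments, mus, target_mu):
    """Linearly interpolate the physical increment at which mu_tc equals target_mu."""
    points = list(zip(increments, mus))
    for (x0, m0), (x1, m1) in zip(points, points[1:]):
        if m0 <= target_mu <= m1:
            return x0 + (target_mu - m0) * (x1 - x0) / (m1 - m0)
    raise ValueError("target_mu lies outside the range of the estimates")


# Hypothetical data: weight increments (ounces) and the number of times, out of
# 100 replications each, that the treatment cup was judged heavier.
increments = [0.25, 0.50, 0.75, 1.00]
judged_heavier = [58, 69, 79, 88]

mus = [estimate_mu(count, 100) for count in judged_heavier]
jnd_mu = STANDARD_NORMAL.inv_cdf(0.75)  # location of mu_tc where P_tc = .75
jnd_estimate = interpolate_increment(increments, mus, jnd_mu)
print(round(jnd_estimate, 3))
```

With these made-up counts the interpolated jnd falls between the half-ounce and three-quarter-ounce increments, because those are the two increments whose observed proportions bracket .75.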
Across replications of these comparisons (by a single subject or a sample of subjects), a unit of measurement is established indirectly by the way that “errors” in judgment are being modeled. Louis Thurstone eventually reconceptualized Fechner’s Law as a special case of a more general law of comparative judgment (1927a, 1927b, 1927c). In contrast to classical psychophysics, which had taken the comparison of two physical


magnitudes as a starting point for measurement, under Thurstone’s approach even qualitative stimuli could be used as a basis for measurement, so long as the stimuli could be ordered. Thurstone saw that mathematically, there was little difference between modeling the probability of discriminating between the weight of two objects or, say, the legibility of two samples of handwriting⁵. In the former case, the order of the stimuli was known in advance; in the latter it would need to be inferred. But in either case, if errors in judgment were normally distributed, a scale with units defined by standard deviations could be established. If the measuring scales for physical attributes were composed of objective units, the scales Thurstone envisioned would be intentionally composed of subjective units, because they were defined by the variability of errors in human subjective judgments. As such, Thurstone would show that Fechner’s approach could be viewed as a special case of a law of comparative judgment that assumed equal standard deviations of subjective errors irrespective of the motivating stimuli⁶.

A second variant of the stimulus-centered approach involves a subject or subjects making quantitative judgments when presented with stimuli. For example, given three different tones A, B and C, a subject could be asked to adjust the pitch of stimulus A so that it is halfway between B and C, or, if given stimulus A as a reference, a subject might be asked to estimate the ratio by which B and C are greater or lesser in magnitude (e.g., Stevens, 1956). These judgments are represented numerically and averaged across replications, and it is these averages that become the basis for a scale’s unit of measurement. In the variability judgment approach, it was only assumed that subjects could discriminate order. In the quantitative judgment approach, it was assumed that subjects can discriminate both order and magnitude.
Rather than defining a unit on the basis of errors in judgment, when using this method, the researcher attempts to minimize errors in judgments by averaging over replications⁷. Both Thurstone and Stevens introduced methods of scaling that embraced Fechner’s ethos that psychological properties could be measured, while either altering or loosening some of the assumptions Fechner had made in the process. The purpose of establishing scales with ratio or at least interval properties remained essential for all methods within a stimulus-centered approach.
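To make the variability judgment idea concrete, the sketch below applies what is commonly taught as the equal-variance ("Case V") simplification of Thurstone's law of comparative judgment to a small paired-comparison table. It is an assumption-laden illustration: the proportions for three handwriting samples are invented, the diagonal is set to .5 by convention, and the function name `case_v_scale` is hypothetical. The resulting scale values are in standard deviation units with an arbitrary origin.

```python
from statistics import NormalDist, mean


def case_v_scale(proportions):
    """Scale values from a square table in which proportions[i][j] is the share of
    judgments where stimulus j was ranked above stimulus i (equal-variance case)."""
    z = NormalDist().inv_cdf  # inverse of the standard normal cdf
    m = len(proportions)
    # Average the normal deviates in each column to get one value per stimulus.
    raw = [mean(z(proportions[i][j]) for i in range(m)) for j in range(m)]
    lowest = min(raw)
    return [value - lowest for value in raw]  # put the origin at the lowest stimulus


# Hypothetical judgments of legibility for three handwriting samples.
legibility = [
    [0.50, 0.70, 0.90],
    [0.30, 0.50, 0.75],
    [0.10, 0.25, 0.50],
]

print([round(value, 2) for value in case_v_scale(legibility)])
```

Only the order of the stimuli is asked of the judges; the distances between scale values come entirely from how often the judgments disagree.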

Francis Galton and a Subject-Centered Approach to Scaling

In the psychophysics tradition, measurement was premised on the validity of natural laws thought to govern the functioning of human sensory attributes. The basis for the units of sensation in Fechner’s original work came from the observation of within-person variability, which Fechner modeled according to the normal distribution. Although variability across persons in their sensory discrimination was recognized, psychophysicists showed little interest in describing this variability across some population of respondents or attempting to explain the causes of this variability. This became the focus for Francis Galton, who, spurred


by his interest in finding a mathematical law underlying the mechanism of human heredity, invented and popularized practical methods for measuring human attributes and placing them on one of two types of scales: those that were absolute in the sense that they could be demarcated by known physical units (e.g., length, speed, weight) and those that were relative in the sense that statistical demarcations could be inferred in terms of standard deviations from a mean. Galton’s innovations became the basis for a subject-centered approach to scaling, and ultimately led to the development of the field of educational measurement.

Once again, we see the important role played by probability theory and the normal distribution. Much like the Belgian astronomer, Adolphe Quetelet, who had been the first to apply it as a tool for “social physics” in 1835, Galton was struck by the ubiquity of the normal distribution. However, Galton did even more than Quetelet to popularize its application, arguing that it applied not only to physical attributes, but to psychological ones as well. Galton’s reasoning was that not only are both types of attributes heritable, each can be conceptualized as a process of intergenerational inheritance, something that could be effectively modeled as the result of combining many small independent random events, making their distributions the deducible outcome of the central limit theorem⁸. Galton first introduced the notion of a “statistical scale” as a contrast to an absolute scale in his 1875 article On Statistics by Intercomparison with Remarks on the Law of Frequency of Error.

The process of obtaining mean values etc. now consists in measuring each individual with a standard that bears a scale of equal divisions, and afterwards in performing certain arithmetical operations upon the mass of figures derived from these numerous measurements.
I wish to point out that, in order to procure a specimen having, in one sense, the mean value of the quality we are investigating, we do not require any one of the appliances just mentioned: that is, we do not require (1) independent measurements, nor arithmetical operations; we are able to dispense with standards of reference, in the common application of the phrase, being able to create and afterwards indirectly to define them; and (2) it will be explained how a rough division of our standard into a scale of degrees may not infrequently be effected. Therefore it is theoretically possible, in a great degree, to replace the ordinary process of obtaining statistics by another, much simpler in conception, more convenient in certain cases, and of incomparably wider applicability. Nothing more is required for the due performance of this process than to be able to say which of two objects, placed side by side, or known by description, has the larger share of the quality we are dealing with. (Galton, 1875; p. 34)

In other words, if a variable was known to be normally distributed, so long as it would be possible for a qualified external observer to rank order people with respect to some attribute of interest, these ranks could be converted into a

276 Derek C. Briggs

percentile estimate (i.e., the proportion of people with lower values of the variable), and using the inverse of the normal cumulative distribution function, this percentile could be located on a scale of standard deviation units. Even in situations where “absolute” measurement on a known scale was possible, it would only be necessary to take the values at the median and the 25th or 75th percentile in order to convert the statistical scale back into the units of the original scale. In situations where no absolute measurement was possible, “relative” measurement would be the next best thing and suffice for most practical purposes. A knowledge of the distribution of any quality enables us to ascertain the Rank that each man holds among his fellows, in respect to that quality. This is a valuable piece of knowledge in this struggling and competitive world, where success is to the foremost, and failure to the hindmost, irrespective of absolute efficiency. A blurred vision would be above all price to an individual man in a nation of blind men, though it would hardly enable him to earn his bread elsewhere. (Galton, 1889; p. 36) Before the turn of the 20th century, Galton had introduced and applied concepts and methods that remain at the core of contemporary educational scaling practices: representing a frequency distribution with respect to the normal ogive and percentiles (two terms he invented and popularized), the normalization of ranks, normative scale interpretations, and criterion-referenced scale interpretations. Let us consider an example of the latter in more detail. Galton had speculated that human intelligence might be associated with the ability to reproduce physical images from memory, and that those capable of doing so could create a mental image that had illumination, definition and color that closely matched the original. 
To explore this theory, he devised a survey pertaining to mental imagery and administered it to a purposeful sample of 100 men from a variety of professional walks of life. It was the results from this second survey that he would report upon in his book Inquiries into Human Faculties. Those with experience in such endeavors can surely empathize with Galton’s sentiment that “there is hardly any more difficult task than that of framing questions which are not likely to be misunderstood, which admit of easy reply, and which cover the grounds of inquiry” (Galton, 1883; p. 84). Galton created three distinct items associated with a common task in which respondents were asked to create a mental image. Both the common task and the items are depicted in Figure 12.3. On each of the three items, subjects were asked to introspectively evaluate the quality of the images they had produced. Galton then took these responses, placed them into the three ordered categories of low, mediocre and high, and within each of these categories delineated them further in ranks of ascending order. In a last step, assuming that the attribute he had identified with respect to the quality of a mental image followed a normal distribution, he

A History of Scaling 277

Before addressing yourself to any of the Questions on the opposite page, think of some definite object—suppose it is your breakfast-table as you sat down to it this morning— and consider carefully the picture that rises before your mind’s eye.

1. Illumination. Is the image dim or fairly clear? Is its brightness comparable to that of the actual scene?
2. Definition. Are all the objects pretty well defined at the same time or is the place of sharpest definition at any one moment more contracted than it is in a real scene?
3. Colouring. Are the colours of the china, of the toast, bread-crust, mustard, meat, parsley, or whatever may have been on the table, quite distinct and natural?

FIGURE 12.3 Galton’s survey items for mental visualization

could demonstrate how through his method of relative measurement, rankings could be converted into a statistical scale. Table 12.1 is my representation of the scale Galton provided as an example for the ability to create a mental image that was clearly illuminated. Notice that although Galton was leveraging his assumption that this attribute was normally distributed to locate his respondents on a statistical scale, he was establishing the interpretation of the scale with respect to qualitative descriptors he had inserted at designated intervals of the distribution. Here we see a concrete realization of Galton’s approach for both assembling a statistical scale through relative measurement, while also seeking out reference points along the scale in a manner that was a rough approximation of the paradigm of measurement in thermometry. Relative to the scaling framework shown in Figure 12.1, it can be seen that Galton was assuming that psychological attributes exist on a continuum with a normal distribution, that they are measurable by fiat, and that they can be expressed on an interval scale. Although Galton is deservedly much more famous for his discovery of correlational methods as a tool for studying individual differences (and also deservedly notorious for his fervent promotion of eugenics) he might not have arrived at the insight needed to derive an index of correlation without his practice of converting even physical measurements into units defined by standard deviations (Briggs, in press; Bullmer, 2004; Clauser, Ch. 8, this volume; Stigler, 1986). The scaling of mental tests began, and has to a large extent remained, an instantiation of Galton’s subject-centered approach to relative measurement. After a test has been administered subjects are scored for the quality of their responses and these scores are summed. Subjects are then either located directly on this raw score scale or a new scale is created by imposing a linear or nonlinear transformation. 
In the latter case, Galton’s approach of relative measurement was to impose what amounted to a nonlinear transformation to convert ranks into normal unit deviates, and variants of this basic approach were taken up by Thorndike (1910), Thorndike, Woodyard, Cobb & Bregman (1927), and Thurstone (1938). In other instances,


TABLE 12.1 Statistical scale with descriptive reference points for illumination of visualized mental image. Based on Galton, 1883, p. 93

Z | %ile | Descriptive Scale Anchor
3 | 99.7 | The image once seen is perfectly clear and bright.
1 | 84 | I can see my breakfast table or any equally familiar thing with my mind’s eye quite as well in all particulars as I can do if the reality is before me.
.71 | 75 | Fairly clear; illumination of actual scene is fairly represented. Well defined. Parts do not obtrude themselves, but attention has to be directed to different points in succession to call up the whole.
0 | 50 | Fairly clear. Brightness probably at least from one-half to two-thirds of the original. Definition varies very much, one or two objects being much more distinct than the others, but the latter come out clearly if attention be paid to them.
-.71 | 25 | Dim, certainly not comparable to the actual scene. I have to think separately of the several things on the table to bring them clearly before the mind’s eye, and when I think of some things the others fade away in confusion.
-1 | 16 | Dim and not comparable in brightness to the real scene. Badly defined with blotches of light; very incomplete; very little of one object is seen at one time.
-3 | 0.3 | I am very rarely able to recall any object whatever with any sort of distinctness. Very occasionally an object or image will recall itself, but even then it is more like a generalized image than an individual one. I seem to be almost destitute of visualizing power as under control.

Note: Z represents standard deviation units; %ile represents the percent of the distribution below this location on the scale.

raw test scores were linearly transformed into some form of standard deviation units (Hull, 1922; McCall, 1939). Whether the scale was based on linear or nonlinear transformations, the interpretation of scale values was primarily normative in nature, and the earliest literature on the scaling of educational tests reflects this focus (Flanagan, 1951; Gardner, 1962; Lindquist & Hieronymus, 1964; Terman & Merrill, 1937, 1960). Interestingly, it was not until Ebel (1962) and Angoff (1971) that we find some attempt to form a scale with criterion-referenced anchors along the lines of Galton’s mental imagery example.
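
The two kinds of transformation discussed in this section, the nonlinear conversion of ranks into normal unit deviates and the linear conversion of raw scores into standard deviation units, can be illustrated with a minimal Python sketch. It is not a reconstruction of any historical procedure: the function names are hypothetical, and the mid-rank percentile convention, (rank - 0.5) / n, is an assumption made here so that no percentile falls exactly at 0 or 1. The last function shows the idea of recovering original units from the median and one quartile, assuming the attribute is normally distributed.

```python
from statistics import NormalDist, mean, pstdev

STD_NORMAL = NormalDist()  # mean 0, standard deviation 1


def normalized_rank_scores(ranks):
    """Nonlinear route: ranks (1 = lowest) -> percentiles -> normal deviates.

    Assumption: mid-rank percentiles (rank - 0.5) / n, chosen only so the
    inverse normal CDF is never evaluated at 0 or 1.
    """
    n = len(ranks)
    return [STD_NORMAL.inv_cdf((r - 0.5) / n) for r in ranks]


def linear_standard_scores(raw_scores):
    """Linear route: express each raw score in standard deviation units."""
    m = mean(raw_scores)
    s = pstdev(raw_scores)  # population SD of the group being scaled
    return [(x - m) / s for x in raw_scores]


def back_to_original_units(z, median, upper_quartile):
    """Map a normal deviate back to original units from two reference values.

    Under normality the 75th percentile lies about 0.6745 SDs above the
    median, so the gap between the two values implies the SD.
    """
    z75 = STD_NORMAL.inv_cdf(0.75)
    sd = (upper_quartile - median) / z75
    return median + z * sd


if __name__ == "__main__":
    print(normalized_rank_scores([1, 2, 3, 4, 5]))
    print(linear_standard_scores([12, 15, 15, 18, 25]))
    print(back_to_original_units(1.0, median=100.0, upper_quartile=110.0))
```

The first function depends only on order, so any monotone change to the underlying judgments leaves its output unchanged; the second preserves the spacing of the raw scores. That difference is what separates the nonlinear and linear approaches described above.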

Thurstone, Invariance and Response Model Methods of Scaling

Thurstone was the first real bridge between the subject- and stimulus-centered approaches to scaling. Within a remarkable four-year span between 1925 and 1929, he introduced an absolute method for test scaling, a quantitative judgment method of attitude scaling, and a variability judgment method for scaling qualitative stimuli. In the Thurstonian approach, even when the ultimate objects of measurement were subjects as opposed to stimuli, the measurement procedure


was conceptualized as a two-stage process. In a first stage, the stimuli (e.g., intelligence test items in Thurstone, 1925, and attitude statements in Thurstone, 1928b) were placed onto an absolute scale of difficulty. In the second stage, the resulting scale was used as the instrument for measuring some designated attribute of a person (e.g., intelligence, attitude toward prohibition). The first stage was scale construction, the second stage was application of the scale to produce numeric measures. Relative to the framework in Figure 12.1, Thurstone was always consistent in assuming, prior to construction of any scale, the existence of one or more psychological attributes that could be represented on a quantitative continuum. It was a continuum that was distinct from the stimuli to which subjects could be exposed, and the judgments they could render. Thurstone also made explicit that the goal of his scaling enterprises was to construct a scale that could satisfy requirements for order, distance and origin, writing, for example, that “The whole study of intelligence measurement can hardly have two more fundamental difficulties than the lack of a unit of measurement and the lack of an origin from which to measure!” (Thurstone, 1928a; p. 176). Another defining feature of the Thurstonian approach to scaling and measurement was his emphasis on the condition that the scale must be invariant to the specific conditions of its construction. One of Thurstone’s more explicit discussions of this condition came in the context of his scaling method of equal appearing intervals for the purpose of measuring attitudes. The scale must transcend the group measured. One crucial experimental test must be applied to our method of measuring attitudes before it can be accepted as valid. A measuring instrument must not be seriously affected in its measuring function by the object of measurement. 
To the extent that its measuring function is so affected, the validity of the instrument is impaired or limited. If a yardstick measured differently because of the fact that it was a rug, a picture, or a piece of paper that was being measured, then to that extent the trustworthiness of that yardstick as a measuring device would be impaired. Within the range of objects for which the measuring instrument is intended its function must be independent of the object of measurement. We must ascertain similarly the range of applicability of our method of measuring attitude. It will be noticed that the construction and the application of a scale for measuring attitude are two different tasks. If the scale is to be regarded as valid, the scale values of the statements should not be affected by the opinions of the people who help to construct it. This may turn out to be a severe test in practice, but the scaling method must stand such a test before it can be accepted as being more than a description of the people who construct the scale. (Thurstone, 1928b; p. 228) Because Thurstone’s approach was cast within a statistical modeling framework in which it was possible to either corroborate or falsify the invariance criterion, it


can be viewed as an instance of what Torgerson had in mind in his broadened conceptualization of a fundamental measurement procedure. Falsification of the invariance criterion could call into question both the interval or ratio property of the scale that had been constructed, as well as the assumption that the attribute in question could be represented on a unidimensional continuum.

By the early 1950s, a novel approach to scaling had just recently been introduced that suggested an even more direct bridge between subject- and stimulus-centered approaches. In a response model approach, when presented with a stimulus and asked to either render a quantitative judgment or make a comparison, the response that is observed from a subject depends both upon the location of the stimulus on a hypothetical continuum, and on the distance of the subject from this location. The first and most famous instance of a deterministic model for scaling both subjects and stimuli was Louis Guttman’s scalogram analysis (Guttman, 1944, 1950). The first examples of probabilistic models that could be used to scale both subjects and stimuli were Lazarsfeld’s latent distance model (Lazarsfeld, 1950), and the normal-ogive models introduced by Tucker (1948, 1952) and Lord (1952, 1953). This would set the stage for Birnbaum’s contribution of the logistic forms of item response theory models (Birnbaum, 1968). For all their potential to simultaneously scale both subjects and stimuli, the response model approaches presented significant challenges. Because the deterministic modeling approach introduced by Guttman made no allowance for measurement error, it would almost never entirely fit the data to which it was applied, and when it did not, there was little basis for distinguishing model deviations that threatened “scalability” from deviations that could be attributed to chance. 
Such distinctions were theoretically possible for the probabilistic models, but in an era well before the personal computer, the estimation of the person and item parameters seemed to present an intractable problem. Another challenge lay in evaluating scale properties. Where such a task was non-trivial for stimulus-based scaling (hinging upon, for example, demonstrations of transitivity and additivity of differences), in a response model approach the task became two-dimensional. That is, a demonstration of order among stimuli would require, for example, transitivity of stimulus relationships for all unique groupings of subjects along an underlying continuum, and vice versa. The conditions for a new type of fundamental measurement along these lines would soon be elucidated under a deterministic axiomatic framework by Luce & Tukey (1964), and the contributions of Suppes, Luce, Krantz and Tversky would lead to the three-volume Foundations of Measurement series with the first volume published in 1971. Probabilistic response models that were consistent with both Luce and Tukey’s notion of a scale premised on additive conjoint measurement, and Thurstone’s criterion of invariance, were Rasch’s Poisson and logistic models, introduced in Probabilistic Models for Some Intelligence and Attainment Tests (Rasch, 1960)9. Greater detail on these response model approaches to the simultaneous scaling of both subjects and stimuli is outside the scope of this chapter, but see Coombs


(1964), Andrich (1978), Wright (1997), and Jones & Thissen (2006). What is important to appreciate is that by the mid-20th century, an expansive literature on scaling was available, and during the 1950s in particular, one can see considerable effort being taken to establish some taxonomies for different scaling approaches, and to develop a theoretical framework by which the approaches could be compared and distinguished. I will now juxtapose this history with more contemporary practices in educational scaling, in which measurement and scaling are often viewed as distinct activities.
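
The contrast drawn in this section between a deterministic and a probabilistic response model can be made concrete with a small Python sketch. It is an illustration under stated assumptions rather than an account of how any of the cited authors computed their scales: the function names are hypothetical, the Guttman rule is written as "succeed whenever the person's location is at or above the item's location," and the Rasch logistic model is written in its standard form, in which the probability of success depends only on the difference between person location (theta) and item location (b).

```python
import math


def guttman_response(theta, b):
    """Deterministic rule: success whenever the person is at or above the item."""
    return 1 if theta >= b else 0


def rasch_probability(theta, b):
    """Rasch logistic model: P(success) is a function of theta - b only."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))


def log_odds(p):
    """Natural log of the odds p / (1 - p)."""
    return math.log(p / (1.0 - p))


def item_log_odds_gap(theta, b_easy, b_hard):
    """Difference in log-odds of success on two items for one person.

    Under the Rasch model this equals b_hard - b_easy for every theta,
    which is one way to express the invariance of item comparisons.
    """
    return (log_odds(rasch_probability(theta, b_easy))
            - log_odds(rasch_probability(theta, b_hard)))


if __name__ == "__main__":
    items = [-1.0, 0.0, 1.5]  # hypothetical item locations
    # A person at 0.5 passes every item located at or below 0.5.
    print([guttman_response(0.5, b) for b in items])
    # The item comparison does not depend on who is responding.
    for theta in (-2.0, 0.0, 2.0):
        print(theta, item_log_odds_gap(theta, items[1], items[2]))
```

Because the deterministic rule admits no error, a single unexpected response contradicts it, whereas the probabilistic model assigns such a response a nonzero probability. The constant log-odds gap across values of theta is the property that connects the logistic model to the invariance condition discussed earlier in the section.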

The More Recent History of Educational Scaling

Based on the history we have covered so far, it would seem that Lord was right when he claimed that scaling was virtually indistinguishable from measurement. We need to cover somewhat different territory to uncover some pressure points on this position, and to that end, we will jump ahead to the 4th edition of the edited volume Educational Measurement, published in 2006. Within this book, the chapter “Scaling and Norming” by Michael Kolen provides a valuable compendium of different methods that have been applied as part of the American achievement testing industry to place the results from a test administration onto an interpretable scale. Strategies are presented for placing test scores onto a scale metric that makes it possible to compare subjects either normatively relative to other students, or in a criterion-referenced manner with respect to the content of the test. An emphasis is placed on using information about score precision to decide upon an adequate unit of measurement. Finally, attention is devoted to the use of a score scale to convey information about growth over time. Relative to the history of scaling we have covered so far, a remarkable feature of this chapter, the one that preceded it in the 3rd edition of Educational Measurement (Peterson, Kolen & Hoover, 1989) and the book Test Equating, Scaling and Linking (Kolen & Brennan, 2004), is the extent to which methods of scaling specific to the particular context of educational achievement tests have been decoupled from any broader theoretical framework. When situated within the Torgerson-inspired framework depicted in Figure 12.1, the methods presented in these three influential publications reside exclusively within the subject-centered approach to scaling. 
What is being assumed about the relationship between scaling methods and the kind of measurement envisioned, the type of attribute being measured, and the desired properties of the resulting scale with respect to order, distance and origin? One can only guess because these sorts of questions are never taken up10. No distinction is made between a subject- or stimulus-centered approach to scaling. Item response theory, which at the time of Torgerson’s book was just emerging as a novel response model approach combining subject- and stimulus-centered scaling, is cast as just another method for computing a subject-specific “raw” score. To the extent that there is an underlying theory of scaling that motivates the methods being presented, it is surely that of practicality. Namely, if


a scaling procedure facilitates the interpretation that was intended—while also discouraging interpretations that were not—it can be considered successful. One notable common feature of Peterson, Kolen & Hoover (1989), Kolen & Brennan (2004) and Kolen (2006) is the affiliation of the contributing authors with the University of Iowa and the various testing programs founded by Everett Franklin Lindquist. All three include as a credo of sorts the following quote from Lindquist, pulled from a short commentary he had contributed to an invited session on educational scaling hosted at the Educational Testing Service in 1952. A good educational achievement test must itself define the objective measured. The method of scaling an educational achievement test should not be permitted to determine the content of the test or to alter the definition of the objectives implied by the test. From the point of view of the tester, the definition of the objective is sacrosanct; he has no business monkeying with that definition. The objective is handed down to him by those agents of society who are responsible for decisions concerning educational objectives, and what the test constructor must do is attempt to incorporate that definition as clearly and as exactly as possible in the examination that he builds. (Lindquist, 1953; p. 35) In all three publications, after this credo is presented, the reader is informed that “scaling methods that, for example, involve removing items from a test that do not fit a particular statistical model will not be considered here11.” Some context for the Lindquist credo being invoked in these publications is in order12. Lindquist was a former high school teacher and district superintendent from the small town of Gowrie, Iowa. By 1927 he had completed a doctoral thesis on the reliability and validity of written compositions, after taking an interest in educational testing as part of his graduate studies at the University of Iowa. 
Just one year later, in 1928 he earned a faculty appointment, and one of Lindquist’s first responsibilities was to design a test that could be used for an annual state academic contest that became known by locals as the “brain derby.” The experience led him to question the value of a test targeted solely to the state’s academic elite, and in 1935 Lindquist, with the help of a talented staff he had recruited to Iowa, introduced the Iowa Test of Basic Skills for elementary grade children. By the early 1940s Lindquist had also created the Iowa Test of Educational Development as a generalized test of educational skills expected of high school students, a test that would ultimately evolve into the ACT college admissions exam. By 1960, Lindquist had spearheaded the development and patenting of the first optical scanner, and the invention had a tremendous impact on the expansion of large-scale achievement testing on a national scale (Peterson, 1983). Although it was Edward Thorndike and Lewis Terman who laid much of the groundwork for an industry of achievement testing in the United States during the first three decades of the 20th century, it can be argued that between 1930


and 1950 it was Lindquist who did the most to turn testing into something of an amalgam as not only an industry of testing practitioners, but also a field devoted to the improvement of this practice under the heading of “educational measurement.” In 1936 Lindquist had been one of the primary authors of what was essentially a “handbook” of achievement testing principles entitled The Construction and Use of Achievement Examinations. In 1951, it was Lindquist who edited the first edition of Educational Measurement, in which he also authored a chapter entitled “Preliminary Considerations in Objective Test Construction.” Lindquist’s conceptualization of educational measurement in this chapter is revealing. An educational achievement test is described as a device or procedure for assigning numerals (measures) to the individuals in a given group indicative of the various degrees to which an educational objective or set of objectives has been realized by those individuals. Whether or not an educational objective has been realized in any individual can be ascertained only through his overt behavior. (Lindquist, 1951; p. 142, emphasis added) In short, to the extent that a test produces a measure of achievement, it does so through the definition of test objectives that are written so that they can be related to specific, socially agreed-upon behavioral outcomes. Lindquist’s perspective here was consistent with a prevalent philosophy of science at the time, known as operationalism (Bridgman, 1927), and with a theory of learning premised on behaviorism13. Under Lindquist’s conceptualization, the data from any test represents a sample from some population of possible items which are themselves the best attempt at a predictive operationalization of some behavioral domain of interest14. Hence, the validity of educational measurement is about the successful generalization from test sample to behavioral domain15. 
It is the manifest data itself that is being scaled, not a psychological attribute of a student assumed to exist on a continuum. The inference of interest is not to a location on this continuum, but a distal observable outcome in the behavioral domain. In his commentary, Lindquist (1953) had described scaling as an activity that is undertaken for three purposes. The first (and in his view most important) was to facilitate comparisons between different forms of the same test or across a collection of tests of different subjects. For this purpose, a scale was needed primarily as a convenient, intermediary means of equating the raw scores from test forms that could differ in their difficulty. The second purpose of scaling was to allow for comparisons of differences in performance relative to the content of a test (i.e., criterion-referenced comparisons), and the third was to allow for comparisons that were relative to the population of subjects who were taking the test (i.e., normative comparisons). Although Lindquist appreciated the desirability of an absolute scale, and by extension the desirability of locating test items on such a scale, he was skeptical that such an approach was possible in the context of


achievement testing. He argued that there could be no such thing as an absolute location for an item. From his own experiences with the Iowa testing program, it had been evident to him that the apparent difficulty of an item typically varied from school to school and grade to grade as a direct consequence of unpredictable choices schools and teachers made in the organization of their curriculum. Thus, Lindquist appears to have rejected the response model approach to scaling that was just emerging in the early 1950s for two primary reasons. The first was that he regarded it as implausible that what we would today describe as parameter invariance would hold in the context of students with very different educational experiences. The second was his implicit commitment to an operationalist philosophy of measurement which began with definitions of test objectives that were “sacrosanct,” having been “handed down by agents of society.” By this logic, the removal of an item written to satisfy a test objective (because, for example, it did not demonstrate invariance across schools), would constitute a de facto change to the operationally defined measure. In taking this perspective, Lindquist was effectively divorcing educational scaling not only from the broader literature on psychological scaling that began with Fechner’s psychophysics, but even from the theoretical origins of the subject-centered approach taken by Galton. After all, when Galton had introduced his technique of converting ranks into normal unit deviates, he had described it as relative or indirect measurement because an instrument designed to transduce a targeted human attribute onto a numeric scale either did not exist, or was too time-consuming to administer efficiently on a large scale. 
However, because Galton assumed that the underlying attribute existed on a quantitative, normally distributed continuum, even his “relative” measurement could be claimed to produce a scale with equal intervals, so long as one was willing to accept his rationale. Not everyone was, of course, and through the mid-20th century, one can find recurring instances of a debate over the plausibility of what was being assumed about the structure of non-physical attributes, and the scales upon which they could be expressed (cf., Boring, 1920; Kelley, 1923). Indeed, aspects of this debate could still be gleaned from much of the published literature on test scaling during Lindquist’s era (e.g., Lord, 1954). Lindquist, by casting measurement in a domain sampling framework, was able to stake out his own unique conceptualization of measurement, and it is one that can seemingly remain agnostic to ontological questions about underlying psychological attributes and their structure. Such questions could be left for quantitative psychologists to haggle over when it came to psychological scaling. In educational testing contexts, Lindquist’s legacy, most evident in Kolen & Brennan (2004) and Kolen (2006), is the perspective that measurement and scaling are two conceptually and analytically distinct activities. Measurement happens through the operational construction of tests according to educational objectives. Scaling happens after the fact in the attempt to make test scores interpretable relative to these objectives. As for scale properties with respect to order, distance and origin? These are questions


that the Lindquist-inspired approach is not well-equipped to answer, and as such they are not taken up as questions worth asking. Although this strikes me as a fair characterization of the philosophy underlying Lindquist’s credo, it is worth noting that it was not one that was universally embraced at the time (cf., Tucker, 1952; Ebel, 1962; Flanagan, 1962; Gardner, 1962). Lord and Angoff, who seem to have shared much of Lindquist’s skepticism about the prospects for establishing educational scales with units that were analogous to those created for physical scales, were nonetheless well-acquainted with the methods that were being proposed for such purposes. In fact, Angoff (1971) seems to have viewed the structure of a scale as fundamental to the very meaning of scaling as an activity16. Both expressed some optimism for the prospects of item response theory to offer a model-based paradigm for educational scaling. For Angoff, the appeal of this new paradigm was the flexibility to simultaneously scale, norm and equate test scores; for Lord, the appeal seems to have come from establishing statistical models whose adequacy could be corroborated empirically. Interestingly, the first chapter of Lord and Novick’s classic Statistical Theory of Mental Test Scores in 1968 begins by connecting the model-based approach to measurement they introduce in their book to the evolving notions of fundamental measurement in psychology that had been presented in Torgerson (1958), Suppes and Zinnes (1963) and Luce and Tukey (1964). Whether the approach to measurement they had formulated for testing contexts can be viewed as “essentially” consistent with Suppes & Zinnes (1963) and Luce & Tukey (1964) (as Lord & Novick claim at the end of their opening chapter) is an interesting question, because it has implications for the kinds of properties that can be established for test scales and the ways that these properties might be evaluated. 
Following the publication of the 2nd edition of Educational Measurement in 1971, the mainstream literature on educational scaling has remained exclusively focused on the exposition and application of purely subject-centered methods, and in effect, contemporary scaling practice seems closely aligned with Lindquist’s operationalist credo (see, for example, Tong & Kolen, 2010). There is no longer much contact in the mainstream literature with either the historical foundations of educational scaling or its theoretical and philosophical rationale. Perhaps as a consequence, few contemporary practitioners of educational scaling would be capable of providing a coherent answer to the question of whether or when test score scales can support distinctions between students based solely on order, as opposed to distinctions based on order and distance (Briggs, 2013). It is also quite likely that a practitioner would be befuddled if faced with questions about the kind of measurement procedure that led to the scale, or the type of attribute that is being scaled.

Summary

Torgerson’s 1958 book, and the conceptual framework for scaling that it provides, came at an interesting pivot point in the history of scaling in educational


and psychological contexts. As we have seen, the earliest efforts to create a scale for a psychological attribute can be traced to the psychophysics of Gustav Fechner in 1860, and the correlational study of individual differences ushered in by Francis Galton near the turn of the 20th century. The Fechnerian and Galtonian traditions are classic examples of stimulus- and subject-centered approaches to scaling, and with the exception of Thurstone’s seminal efforts between 1925 and 1929, each developed for many decades somewhat independently of the other. The lack of contact between the two traditions was unfortunate, as they shared a number of similarities in their reliance on probability theory in general and the normal distribution in particular to motivate the measurement units of their respective scales. It was not until the publication of Guilford’s Psychometric Methods in 1936 that the two different traditions were presented together as different, but possibly complementary, techniques under the larger umbrella of mental measurement. The response model approach to scaling that had just appeared at the time of Torgerson’s book, and then developed rapidly thereafter, pointed toward a possible rapprochement between the stimulus- and subject-centered approaches. In the context of educational scaling, this rapprochement has never occurred, and in large part, I have argued, this is a legacy of Lindquist’s perspective in which measurement and scaling can be viewed as distinct and even independent activities. One consequence is that response modeling approaches are often viewed as just another variant of a subject-centered approach to scaling, but this misrepresents the motivation behind their initial development. This motivation was to simultaneously scale both subjects and stimuli, satisfy the conditions of invariance so clearly stipulated by Thurstone, and suggest the possibility of defining an interval for a scale unit through the tradeoff between persons and items. 
Lindquist, as I noted, had good reason to be skeptical that such ambitions could be realized. But he and those who have followed in his footsteps may have been a bit hasty in rejecting them out of hand. Scientific progress seems most likely to occur when theories are placed in competition with one another. When one leaves out or is ignorant of the theoretical underpinnings of scaling methods altogether, this can become a recipe for practices that are difficult to explain and/or defend. In this sense, new generations of psychometricians and educational measurers interested in the topic of scaling would be well served to revisit and reconsider the past before turning to the future.

Notes
1 I would like to thank Neil Dorans, Andy Maul and Bob Brennan for their comments on my initial draft of this chapter.
2 Prior to Torgerson’s book, the most complete treatment of scaling was to be found in Chapters 2–9 of J. P. Guilford’s textbook Psychometric Methods (Guilford, 1936; 1954).
3 Torgerson’s conceptualization of magnitude here is taken directly from Bertrand Russell. For more on this, see Michell (1993, 1999).

A History of Scaling 287

4 They would also fall within the Stevens definition, but not the more standard definition. It is outside the scope of this chapter to discuss in more detail the origins and distinctions between different conceptualizations of measurement. Michell (1999) is the seminal reference. For my own (evolving) perspective on the nature and meaning of measurement in the human sciences, see Briggs, Maul & McGrane (forthcoming) and Briggs (in press).
5 This was actually an insight that had already occurred to James Cattell and was first implemented by his student, Edward Thorndike, who was the first to use the approach to develop a scale for the measurement of handwriting legibility (Thorndike, 1910). But it was Thurstone who did the most to expand the approach into a more general framework.
6 Over time, Thurstone’s approach to psychological scaling has been taken up and expanded as part of the literature on the modeling of preferences and choice (Bradley & Terry, 1952; Luce & Edwards, 1958; Luce, 1959; Bock & Jones, 1968), as well as for contexts in which the attributes to be scaled could be multidimensional (Torgerson, 1952; Shepard, 1962).
7 Stevens (1957) objected to the logic of the variability judgment approach, which he referred to as “confusion scales,” and instead actively promoted quantitative judgment approaches as the superior alternative (Stevens, 1975).
8 More formally, the crux of the central limit theorem, first introduced by Pierre Laplace in 1777, and then further refined in 1811, is that if a sum is taken of a series of independent random variables, the probability distribution function of the resulting sum across infinite replications will converge to the normal distribution as the series being summed gets increasingly large. Most powerfully, so long as the number of variables combined to form the sum is large, the sum will follow a normal distribution even if the individual variables that comprise the sum do not.
9 For more on the unique affordances and possibilities for scales constructed using the Rasch family of response models see Andrich (1995), Fischer & Molenaar (1995), Humphry (2011), Briggs (2019) and Engelhard & Wind (Ch. 15, this volume).
10 Peterson, Kolen & Hoover (1989) devote a single paragraph to other “perspectives” on scaling. Kolen & Brennan (2004) provide a treatment of 5 pages in a 548-page book. In Kolen (2006), a section of perspectives on scaling is a shortened version of what was presented in Kolen & Brennan (2004). None of these publications reference Torgerson (1958).
11 This is likely an allusion to the Rasch Model tradition of educational scaling in which items that do not fit the model would need to be revised or removed before they can be included in an achievement test. Taken at face value, however, this is a puzzling statement. During test development a significant number of items are commonly removed because they do not fit a statistical model (e.g., the items are too hard or too easy, the items fail to discriminate, the items show evidence of bias). It is really just a question of what statistical model is being conceptualized, and how it is being used.
12 I want to thank H. D. Hoover for providing me with this background.
13 For a nice primer on the philosophical foundations of educational and psychological measurement, see Maul, Torres Irribarra & Wilson (2016).
14 Lindquist referred to tests for which it was impossible to sample from the behavioral domain as instances of “indirect measurement.” By contrast, in direct measurement one could create items that were taken directly from the behavioral domain. A test of typing speed would be a direct measure of the behavioral domain (e.g., efficient job performance); a test of social studies would be an indirect measure of the behavioral domain (e.g., regular participation in civic responsibilities as an adult).
15 This basic idea still lives on in Michael Kane’s conceptualization of a validity argument (Kane, 1992, 2006).
16 Angoff writes “A first requirement for the transmission of the scores is that an appropriate scale structure be defined. This process of definition will be denoted by the term scaling.”

288 Derek C. Briggs

References Andrich, D. (1978). A rating formulation for ordered response categories. Psychometrika, 43 (4), 561–573. https://doi.org/10.1007/BF02293814. Andrich, D. (1995). Distinct and incompatible properties of two common classes of IRT Models for Graded Response Models. Applied Psychological Measurement, 19(1), 101–119. Angoff, W. H. (1971). Scales, norms, and equivalent scores. In R. L. Thorndike (Ed.), Educational Measurement (2nd Edition). Washington, DC: American Council on Education, 508–597. Birnbaum, A. (1968). Some latent trait models and their use in inferring an examinee’s ability. In F. M. Lord & M. R. Novick (Eds.), Statistical theories of mental test scores. Reading, MA: Addison-Wesley, 397–479. Bock, R. D., & Jones, L. V. (1968). The measurement and prediction of judgment and choice. San Francisco, CA: Holden-Day. Boring, E. G. (1920). The logic of the normal law of error in mental measurement. The American Journal of Psychology, 31(1), 1–33. Bradley, R. A., & Terry, M. E. (1952). Rank analysis of incomplete block designs: I. The method of paired comparisons. Biometrika, 39, 324–345. Bridgman, P. W. (1927). The logic of modern physics. New York, NY: Macmillan. Briggs, D. C. (2013). Measuring growth with vertical scales. Journal of Educational Measurement, 50 (2), 204–226. Briggs, D. C. (2019). Interpreting and visualizing the unit of measurement in the Rasch model. Measurement, 146, 961–971. Briggs, D. C. (in press). Historical and conceptual foundations of measurement in the human sciences. New York: Routledge. Briggs, D. C., Maul, A., & McGrane, J. (forthcoming). On the nature of measurement. In L. Cook and M. Pitoniak (Eds.) Educational measurement, 5th Edition. Bulmer, M. G. (2003). Francis Galton: Pioneer of Heredity and Biometry. Baltimore, MD: Johns Hopkins University Press. Campbell, N. R. (1920). Physics, the elements. Cambridge, UK: Cambridge University Press. Campbell, N. R. (1928). 
An account of the principles of measurement and calculation. London: Longman, Green & Co. Comrey, A. L. (1950). A proposed method for absolute ratio scaling. Psychometrika, 15, 317–325. Coombs, C. H. (1950). Psychological scaling without a unit of measurement. Psychological Review, 57, 145–158. Coombs, C. H. (1964). A theory of data. New York: John Wiley & Sons. Ebel, R. L. (1962). Content standard test scores. Educational and Psychological Measurement, 22, 15–25. Fechner, G. T. (1860). Elemente der Psychophysik. Leipzig: Breitkopf and Hartel; English translation by H. E. Adler, 1966, Elements of Psychophysics, Vol. 1, D. H. Howes & E. G. Boring (Eds.), New York: Rinehart and Winston. Fischer, G. H. & Molenaar, I. W. (1995). Rasch Models: Foundations, Recent Developments, and Applications. New York: Springer. Flanagan, J. C. (1951). Units, scores and norms. In E. F. Lindquist (Ed.), Educational Measurement. Washington, DC: American Council on Education, 695–763. Flanagan, J. C. (1962). Symposium: Standard scores for achievement tests. (Discussion) Educational and Psychological Measurement, 22, 35–39.


Galton, F. (1875). Statistics by intercomparison with remarks on the law of frequency of error. Philosophical Magazine, 49, 33–46. Galton, F. (1883). Inquiries into Human Faculty and Its Development. London: Macmillan. Galton, F. (1889). Natural Inheritance. London: Macmillan. Gardner, E. F. (1962). Normative standard scores. Symposium: Standard scores for aptitude and achievement tests. Educational and Psychological Measurement, 22, 7–14. Gulliksen, H. (1950). Theory of mental tests. New York: Wiley. Guttman, L. (1944). A basis for scaling qualitative data. American Sociological Review, 7, 362–369. Guttman, L. (1950). Chapters 2, 3, 6, 8 and 9 in Stouffer, et al. (Eds.), Measurement and Prediction. Princeton, NJ: Princeton University Press. Hull, C. L. (1922). The conversion of test scores into series which shall have any assigned mean and degree of dispersion. Journal of Applied Psychology, 6, 298–300. Humphry, S. (2011). The role of the unit in physics and psychometrics. Measurement: Interdisciplinary Research and Perspectives, 9(1), 1–24. Jones, L. V. (1971). The nature of measurement. In R. L. Thorndike (Ed.), Educational Measurement, 2nd Edition. Washington DC: American Council on Education, 335-355. Jones, L. V., & Thissen, D. (2006). A history and overview of psychometrics. In C. Rao & S. Sinharay (Eds.), Handbook of Statistics Vol. 26: Psychometrics. London: Elsevier, 1–27. Kane, M. T. (1992). An argument-based approach to validity. Psychological Bulletin, 112, 527–535. Kane, M. (2006). Validation. In R. Brennan (Ed.), Educational measurement, 4th edition. Westport, CT: Praeger, 17–64. Kelley, T. L. (1923). The principle and techniques of mental measurement. The American Journal of Psychology, 34(3), 408–432. Kolen, M. J. (2006). Scaling and Norming. In R. Brennan (Ed.), Educational measurement, 4th edition. Westport, CT: Praeger, 155–186. Kolen, M. J., & Brennan, R. L. (2004). Test equating, scaling, and linking: Methods and practices. New York: Springer. 
Krantz, D. H., Luce, R. D., Suppes, P., & Tversky, A. (1971). Foundations of measurement, Vol. 1: Additive and polynomial representations. New York: Academic Press. Lazersfeld, P. F. (1950). Chapters 10 and 11 in Stouffer, et al. (Eds.), Measurement and Prediction. Princeton, NJ: Princeton University Press. Lindquist, E. F. (1951). Preliminary considerations in objective test construction. In E. F. Lindquist (Ed.), Educational Measurement. Washington, DC: American Council on Education, 119–158. Lindquist, E. F. (1953). Selecting appropriate score scales for tests. (Discussion) In Proceedings of the 1952 Invitational Conference on Testing Problems. Princeton, NJ: Educational Testing Service, 34–40. Lindquist, E. F., & Hieronymus, A. N. (1964). Iowa Test of Basic Skills: Manual for administrators, supervisors and counselors. New York: Houghton Mifflin. Lord, F. M. (1952). A theory of test scores. Psychometric Monographs, 1952, No. 7. Lord, F. M. (1953a). An application of confidence intervals and of maximum likelihood to the estimation of examinee’s ability. Psychometrika, 18, 57–76. Lord, F. M. (1953b). The relation of test score to the trait underlying the test. Educational and Psychological Measurement, 13, 517–548. Lord, F. M. (1954). Scaling. Review of Educational Research, 24, 375–393. Lord, F. M., & Novick, M. R. (1968). Statistical theories of mental test scores: Some latent trait models and their use in inferring an examinee’s ability. Reading, MA: Addison-Wesley.


Luce, R. D. (1958). A probabilistic theory of utility. Econometrica, 26(2), 193–224. https://doi.org/10.2307/1907587. Luce, R. D., & Edwards, W. (1958). The derivation of subjective scales from just noticeable differences. Psychological Review, 65(4), 222–237. https://doi.org/10.1037/h0039821. Luce, R. D., & Tukey, J. W. (1964). Simultaneous conjoint measurement: A new type of fundamental measurement. Journal of Mathematical Psychology, 1(1), 1–27. Maul, A., Mari, L., Torres Irribarra, D., & Wilson, M. (2018). The quality of measurement results in terms of the structural features of the measurement process. Measurement, 116, 611–620. McCall, W. A. (1939). Measurement. New York: Macmillan. Meehl, P. E. (1978). Theoretical risks and tabular asterisks: Sir Karl, Sir Ronald, and the slow progress of soft psychology. Journal of Consulting and Clinical Psychology, 46(4), 806–834. Michell, J. (1993). The origins of the representational theory of measurement: Helmholtz, Hölder, and Russell. Studies in History and Philosophy of Science Part A, 24(2), 185–206. https://doi.org/10.1016/0039-3681(93)90045-L. Michell, J. (1999). Measurement in psychology: Critical history of a methodological concept. Cambridge, UK: Cambridge University Press. Peterson, J. J. (1983). The Iowa Testing Programs. Iowa City: University of Iowa Press. Peterson, N. S., Kolen, M. J., & Hoover, H. D. (1989). Scaling, norming, and equating. In R. L. Linn (Ed.), Educational Measurement, 3rd edition. New York: Macmillan, 221–262. Rasch, G. (1960). Probabilistic models for some intelligence and attainment tests. Copenhagen: Danish Institute for Educational Research. Shepard, R. N. (1962). The analysis of proximities: Multidimensional scaling with an unknown distance function. Psychometrika, 27(2), 125–140. https://doi.org/10.1007/BF02289630. Sherry, D. (2011). Thermoscopes, thermometers, and the foundations of measurement. Studies in History and Philosophy of Science, 42, 509–524. Stevens, S. S. (1946). On the theory of scales of measurement. Science, 103(2684), 677–680. https://doi.org/10.1126/science.103.2684.677. Stevens, S. S. (1951). Mathematics, measurement and psychophysics. In S. S. Stevens (Ed.), Handbook of Experimental Psychology. New York: Wiley, 1–49. Stevens, S. S. (1956). The direct estimation of sensory magnitudes: loudness. The American Journal of Psychology, 69(1), 1–25. https://doi.org/10.2307/1418112. Stevens, S. S. (1957). On the psychophysical law. Psychological Review, 64(3), 153–181. https://doi.org/10.1037/h0046162. Stevens, S. S. (1975). Psychophysics: Introduction to its perceptual, neural, and social prospects. New York: Wiley. https://doi.org/10.2307/1421904. Stigler, S. M. (1986). The history of statistics: the measurement of uncertainty before 1900. Cambridge, MA: Harvard University Press. Suppes, P., & Zinnes, J. L. (1963). Basic measurement theory. In R. D. Luce, R. R. Bush, & E. Galanter (Eds.), Handbook of mathematical psychology. New York: Wiley. Terman, L. M. & Merrill, M. (1937). Measuring Intelligence. New York: Houghton Mifflin. Terman, L. M. & Merrill, M. (1960). Stanford-Binet Intelligence Scale. New York: Houghton Mifflin. Thorndike, E. L. (1910). The Measurement of the Quality of Handwriting. In E. L. Thorndike Handwriting. Teachers College Record 11 (pp. 86–151).


Thorndike, E. L., Woodyard, E., Cobb, M., & Bregman, E. O. (1927). The measurement of intelligence. New York: Bureau of Publications, Teacher's College, Columbia University. Thurstone, L. L. (1925). A method of scaling psychological and educational test. Journal of Educational Psychology, 16, 433–451. Thurstone, L. L. (1927a). Psychophysical analysis. American Journal of Psychology, 38, 368–389. Thurstone, L. L. (1927b). A law of comparative judgment. Psychological Review, 34, 273–286. Thurstone, L. L. (1927c). A mental unit of measurement. Psychological Review, 34, 415–423. Thurstone, L. L. (1928a). The absolute zero in intelligence measurement. The Psychological Review, 35(3), 175–197. Thurstone, L. L. (1928b). Attitudes can be measured. American Journal of Sociology, 33(4), 529–554. https://doi.org/10.1086/214483. Thurstone, L. L. (1938). Primary mental abilities. Chicago, IL: University of Chicago Press. Thurstone, L. L., & Chave, E. J. (1929). The measurement of attitude. Chicago, IL: University of Chicago Press. Tong, Y. & Kolen, M. (2010). Scaling: An ITEMS Module. Educational Measurement: Issues and Practice, 29(4), 39–48. Torgerson, W. S. (1952). Multidimensional scaling: I. theory and method. Psychometrika, 17, 401–419. Torgerson, W. S. (1958). Theory and methods of scaling. New York: Wiley. Tucker, L. R. (1948). A method for scaling ability test items in difficulty taking item unreliability into account. American Psychologist, 3, 309–310. Tucker, L. R. (1952). A level of proficiency scale for a unidimensional skill. American Psychologist, 7, 408. Wright, B. D. (1997). A history of social science measurement. Educational Measurement, Issues and Practice, 16(4), 33–45. https://doi.org/10.1111/j.1745-3992.1997.tb00606.x.

13 A HISTORY OF BAYESIAN INFERENCE IN EDUCATIONAL MEASUREMENT Roy Levy1 and Robert J. Mislevy

This chapter surveys the history of Bayesian inference in educational measurement and testing.2 In the present account, we interweave the history of Bayesian inference in educational measurement with the history of Bayesian inference more generally. Our focus will be on how Bayesian inference led to developments in educational measurement, but we will see the converse as well, as educational measurement and related areas were fertile grounds for the development of Bayesian inference. We first provide an overview of Bayesian inference, intended for readers unfamiliar with the key ideas and terminology. We then turn to the history of Bayesian inference, with sections on its origins and promotion from the late 18th Century, the subsequent rise of frequentist methods in the early part of the 20th Century, and then a revival of interest in Bayesian methods in the middle of the 20th Century. It is at this point that we join with the history of educational measurement with sections focusing on two key activities: scoring examinees and estimating measurement model parameters. The next two sections discuss the development in the 1970s of hierarchical approaches to modeling in educational measurement and the broader statistics community, and its application to estimating model parameters. We then briefly survey several key areas of application as Bayesian methods expanded from the 1970s to today. The following section reflects on the emergence of Markov chain Monte Carlo estimation, and the resulting explosive growth in Bayesian methods in educational measurement. The final two sections take stock of the past and the current state of affairs, and then look ahead to how Bayesian approaches are being employed on the frontiers of the field.

A History of Bayesian Inference 293

Overview of Bayesian Inference

In this section we offer an overview of the basic machinery of Bayesian inference and how it is currently viewed. Our treatment here is brief. Readers interested in Bayesian inference generally are referred to Gelman et al. (2013), Bernardo and Smith (2000), Jackman (2009), and Kaplan (2014). Readers interested in Bayesian inference in the context of measurement models are referred to Fox (2010), Levy and Mislevy (2016), Almond et al. (2015), Novick and Jackson (1974), and Lee (2007). The approach to reasoning under uncertainty now commonly known as Bayesian inference extends far beyond Bayes’ theorem, but it is a fitting place to begin. Bayes’ theorem states that for known values of variables x and unknown values of variables θ,

$$p(\theta \mid x) = \frac{p(x, \theta)}{p(x)} = \frac{p(x \mid \theta)\,p(\theta)}{p(x)} \propto p(x \mid \theta)\,p(\theta). \tag{1}$$

In explicating this expression, we proceed with the typical case in which x contains observed data and θ contains unknown parameters. The expression on the left-hand side, p(θ|x), is the posterior distribution, a term that reflects how it captures what is believed about the parameters after having incorporated the information in the data. The numerator in the first equality, p(x, θ), is the joint probability distribution of the data and parameters. The second equality follows from factoring this joint distribution into two terms: p(x|θ) is the conditional probability of the data given the parameters, which is treated as a likelihood function for the parameters when the data are known; and p(θ) is the prior distribution for the model parameters, or what is believed about the parameters before the data. The denominator is the marginal probability of the observed data under the specifications of the model: $p(x) = \sum_{\theta} p(x \mid \theta)\,p(\theta)$, with the sum being taken over all possible values of θ for discrete parameters, or $p(x) = \int p(x \mid \theta)\,p(\theta)\,d\theta$ in the case of continuous parameters. Because p(x) does not vary with different values of θ, it can be dropped from the expression above to yield a proportional relationship, as expressed on the right-hand side of (1). Viewing probabilities as expressions of belief (de Finetti, 1974), the posterior is an expression of what we believe about the unknown θ, having reasoned from the observed x through the model to revise initial beliefs also expressed as a probability distribution. As a distribution, posterior belief may be expressed, summarized, or communicated in the usual ways, including graphical displays such as plots of the density and scatterplots, or numerical summaries, such as point summaries of central tendency or variability, or interval summaries. We will see how, from these basic concepts and equations, a general approach to probabilistic inference in general and in educational measurement in particular has taken form.
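The discrete case of Bayes’ theorem can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the two latent classes ("master" and "nonmaster"), the prior of 0.5 for each, the response probabilities, and the three observed responses are all hypothetical values chosen for the example, and the function name `posterior` is likewise invented here.

```python
# Minimal sketch: Bayes' theorem with a discrete parameter theta.
# All numbers below are hypothetical and chosen only for illustration.


def posterior(prior, likelihood):
    """Return p(theta | x) for each discrete value of theta.

    prior:      dict mapping theta -> p(theta)
    likelihood: dict mapping theta -> p(x | theta) for the observed x
    """
    # Numerator of Bayes' theorem: p(x | theta) * p(theta) = p(x, theta).
    joint = {t: likelihood[t] * prior[t] for t in prior}
    # Denominator: p(x), the sum of the joint over all values of theta.
    p_x = sum(joint.values())
    return {t: joint[t] / p_x for t in joint}


# Hypothetical prior over two latent classes.
prior = {"master": 0.5, "nonmaster": 0.5}
# Hypothetical probability of a correct response within each class.
p_correct = {"master": 0.8, "nonmaster": 0.3}

# Observed data x: three item responses (1 = correct, 0 = incorrect),
# assumed conditionally independent given theta.
x = [1, 1, 0]
likelihood = {}
for t, p in p_correct.items():
    like = 1.0
    for response in x:
        like *= p if response == 1 else 1.0 - p
    likelihood[t] = like

post = posterior(prior, likelihood)
print(post)
```

Because the normalizing term p(x) is the same for every value of θ, the ordering of the posterior probabilities is already determined by the products p(x|θ)p(θ); dividing by p(x) only rescales them to sum to one.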

294 Roy Levy and Robert J. Mislevy

Selected History of Bayesian Inference: Origins to the 20th Century

In this and following sections we describe key historical developments of Bayesian inference with connections to educational measurement. Stigler (1986), Fienberg (2006), and McGrayne (2011) provided more detailed accounts of the history of Bayesian inference more generally. Bayes’ theorem itself dates at least to the publication of Reverend Thomas Bayes’ (1764) “An Essay Towards Solving a Problem in the Doctrine of Chances”.3 The paper was posthumously published through the efforts of Richard Price, who submitted the paper for publication in 1763, two years following Bayes’ passing. In modern framing, the essay focuses on a situation where the data are specified as following a binomial distribution with an unknown parameter that governs the probability of success, and that parameter has a uniform prior distribution. These ideas gained attention less through the work of Bayes, and more through the work of Laplace. Beginning with his 1774 paper, “Mémoire sur la probabilité des causes par les évènements” and extending through his book, Théorie analytique des probabilités, first published in 1812, Laplace popularized Bayesian ideas that would continue to influence statistical practice for over a century. What we now call “Bayesian” inference was not referred to as “Bayesian” at the time. Rather, it came to be referred to as “inverse probability” (Fienberg, 2006), with the qualifier “inverse” highlighting the notion of inferring backwards (from effects to causes, from data to parameters, from x to θ) coherently within a joint probability model that includes both x and θ. That is, we construct the model with a particular flow, from causes to effects, from parameters to data, from θ to x, appearing as the term p(x|θ). Once effects/data/x are observed, inference proceeds in the opposite direction, arriving at p(θ|x).
Though there were critics in the 19th Century, this mode of thinking was present if not pervasive in the statistical community up through to the beginning of the 20th Century, and what would later come to be termed Bayesian inference was a frequently employed if somewhat controversial perspective. (As we shall discuss further, a locus of controversy has been the prior probability p(θ) that is required in the inversion by way of Bayes’ theorem.)
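As a worked illustration of the setting attributed above to Bayes’ essay (binomial data with a uniform prior), suppose s successes are observed in n trials. The symbols s and n, and the Beta function B(·,·), are introduced here only for this sketch. Applying Bayes’ theorem gives

$$
\begin{aligned}
p(x \mid \theta) &= \binom{n}{s}\,\theta^{s}(1-\theta)^{n-s}, \qquad p(\theta) = 1 \ \text{for } 0 \le \theta \le 1, \\
p(x) &= \int_0^1 \binom{n}{s}\,\theta^{s}(1-\theta)^{n-s}\,d\theta = \binom{n}{s}\,B(s+1,\, n-s+1) = \frac{1}{n+1}, \\
p(\theta \mid x) &= \frac{p(x \mid \theta)\,p(\theta)}{p(x)} = \frac{\theta^{s}(1-\theta)^{n-s}}{B(s+1,\, n-s+1)}.
\end{aligned}
$$

The posterior is therefore a Beta(s + 1, n − s + 1) distribution, with mean (s + 1)/(n + 2). The inversion runs exactly in the direction described above: the model is specified as p(x|θ), and the observed x is used to arrive at p(θ|x).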

Bayesian Inference and the Rise of Frequentism in the Early 20th Century

Bayesian inference had its critics, but none that had an impact commensurate with that of Sir Ronald A. Fisher (McGrayne, 2011). Recognizing the terminology of the time, we can see Fisher taking straight aim at Bayesian inference, as with the unambiguous declaration in his 1925 book, Statistical methods for research workers:


For many years, extending over a century and a half, attempts were made to extend the domain of the idea of probability to the deduction of inferences respecting populations from assumptions (or observations) respecting samples. Such inferences are usually distinguished under the heading of Inverse Probability, and have at times gained wide acceptance. … [I]t will be sufficient…to reaffirm my personal conviction, which I have sustained elsewhere, that the theory of inverse probability is founded upon an error, and must be wholly rejected. (pp. 9–10)

It is difficult to overstate the influence that Fisher, along with others such as Neyman, had in criticizing Bayesian inference and promoting their own methods. The main criticism concerned the propriety of framing parameters as random and the associated propriety of whether a prior probability p(θ) should be used, or even had meaning. Though there were of course disagreements among Fisher, Neyman, and others, amalgamations of their work from the 1920s through to the middle of the century quickly codified into what is now often lumped together under the heading of frequentist inference, which is distinguished from Bayesian inference in treating unknown parameters as fixed, rather than as random. Quite distinct from the question of whether it was appropriate to do so, those wanting to employ Bayes’ theorem faced the question of how to do so. In particular, computing the normalizing factor p(x) could be practically challenging. This is not a major obstacle in simple problems, but it becomes prohibitively difficult as problems grow in size. As the field of statistics developed in the first half of the 20th Century, Bayesian inference was hampered both by theoretical criticisms and practical limitations. This was also a time of rapid development in test theory and psychometrics.
On the occasion of the 25th anniversary of the founding of Psychometrika, Gulliksen (1961) pointed out that “Many of the problems in test theory…are essentially problems of multivariate analysis in mathematical statistics” (p. 103). Gulliksen celebrated the idea of psychologists developing proficiency in statistics, and that statisticians were turning their attention to problems in test theory. We join in this celebration but hasten to point out that much of the development of statistical analyses in test theory has occurred during a time when frequentist methods have been dominant. This is not to say that Bayesian inference was dormant during this time. With Fisher and frequentism on the rise, the mantle of Defender of Bayesianism fell to Sir Harold Jeffreys, who sparred with Fisher in the 1930s over the nature of probability and the propriety of Bayesian inference. In a statistical world that was increasingly dominated by the inferential systems of Fisher, Neyman, Pearson, Wald, and others in the 1930s and 1940s, Jeffreys’ (1939) text Theory of probability served as a Bayesian outpost. While Jeffreys was the most prominent and public advocate for Bayesianism, the Italian scholar Bruno de Finetti was working in relative obscurity on topics


that would turn out to be enormously important for Bayesian methods, as well as for the connection between Bayesian inference and measurement modeling. Also known for espousing the subjective or epistemic view of probability, in which probability is an expression of beliefs or (un)certainty, de Finetti conducted groundbreaking work on exchangeability, a notion we elaborate conceptually as meaning that a set of variables is exchangeable if our beliefs about them are all the same. Working in the context of dichotomous variables, de Finetti (1931, 1937/1964) proved the theorem that bears his name, which has been extended to more general forms (see Bernardo & Smith, 2000). A general form of the theorem can be expressed for a set of J variables x1,…, xJ as

$$p(x_1, \ldots, x_J) = \int \prod_{j=1}^{J} p(x_j \mid \theta)\, p(\theta)\, d\theta. \tag{2}$$

Conceptually, the theorem states that we can always express the joint distribution of the variables (the left-hand side) in terms of a series of conditionally independent and identically distributed (i.i.d.) distributions for the variables individually conditional on a parameter, and a distribution for the parameter (the right-hand side). In Bayesian terms, on the right-hand side we have a conditional probability for the variables, p(xj|θ), and a prior distribution for the parameter introduced in that conditional distribution, p(θ). De Finetti’s representational theorem came to be seen as a powerful argument and tool. The argument is that exchangeability is the central assumption, which gives rise to the introduction of a parameter in the model, the (prior) distribution for the parameter, and the subsequent Bayesian calculations (e.g., Bernardo & Smith, 2000; Jackman, 2009). As a tool, it permits the analyst to specify models by using simpler, familiar i.i.d. conditional distributions for variables, and a distribution for the parameter(s) introduced in specifying those distributions. While the theorem says nothing about the forms of θ, p(θ), or p(xj|θ) in a particular problem—that is up to an inferrer to explore, drawing on theory, experience, and knowledge of the context at issue—it does ground such reasoning and exploration in the framework of mathematical probability. The right-hand side of de Finetti’s theorem may be represented graphically by the structure in Figure 13.1, which depicts the xs as being modeled as dependent on θ, and conditionally independent of one another given θ. Remarkably, this picture and structure is exactly that of many modern measurement models. Classical test theory (CTT), item response theory (IRT), factor analysis (FA), latent class analysis (LCA) and their extensions all share this basic structure, differing primarily in their distributional specifications for the xs and θ. 
At their core, such models view the xs as observables that are conditionally or locally independent of one another given a latent variable θ.4
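The structure on the right-hand side of de Finetti’s theorem can be checked numerically in a simple case. The sketch below assumes dichotomous xs that are Bernoulli given θ and a Beta(2, 2) distribution for θ; both choices, and the function names `log_beta` and `joint_prob`, are illustrative assumptions rather than anything specified in the chapter. Under these assumptions the integral has the closed form B(a + s, b + J − s)/B(a, b), where s is the number of ones among the J variables.

```python
import math
from itertools import product


def log_beta(a, b):
    """Log of the Beta function B(a, b), computed via log-gamma."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def joint_prob(xs, a=2.0, b=2.0):
    """Joint probability of a 0/1 sequence xs under the mixture form.

    Assumes x_j | theta ~ Bernoulli(theta), i.i.d. given theta, and
    theta ~ Beta(a, b) (a hypothetical choice). Integrating theta out
    gives B(a + s, b + n - s) / B(a, b), with s = sum(xs), n = len(xs).
    """
    s = sum(xs)
    n = len(xs)
    return math.exp(log_beta(a + s, b + n - s) - log_beta(a, b))


# Exchangeability: every ordering of two ones and three zeros
# receives the same joint probability.
orderings = [xs for xs in product((0, 1), repeat=5) if sum(xs) == 2]
probs = [joint_prob(xs) for xs in orderings]
print(len(orderings), max(probs) - min(probs))

# The xs are conditionally independent given theta, yet marginally
# dependent: p(x1 = 1, x2 = 1) differs from p(x1 = 1) * p(x2 = 1).
p_one = joint_prob((1,))
p_both = joint_prob((1, 1))
print(p_one, p_both, p_one * p_one)
```

The joint probability depends on the responses only through their count, which is the sense in which beliefs about the variables are "all the same," and the marginal dependence among the xs is what allows one observable to carry information about another through θ.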


FIGURE 13.1 Graphical representation of the right-hand side of de Finetti’s theorem, depicting conditional independence of the xs given θ. Alternatively, a graphical representation of the core structure of many measurement models. Reproduction of Figure 3.5 from Levy & Mislevy (2016). [Diagram: a single node θ connected to each of the nodes x1, x2, x3, …, xJ.]

But we are getting a little ahead of ourselves. The connection between measurement modeling and de Finetti’s work was not recognized at the time, due in no small part to the fact that measurement modeling was in its relative infancy. Further, the import of de Finetti’s work was not immediately recognized by the Bayesian community, such as it was. This would change in the 1950s.

Revival in the Mid-20th Century

The 1950s and 1960s saw broadening interest in Bayesian methods (Fienberg, 2006; McGrayne, 2011), marked by texts such as Good (1950), Savage (1954), and Raiffa and Schlaifer (1961). Raiffa and Schlaifer developed the notions and forms of conjugate prior distributions, wherein a prior distribution of a particular form combines with a likelihood to yield a posterior distribution of the same form. These situations yielded closed-form solutions, and could therefore be employed in practical work, because the calculations of the normalizing term p(x) in Equation 1 otherwise became impractical as problems became more complex. Further, the availability of a range of prior distributions helped to nudge Bayesians away from limiting the applications to priors that gave equal weight to all possible values (i.e., uniform priors), a strategy that was present in the original works of Bayes and Laplace (though Laplace did consider the more general form), and had been taken by some to be a requirement for the use of Bayesian inference. Another set of developments came in the form of combining evidence from across observations and across sources, such as through empirical Bayes methods (Robbins, 1956). Empirical Bayes methods use other (typically past) results to form the prior distribution, operating in a spirit similar to that articulated by Karl Pearson on the use of existing data viewed to be relevant by the analyst to form beliefs (Pearson, 1941; see also Fienberg, 2006). A striking example of this sort of work at this time was that done by John Tukey, along with David Brillinger and David Wallace, on election forecasting,

including real-time updating on election night (Fienberg, 2006; McGrayne, 2011). Various forms of data were available, including past voting results at various levels (e.g., by county), polls preceding election night, predictions from political scientists, and completed and partial returns flowing in during the night. These data were used to construct the prior distribution, which was particularly helpful in yielding estimates for locations with relatively less information. The analyses for any one location amounted to “borrowing strength” from others, to use Tukey’s term. It is also at this time that Bayesian ideas begin to explicitly make their way into quantitative psychology and educational measurement. In a landmark paper, Edwards, Lindman, and Savage (1963) noted that no textbook existed that covered all of these Bayesian ideas and procedures for a target audience of experimental (quantitative) psychologists. Furthermore, they doubted one would exist soon thereafter, on the grounds that “Bayesian statistics as a coherent body of thought is still too new and incomplete” (p. 193). Recognizing the void, Edwards et al. (1963) sought to “introduce psychologists to the Bayesian outlook in statistics” (p. 193) by covering key ideas and how they departed from those of frequentist procedures that had taken hold. The coming years would turn out to be an active time for development of a coherent body of thought, and such targeted books would arrive a decade later (Box & Tiao, 1973; Novick & Jackson, 1974). In educational measurement and related contexts Meyer (1964, 1966) advocated for the use of Bayesian methods, and argued for its role in inference, policy, and decision-making. These publications marked educational measurement and quantitative psychology as being among the first disciplines to give attention to Bayesian methods in the wake of the revival of Bayesian methods in the 1950s and 1960s. 
As we will see, this will not be the last time that education, and measurement in particular, played the role of early adopter and test bed for new Bayesian procedures. This is not to say that Bayesian ideas did not already exist in measurement and test theory, as reviewed in the next section; they just weren’t known as such.
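
The conjugate updating described earlier in this section can be sketched with the beta-binomial pair. This is a minimal illustration only: the function names, prior parameters, and counts are assumptions chosen for the example and do not come from the sources cited above. A Beta(a, b) prior combined with binomial data yields a Beta posterior, so the normalizing term never has to be computed.

```python
def beta_binomial_update(a, b, successes, failures):
    """Return the parameters of the Beta posterior.

    Prior: Beta(a, b). Data: binomial counts of successes and failures.
    Conjugacy means the posterior is again a Beta distribution.
    """
    return a + successes, b + failures


def beta_mean(a, b):
    """Mean of a Beta(a, b) distribution."""
    return a / (a + b)


# Uniform prior, Beta(1, 1): equal weight on all possible values.
uniform_post = beta_binomial_update(1, 1, successes=7, failures=3)

# A hypothetical informative prior, e.g. formed from past results in an
# empirical Bayes spirit: Beta(10, 10) acts like 20 earlier observations.
informed_post = beta_binomial_update(10, 10, successes=7, failures=3)

print(uniform_post, round(beta_mean(*uniform_post), 3))
print(informed_post, round(beta_mean(*informed_post), 3))
```

With the informative prior, the posterior mean sits closer to the prior mean of 0.5 than it does under the uniform prior, which is the practical effect of moving beyond uniform priors.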

Scoring in Measurement

In the 1920s, Truman Kelley conducted foundational work on scoring examinees in CTT. In one formulation, Kelley’s formula states that the best estimate for an examinee’s true score is given by

\[ \tilde{\theta}_i = \rho x_i + (1 - \rho)\mu_x = \mu_x + \rho(x_i - \mu_x), \tag{4} \]

where $x_i$ is the observed test score for examinee i, $\mu_x$ is the mean of the test scores, $\tilde{\theta}_i$ is the estimate of the true score (θ) for examinee i, and $\rho = \sigma_\theta^2 / \sigma_x^2 = \sigma_\theta^2 / (\sigma_\theta^2 + \sigma_e^2)$ is the reliability of the test, with $\sigma_\theta^2$, $\sigma_x^2$, and $\sigma_e^2$ denoting the variances of the true scores, observed test scores, and errors, respectively. This relation dates at least to Kelley in 1923 (p. 214).

The formula in Equation 4 takes on an empirical Bayes flavor when sample-based estimates of $\mu_x$ and ρ are used. The result is that the estimate for an examinee is not based purely on her or his observed score, but also on the scores of others through the mean, “borrowing strength” from the scores of other examinees. The formula effects a balancing between two sources of information: (1) the examinee’s score, and (2) the scores of everyone else, as captured by the mean. How much weight to put on each source is captured by the reliability, which in this light is seen as an expression of the evidentiary quality of the test: the larger the evidentiary quality (reliability), the more weight is given to the individual test score; the smaller this evidentiary quality, the more weight is given to the group mean. Kelley’s formula for the estimated true score in Equation 4 and the associated standard error of estimation,

\[ \sigma_{\theta \mid x} = \sigma_\theta \sqrt{1 - \rho}, \tag{5} \]

are typically framed as the result of regressing true scores on observed scores, but they can indeed be seen as an instance of Bayesian reasoning; the estimated true score and standard error of estimation are the mean and standard deviation of the posterior distribution for an examinee’s true score, assuming normality and known population parameters (Levy & Mislevy, 2016). This remarkable connection between Kelley’s formula and Bayesian inference was not recognized immediately. Robbins (1960) discussed the conditional probability of θ given x for scoring in CTT contexts but did not mention the connection to Kelley’s formula or that this explicitly was an instance of Bayesian reasoning. To our knowledge, Cronbach and Gleser (1965, p. 154) were the first to discuss the connection, with Novick⁵ (1969) being the first to show that starting from Bayesian principles leads to Kelley’s formula. A similar historical trace occurred in FA, which in a sense generalizes CTT.
Arguing for the use of Bayesian inference to estimate the latent variables, Bartholomew (1981) showed that, under the assumptions of normality of the latent variables and errors, the posterior distribution for an examinee’s latent variables is normal, and that the mean is equal to what are commonly referred to as “regression factor scores” (Thurstone, 1935). While Bayesian ideas lurked in CTT and FA scoring approaches dating to the 1920s and 1930s, an explicit description of it as a scoring mechanism appeared in an article by Calandra (1941), who noted difficulty in executing such scoring in the context of binomial models. As binomial models can be seen as the form of much educational and psychological testing data, difficulties in execution may be the reason the approach seems to have received no attention at that time. One of the first measurement models to transparently apply and recommend Bayesian reasoning to scoring was LCA. It is present in Lazarsfeld’s foundational work (1950, p. 430), and in Lazarsfeld and Henry’s classic 1968 text (section 3.5).
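
Scoring in a latent class model by Bayes’ theorem can be sketched as follows. The two-class setup, the class proportions, and the conditional response probabilities are hypothetical values chosen for illustration, not figures from Lazarsfeld’s work. The posterior probability of each class is the prior class proportion times the likelihood of the response pattern, divided by the normalizing term.

```python
def class_posterior(priors, item_probs, responses):
    """Posterior class probabilities via Bayes' theorem.

    priors[c]        : prior probability of latent class c
    item_probs[c][j] : P(x_j = 1 | class c)
    responses[j]     : observed 0/1 response to item j
    Items are assumed conditionally independent given class membership.
    """
    joint = []
    for prior, probs in zip(priors, item_probs):
        like = 1.0
        for p, x in zip(probs, responses):
            like *= p if x == 1 else (1.0 - p)
        joint.append(prior * like)
    total = sum(joint)  # the normalizing term p(x)
    return [j / total for j in joint]


priors = [0.5, 0.5]                 # hypothetical: masters, non-masters
item_probs = [[0.90, 0.80, 0.85],   # P(correct | master)
              [0.20, 0.30, 0.25]]   # P(correct | non-master)
post = class_posterior(priors, item_probs, responses=[1, 1, 0])
print([round(p, 3) for p in post])
```

The same calculation underlies scoring in the discrete-latent-variable extensions mentioned in this section; only the number of classes and the structure of the conditional probabilities change.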

Bayes’ theorem is used for scoring throughout the earliest applications of LCA in measurement contexts (Birnbaum & Maxwell, 1960/1965; Herman & Dollinger, 1966), and it continues to be used in more recently developed extensions or variants that employ discrete latent variables, including diagnostic classification models (Rupp, Templin, & Henson, 2010), Bayesian knowledge tracing (Corbett & Anderson, 1995), and Bayesian networks (Almond et al., 2015). Indeed, the use of the term “Bayesian” in the names of several of these models expresses that inference about examinees is facilitated by the use of Bayes’ theorem. At about this same time, Bayesian approaches to scoring were being developed and studied in IRT for a variety of models, data types, and contexts (e.g., Birnbaum, 1969; Bechtel, 1966; Owen, 1969; Samejima, 1969). The use of the posterior mode (Samejima, 1969) and the posterior mean (Birnbaum, 1969) as estimators for an examinee’s θ also emerged at this time. A second emerging concern was the difficulty of evaluating the integrals involved. A popular solution was introduced by Bock and Aitkin (1981), who developed the use of numerical quadrature approaches, which made the computation of the posterior mean tractable. The use of Bayesian estimation for θs in IRT solved some critical practical problems. In particular, the maximum likelihood estimates (MLEs) for so-called perfect response patterns (all 1s or all 0s on dichotomous items) are infinite. Whereas multiple maxima may exist, corrections to MLEs may introduce biases, and numerical estimation methods may fail, the use of Bayesian inference for scoring can solve all these problems, in particular yielding finite estimates (Lord, 1986).
This represents but one instance in which Bayesian inference offers ways to overcome difficulties with frequentist approaches, even when viewed from a frequentist perspective.⁶ From the vantage point of the present, we see the strong connection between Bayesian modeling and the central measurement goal of producing scores for examinees. Popular psychometric models that emerged in the 20th Century have precisely the form that results from applying de Finetti’s theorem to a set of exchangeable observables (Figure 13.1). Though they were not explicitly derived as such, these models can be cast as resulting from the application of that distinctly Bayesian thinking: asserting exchangeability motivates the form, in which the observables are modeled as dependent on a parameter that renders them conditionally independent. The model is a story expressed in a probability framework, for the observables conditional on the parameter, p(x|θ), and for the parameter, p(θ). The terms of this story are precisely the ingredients needed to take the next step, to facilitate inference about the parameter once data are observed (in measurement, scoring) via Bayes’ theorem as per Equation 1. This is not the last we will hear from de Finetti’s exchangeability theorem. As we will see, it continued to play a crucial role in the development of Bayesian methods generally, and in measurement modeling.
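
The point about perfect response patterns can be illustrated with a small sketch. It assumes a Rasch model, a standard normal prior for θ, and five hypothetical item difficulties. It uses an evenly spaced grid as a simple stand-in for the quadrature rules used in practice, so it is not a reproduction of any cited procedure. For an all-correct pattern the likelihood keeps increasing in θ, so no finite MLE exists, whereas the posterior mean is finite.

```python
import math


def rasch_prob(theta, b):
    """P(correct) under the Rasch model for ability theta, difficulty b."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))


def likelihood(theta, responses, difficulties):
    """Likelihood of a 0/1 response pattern at a given theta."""
    like = 1.0
    for x, b in zip(responses, difficulties):
        p = rasch_prob(theta, b)
        like *= p if x == 1 else (1.0 - p)
    return like


def eap_score(responses, difficulties, n_points=81, lo=-6.0, hi=6.0):
    """Posterior mean and SD of theta under a standard normal prior,
    approximated by summing over an evenly spaced grid."""
    step = (hi - lo) / (n_points - 1)
    grid = [lo + k * step for k in range(n_points)]
    weights = [math.exp(-0.5 * t * t) * likelihood(t, responses, difficulties)
               for t in grid]
    total = sum(weights)
    mean = sum(t * w for t, w in zip(grid, weights)) / total
    var = sum((t - mean) ** 2 * w for t, w in zip(grid, weights)) / total
    return mean, math.sqrt(var)


difficulties = [-1.0, -0.5, 0.0, 0.5, 1.0]  # hypothetical item difficulties
perfect = [1, 1, 1, 1, 1]

# The likelihood is still rising at large theta: the MLE is not finite.
still_rising = (likelihood(6.0, perfect, difficulties)
                > likelihood(3.0, perfect, difficulties))

# The posterior mean, by contrast, is finite.
eap, post_sd = eap_score(perfect, difficulties)
print(still_rising, round(eap, 2), round(post_sd, 2))
```

The prior is what keeps the estimate finite here: it contributes information that the response pattern alone cannot supply.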

Early Efforts in Estimating Additional Model Parameters

By the late 1960s and early 1970s, the measurement community had recognized that Bayesian approaches to scoring were operative in some traditions (CTT, LCA) and was on the verge of recognizing or introducing them in others (IRT, FA). Shortly thereafter, psychometricians began to use Bayesian methods for estimating other parameters, such as measurement model parameters and the parameters that govern the distributions of the examinee parameters and measurement model parameters.⁷ Shortly after Novick (1969) connected Bayesian inference and Kelley’s formula in CTT, Novick et al. (1971) provided an expanded treatment that developed Bayesian approaches for inferences about the true scores as well as true score variance, observed score variance, and reliability. Interestingly, Novick et al. (1971) motivated their development of Bayesian approaches by noting that frequentist approaches can yield negative estimates of the true score variance. A similar concern motivated Martin and McDonald’s (1975) paper developing Bayesian approaches in exploratory (unrestricted) FA to address the possibility of Heywood cases (zero or negative unique variances). This work appeared a few years after Bayesian approaches to FA appeared in the economics literature as a method to resolve rotational indeterminacies (Kaufmann & Press, 1973). More developments in Bayesian approaches to fitting measurement models to estimate parameters were to arrive at this time and in the years to come. But to properly tell that story, we need to discuss a new generation of Bayesian modeling that was taking hold.

Rise of Hierarchical Modeling

A key development in the late 1960s and early 1970s (perhaps the key development) was that of a hierarchical approach to modeling and reasoning. One way to view hierarchical models is as a process of constructing models. We begin by seeking to specify a distribution for the data, rather than parameters per se (Bernardo & Smith, 2000). Asserting exchangeability of the data, we can invoke de Finetti’s theorem and structure the distribution of the data conditional on parameters, as $p(x \mid \theta)$. If the parameters are unknown, then a prior distribution is required. In complicated cases with multiple parameters, this becomes more challenging. Suppose $\theta = (\theta_1, \ldots, \theta_n)$ are these parameters. In a landmark paper, Lindley and Smith (1972) argued that if we assert exchangeability with respect to the parameters, we may proceed by invoking de Finetti’s theorem and specifying a distribution for each element, conditional on other, higher-order parameters or hyperparameters, denoted as $\theta_H$. To connect this to an earlier discussion, let us revisit the situation of scoring examinees. Suppose we have n examinees, where $\theta_i$ denotes the true score or latent variable for examinee i. Treating the θs as exchangeable warrants specifying a common distribution, so that the specification for the θs becomes $p(\theta_i \mid \theta_H)$; further assuming, say, normality (as is common in CTT), $\theta_H = (\mu_\theta, \sigma_\theta^2)$ and the specification becomes $\theta_i \sim N(\mu_\theta, \sigma_\theta^2)$. This is the second level in the hierarchy, with the first being the conditional probability of the observable data, again assuming normality, $x_i \sim N(\theta_i, \sigma_e^2)$. The hierarchy structures the synthesis of all the information from across the data. In the context of scoring, information from the data across all examinees flows up the hierarchy, from the xs to the θs and up to $\mu_\theta$ and $\sigma_\theta^2$. This information then flows down the hierarchy to help shape estimates for the individual θs, such as through Kelley’s formula in (4) and (5). Here we see the hierarchical structure facilitating the “borrowing of strength” across observations, in the spirit of Kelley’s and Tukey’s procedures. As such, CTT may be framed as a hierarchical model, with the exchangeability assumptions regarding the θs generating the hierarchical structure. The same can be said for other measurement models, including IRT, FA, and LCA. So far, then, we have considered how we may view our measurement models as exhibiting a hierarchical structure for the examinee parameters.

The development of the hierarchical approach was closely intertwined with other lines of work in educational measurement. Up to this time, Lindley had been involved in projects led by Melvin Novick, and conducted research on Bayesian approaches to the problem of predicting performance at colleges using test scores (Lindley, 1969b, 1969c, 1970). Importantly, this context involved specifying a regression model for multiple groups (one for each college), some too small to support stable estimates from their data alone. This situation was ripe for capitalizing on assuming exchangeability of the regression parameters over groups (e.g., Jackson, Novick, & Thayer, 1971; Novick, 1970). Lindley and Smith’s 1972 paper brought these methods to the attention of a broader statistical audience.
However, it would take more than two decades before these approaches gained popularity in the broader Bayesian community (McGrayne, 2011) to become a dominant approach to formulating Bayesian measurement models (Fox, 2010; Levy & Mislevy, 2016) as well as statistical models more broadly (Gelman et al., 2013; Jackman, 2009). Having emerged in the context of educational research, it is unsurprising that hierarchical approaches would be adopted relatively earlier by those in the educational research community. Following the foundational work just reviewed, Bayesian approaches to prediction models in educational settings continued to attract interest and develop in the following years, well before the hierarchical approach took hold in the larger statistical community (e.g., Gross, 1973; Novick, Lewis, & Jackson, 1973; Rubin, 1981). As discussed next, there was another area in educational measurement that was quick to leverage Bayesian hierarchical approaches.
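
The two-level normal specification described in this section, with true scores distributed around a common mean and observed scores distributed around the true scores, can be sketched numerically. The scores and the error variance below are hypothetical, and the hyperparameters are simply estimated from the data and plugged in (an empirical Bayes shortcut). A full hierarchical analysis would instead place prior distributions on them. The sketch checks that the posterior mean and standard deviation for each examinee coincide with Kelley’s formula in (4) and the standard error in (5).

```python
import math
from statistics import mean, pvariance


def posterior_true_score(x_i, mu_theta, var_theta, var_error):
    """Posterior mean and SD of theta_i given x_i in the two-level model
    theta_i ~ N(mu_theta, var_theta), x_i | theta_i ~ N(theta_i, var_error)."""
    precision = 1.0 / var_theta + 1.0 / var_error
    post_mean = (mu_theta / var_theta + x_i / var_error) / precision
    return post_mean, math.sqrt(1.0 / precision)


# Hypothetical observed scores; the error variance is treated as known.
scores = [42.0, 55.0, 61.0, 48.0, 70.0, 52.0]
var_error = 20.0

# Information flows "up": hyperparameters are estimated from all examinees.
mu_hat = mean(scores)
var_theta_hat = max(pvariance(scores) - var_error, 1e-9)
rho_hat = var_theta_hat / (var_theta_hat + var_error)

# ... and back "down" to shape each examinee's estimate.
results = []
for x in scores:
    post_mean, post_sd = posterior_true_score(x, mu_hat, var_theta_hat, var_error)
    kelley = rho_hat * x + (1.0 - rho_hat) * mu_hat           # Equation 4
    se = math.sqrt(var_theta_hat) * math.sqrt(1.0 - rho_hat)  # Equation 5
    results.append((post_mean, post_sd, kelley, se))

for post_mean, post_sd, kelley, se in results:
    print(round(post_mean, 2), round(kelley, 2), round(post_sd, 2), round(se, 2))
```

Each estimate is pulled from the observed score toward the group mean, with the amount of shrinkage governed by the estimated reliability.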

Application of Hierarchical Modeling to Estimating Measurement Model Parameters

Having recognized the logic of a hierarchical construction in the estimation of true scores inherent in Kelley’s formula, Jackson (1972) developed hierarchical approaches to estimating other parameters, including error variances in CTT. A decade later, psychometricians began to leverage hierarchical approaches to model measurement model parameters in FA (Lee, 1981) and IRT models of varying types and distributional assumptions (e.g., Jansen, 1986; Mislevy, 1986; Swaminathan & Gifford, 1982, 1985, 1986; Tsutakawa & Lin, 1986). These authors emphasized the practical benefits of Bayesian over ML estimation, including obtaining reasonable estimates of often hard-to-estimate parameters, such as discrimination and lower-asymptote parameters in IRT models. Early attempts to stabilize such estimates were fairly ad hoc, and the rise of Bayesian approaches in the 1980s represents a model-based solution to this practical problem (Lord, 1986). Importantly, the success of Bayesian approaches in this space is not because they are a computational trick. Rather, it is because Bayesian approaches allow the analyst to bring to bear substantive information about the parameters into the model directly (e.g., that variances are nonnegative, that discrimination parameters are not infinite) or by “borrowing strength” across the hierarchical construction (e.g., Martin & McDonald, 1975; Mislevy, 1986).

Expansion of Applications

The mid-century revival of Bayesian inference led the measurement community to look backward and forward: backward to recast earlier developments in Bayesian terms, and forward to advance the field of measurement by way of using Bayesian approaches. The following subsections briefly review areas of application of Bayesian methods in educational measurement and statistics that drew attention in the 1970s and 1980s and exist today in operational assessment as well as research programs.

Criterion-Referenced Testing

In the 1970s, researchers began investigating Bayesian approaches to tackle problems in criterion-referenced testing. This work was closely related to the research on prediction reviewed above. Bayesian efforts addressed the principal problem of classifying students with respect to the criterion, as well as problems of determining test length, setting cut scores, evaluating reliability, and marrying probabilistic classifications with utilities in a decision-theoretic frame (e.g., Hambleton & Novick, 1973; Huynh, 1976; Novick & Lewis, 1974; Swaminathan, Hambleton, & Algina, 1975).

Computerized Adaptive Testing

Owen (1969, 1975) pioneered Bayesian approaches to updating beliefs about examinees and task selection in adaptive testing. The latter echoes work done by Raiffa (1961), who viewed task selection as an exercise in sequential design amenable to Bayesian inference. These approaches were subsequently applied and studied beginning in the 1970s, with continued developments for the role of Bayesian inference in these and other aspects (e.g., Jensema, 1974; Urry, 1977; Weiss & McBride, 1984). We note that even if frequentist methods are used in CAT, the logic of updating beliefs about examinees in CAT may be seen as quite naturally Bayesian: combine what was believed before this item was administered (i.e., the prior) with the information in the just-observed response (the data) to yield updated beliefs (the posterior).
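
The updating logic just described can be sketched as a loop in which each posterior becomes the prior for the next item. This is an illustrative sketch under stated assumptions, not Owen’s procedure: it assumes a Rasch model, a standard normal prior held on a grid, a hypothetical item pool, and a deterministic stand-in for the examinee’s responses. The next item is the one whose difficulty is closest to the current posterior mean, which under the Rasch model is the item with the most information at that point.

```python
import math


def rasch_prob(theta, b):
    """P(correct) under the Rasch model."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))


def update(grid, prior, b, x):
    """Bayes' theorem on a grid: prior times likelihood, renormalized."""
    post = []
    for t, w in zip(grid, prior):
        p = rasch_prob(t, b)
        post.append(w * (p if x == 1 else 1.0 - p))
    total = sum(post)
    return [w / total for w in post]


def posterior_mean(grid, weights):
    return sum(t * w for t, w in zip(grid, weights))


def next_item(theta_hat, pool):
    """Choose the remaining item with difficulty closest to the estimate."""
    return min(pool, key=lambda b: abs(b - theta_hat))


grid = [-4.0 + 0.1 * k for k in range(81)]
raw = [math.exp(-0.5 * t * t) for t in grid]   # standard normal prior
norm = sum(raw)
belief = [w / norm for w in raw]

pool = [-2.0, -1.0, -0.5, 0.0, 0.5, 1.0, 2.0]  # hypothetical item pool
history = []
for _ in range(4):
    theta_hat = posterior_mean(grid, belief)
    b = next_item(theta_hat, pool)
    pool.remove(b)
    x = 1 if b < 0.8 else 0                    # hypothetical examinee
    belief = update(grid, belief, b, x)        # posterior becomes next prior
    history.append((b, x, round(posterior_mean(grid, belief), 2)))

print(history)
```

Each tuple in `history` records the item administered, the response, and the updated posterior mean, so the path of the estimate can be followed item by item.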

Intelligent Tutoring Systems

A similar logic can be applied to analyses in intelligent tutoring systems, which adapt the next experience (e.g., a task, a hint, some instruction) for the student based on their recent performance. Formal Bayesian approaches emerged in the 1990s, as researchers began employing Bayesian networks in intelligent tutoring system contexts (e.g., Mislevy, 1994; Villano, 1992). Corbett and Anderson (1995) melded Bayesian updating with a mathematical learning model to produce an approach to student learning across repeated attempts on tasks called Bayesian knowledge tracing. Reye (2004) demonstrated how this and other intelligent tutoring system approaches could be framed in terms of dynamic Bayesian networks. These networks remain a popular approach for modeling data in these systems and related environments with similar longitudinal structures, such as game-based assessments (see Alkhatlan & Kalita, 2019; Reichenberg, 2018).
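
The combination of Bayesian updating with a learning model in Bayesian knowledge tracing can be sketched in a few lines. The update below follows the common presentation of the model with slip, guess, and learning probabilities. The function name, parameter values, and response sequence are hypothetical and chosen only for illustration.

```python
def bkt_update(p_known, correct, p_slip, p_guess, p_learn):
    """One step of Bayesian knowledge tracing.

    First apply Bayes' theorem to the observed response, then allow for
    learning before the next opportunity.
    """
    if correct:
        num = p_known * (1.0 - p_slip)
        den = num + (1.0 - p_known) * p_guess
    else:
        num = p_known * p_slip
        den = num + (1.0 - p_known) * (1.0 - p_guess)
    posterior = num / den
    return posterior + (1.0 - posterior) * p_learn


# Hypothetical starting belief, parameters, and response sequence.
p = 0.3
trace = []
for correct in [1, 0, 1, 1]:
    p = bkt_update(p, correct, p_slip=0.1, p_guess=0.2, p_learn=0.15)
    trace.append(p)

print([round(v, 3) for v in trace])
```

The belief that the skill is known rises after correct responses and falls after the incorrect one, while the learning term nudges it upward at every opportunity.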

Modeling Item Families

Concurrently, researchers such as Embretson (1984) were drawing on developments in cognitive psychology to define constructs, construct items, and build measurement models to analyze resulting data. An idea particularly suited to Bayesian treatment was generating families of items (by humans or algorithms) around theory-based schemas (e.g., Gierl & Lai, 2012), and modeling item parameters conditional on cognitively-relevant features (e.g., Glas & van der Linden, 2001; Johnson & Sinharay, 2005). Bayesian hierarchical modeling correctly propagates uncertainty from within-family item-parameter variation to inferences about examinees and item selection. What’s more, the complexity of models employing variables representing aspects of tasks as explanatory covariates led researchers to turn to Bayesian approaches in some cases on purely pragmatic grounds: models that were otherwise prohibitively difficult could be employed using Bayesian techniques (e.g., Geerlings, Glas, & van der Linden, 2011; Jeon, Draney, & Wilson, 2015).

Missing Data Modeling

A few years after the promotion of hierarchical models, the statistics community saw another development with strong connections to educational measurement in the form of missing data analyses. Rubin (1976) laid out a framework that has since become the standard way of thinking about and addressing missing data. Subsequent work focused on Bayesian approaches to nonresponse in surveys (Rubin, 1977), and the development of multiple imputation procedures (Rubin, 1978), which rely on Bayesian approaches to impute missing values that can then be used to support Bayesian or frequentist inference. This approach is the grounding for procedures used to analyze data from large-scale surveys where the targeted inferences are about groups, rather than individuals, such as the National Assessment of Educational Progress (NAEP; Oranje & Kolstad, 2019). Beginning with the 1983–1984 administration of NAEP, researchers introduced the “plausible values” method to this end (Mislevy, Beaton, Kaplan, & Sheehan, 1992). Rather than seeking point estimates of examinee θs, this method involves imputing multiple values for examinee proficiency variables in the manner of multiple imputation from the appropriate distributions. Plausible values and related Bayesian methods continue to be used and researched in the context of large-scale assessments (Johnson & Jenkins, 2005; Oranje & Kolstad, 2019).
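
The idea of imputing multiple values rather than reporting point estimates can be sketched with simulated data. This is a deliberately simplified sketch, not the operational NAEP procedure. It assumes a normal model with known population parameters and hypothetical variances and sample sizes, and it omits the conditioning on background variables used in practice. It illustrates one commonly cited motivation for plausible values: a group-level statistic such as the variance of proficiency is understated when computed from shrunken point estimates, but is recovered when computed from draws from each examinee’s posterior.

```python
import random
from statistics import mean, pvariance

random.seed(1)

# Hypothetical population and test: theta ~ N(0, 1), x = theta + error.
var_theta, var_error = 1.0, 0.5
rho = var_theta / (var_theta + var_error)
n, n_sets = 5000, 5

thetas = [random.gauss(0.0, var_theta ** 0.5) for _ in range(n)]
xs = [t + random.gauss(0.0, var_error ** 0.5) for t in thetas]

# Posterior for each examinee under the normal model with mean 0.
post_means = [rho * x for x in xs]
post_sd = (var_theta * (1.0 - rho)) ** 0.5

# Group variance computed from point estimates is too small ...
var_from_points = pvariance(post_means)

# ... whereas each set of plausible values is a draw from the posteriors,
# and the group statistic is averaged over the sets.
var_from_pvs = mean(
    pvariance([random.gauss(m, post_sd) for m in post_means])
    for _ in range(n_sets)
)

print(round(var_from_points, 2), round(var_from_pvs, 2))
```

In this setup the variance based on plausible values should land near the true value of 1.0, while the variance based on posterior means falls well below it.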

Bayesian Networks / Belief Networks / Graphical Models

In the 1980s, work in artificial intelligence began to develop in ways that would soon influence educational measurement. This work concerned expert systems, which are programs intended to mimic or reflect expertise, reasoning, and decision making. Statisticians of a Bayesian bent began to advocate for the use of what have been referred to as Bayesian networks, belief networks, and graphical models (e.g., Pearl, 1988, 2018). Key to this development was building around theoretically- and experientially-motivated relationships that could take advantage of conditional independence structures. It does not escape notice that this high-level phrase describes the basic models of measurement and assessment. This work helped to underscore the role of Bayesian inference and these networks/graphical models as a mechanism for inference and evidentiary reasoning more generally, including in assessment (Mislevy, 1994). The use of Bayesian networks in assessment followed shortly thereafter. Early applications included the diagnostic and intelligent tutoring system environments noted above, and have since expanded to simulation- and game-based assessments (Culbertson, 2016; Reichenberg, 2018). Additionally, it is in this space of graphical representations of networks that we see certain developments in procedures for updating the networks that were to have a profound impact on all of Bayesian modeling.

The Era of Markov Chain Monte Carlo Estimation

By the late 1980s and early 1990s, Bayesian methods in measurement had shown promise for solving practical problems and offering a coherent approach to modeling and inference. However, they still faced a major obstacle in that executing a Bayesian approach could be computationally challenging. Closed-form solutions for posterior distributions that were available in limited situations did not extend to more complicated measurement models. It is at this time that the Bayesian community discovered that Markov chains could be used to simulate values that, with a sufficiently long chain, could be viewed as draws from the posterior distribution (see Robert & Casella, 2011, and McGrayne, 2011, for extended accounts of this history). The breakthrough came with the work of Gelfand and Smith (1990), who tied preceding ideas and efforts together, and showed how Gibbs sampling provided a way to estimate the posterior distribution in a Bayesian analysis. At the same time, the research program on graphical modeling emerging from the artificial intelligence community (reviewed above) bore additional fruit, which was to have wide-ranging impact, in the form of what was to become the BUGS software project. The goal was to develop a general approach and program that could provide Bayesian inference for a wide class of models rather than being limited to particular situations (Lunn, Spiegelhalter, Thomas, & Best, 2009). The combination of these efforts left Bayesians in an exciting but unfamiliar spot: with a very flexible set of procedures that in principle could be leveraged for estimating posterior distributions in otherwise intractable situations, and software to implement these procedures for an arbitrarily specified model. These revelations touched off a flurry of activity on the theory and practice of MCMC.
It is difficult to overstate the transformative impact this work has had for the use of Bayesian approaches in fitting models, including measurement models. Prior to the advent of MCMC, Bayesian approaches could be seen as conceptually elegant but often impractical, as it was difficult at best to execute a Bayesian analysis outside of simple situations with closed-form solutions and doing so might only be viable through approximations. With MCMC, Bayesians could practically work with a much broader class of models, including models that could not be employed in frequentist settings. In measurement modeling, Albert (1992; Albert & Chib, 1993) conducted seminal work showing how posterior distributions for parameters in IRT models could be estimated using MCMC. In the finance literature, Geweke and Zhou (1996) applied a similar estimation approach to estimation in FA. A series of papers in the late 1990s marked a turning point for MCMC in the measurement modeling community, with developments in FA (Arminger & Muthén, 1998; Scheines, Hoijtink, & Boomsma, 1999; Shi & Lee, 1998), LCA (Hoijtink, 1998; Hoijtink & Molenaar, 1997), and IRT (Patz & Junker, 1999a, 1999b). The floodgates now opened, subsequent years saw an explosive rise in the use of Bayesian models and MCMC estimation in several ways. First, a litany of

research showed how existing models could be estimated using MCMC, comparing the results to those from frequentist estimation. Second, research was conducted on the performance of Bayesian models for doing the sort of work we need to do in assessment (e.g., scoring, classification, investigating differential functioning). Third, new models that might otherwise be intractable without MCMC were developed across a variety of modeling paradigms for a variety of purposes (see Levy, 2009, and Levy & Mislevy, 2016, for additional references and discussion). Fourth, as a flexible approach to fitting models, MCMC also supports ways to evaluate, critique, and compare measurement models, through approaches including residual analyses, posterior predictive checks, information criteria, and Bayes factors (Levy & Mislevy, 2016; Sinharay, 2016). Such tools flesh out the Bayesian toolkit, permitting analysts to evaluate their models in ways that are comparable to, or in certain ways advantageous over, conventional methods. As examples, posterior predictive checks support examining a variety of aspects of model-data fit without relying on asymptotic or derived sampling distributions, and Bayes factors offer notions of evidence in favor of models, not just against models as in hypothesis testing. Prior to the advent of MCMC, researchers had yet to harness the full power of Bayesian approaches in situations where they might be desirable, in particular in the propagation of uncertainty when estimating models in stages. With MCMC, Bayesian measurement models of increasing complexity could be employed, and could be integrated with other design features, such as the multilevel nature of groupings of examinees, complex sampling designs, and structural relationships of proficiency variables to covariates and outcomes, all the while expressing and propagating uncertainty (Johnson & Jenkins, 2005). 
Two decades into the age of MCMC, the growth of Bayesian models in measurement continues unabated.
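
The core idea of MCMC, simulating a chain whose values can be treated as draws from the posterior, can be sketched for a single examinee’s θ. The sketch uses a random-walk Metropolis sampler, chosen here for brevity and not necessarily the algorithm used in the work cited above. The Rasch model, the standard normal prior, the item difficulties, the responses, and the tuning values are all assumptions made for illustration.

```python
import math
import random


def log_posterior(theta, responses, difficulties):
    """Unnormalized log posterior: standard normal prior + Rasch likelihood."""
    lp = -0.5 * theta * theta
    for x, b in zip(responses, difficulties):
        p = 1.0 / (1.0 + math.exp(-(theta - b)))
        lp += math.log(p if x == 1 else 1.0 - p)
    return lp


def metropolis(responses, difficulties, n_iter=20000, step=1.0, seed=0):
    """Random-walk Metropolis draws of theta for a single examinee."""
    rng = random.Random(seed)
    theta = 0.0
    current = log_posterior(theta, responses, difficulties)
    draws = []
    for _ in range(n_iter):
        proposal = theta + rng.gauss(0.0, step)
        candidate = log_posterior(proposal, responses, difficulties)
        # Accept with probability min(1, posterior ratio).
        if rng.random() < math.exp(min(0.0, candidate - current)):
            theta, current = proposal, candidate
        draws.append(theta)
    return draws[n_iter // 10:]  # discard the first 10% as burn-in


difficulties = [-1.0, -0.5, 0.0, 0.5, 1.0]  # hypothetical item difficulties
responses = [1, 1, 1, 0, 1]
draws = metropolis(responses, difficulties)
post_mean = sum(draws) / len(draws)
print(round(post_mean, 2))
```

Only the unnormalized posterior is needed, which is why the approach extends to models where the normalizing term has no closed form. In a one-parameter problem like this one, the result can be checked against a direct numerical integration.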

Taking Stock of Where We Have Been and Where We Are Now

Educational measurement began to crystallize and mature in the early 20th Century, a time when Bayesian inference was falling out of favor due to the critical stances taken by influential statisticians. Nevertheless, Bayesian inference did have its defenders and advocates in the statistics community, and important foundational work, such as de Finetti’s on exchangeability, was being done. From today’s vantage point, we can see that de Finetti’s thinking (eventually) revolutionized thinking about inference and about probability itself, and in a way that benefitted psychometrics, and educational measurement in particular. It is striking that the Bayesian model under exchangeability that permits treating the variables as conditionally independent given a parameter has the same form as many of the psychometric models that emerged in the 20th Century (Figure 13.1). Our view is that this is no mere happy accident. Pearl (1988) stressed the importance of specifying variables that induce conditional independence

relationships in statistical modeling and reasoning more broadly, as key components in the machinery of evidentiary reasoning. This view allows us to cast psychometric models as facilitating evidentiary reasoning in measurement (Mislevy, 1994, 2018). The models are machinery for organizing our thinking, and the person parameters (i.e., latent variables) and the resulting conditional independence they induce among observable variables are parts of that machinery. The models are set up with a particular “flow,” from parameters to observables. Bayes’ theorem is the part of the machinery that facilitates reasoning in the opposite direction of this “flow” within the same probabilistic framework—to use the older terminology, reasoning by way of “inverse probability.” By the middle of the 20th Century, theoretical debates (and no small amount of hostility) between frequentists and Bayesians were well-established in the general statistics community (McGrayne, 2011). Our view is that the measurement community neither participated in, nor was greatly influenced by these theoretical debates. Work in the field proceeded with frequentist methods, reflecting the dominance of those methods since the crystallization of educational measurement at the beginning of the century. It was when the field tackled practical problems for which the usual approaches were less satisfactory, and Bayesian concepts offered promise, that we begin to see Bayesian methods appear and be employed in educational measurement. The renewed interest in Bayesian approaches in the late 1960s and 1970s led to new ways of model-building and decision-making, through hierarchical specifications warranted by exchangeability and shaped by theory, experience, and knowledge about the situation of interest. 
These developments in Bayesian statistics were grounded in, or found immediate application in, educational measurement to solve a number of practical problems, notably in hierarchical models for prediction and decision making in criterionreference or adaptive testing scenarios. Then, in the late 1990s and early 2000s, the advent of MCMC enabled Bayesians to do all the things that frequentists could do, and then some. The marriage between hierarchically-based model-building strategies and MCMC estimation has been a productive one. Now that MCMC has made Bayesian analysis more feasible, we anticipate that the future of Bayesian methods in educational measurement and assessment will trade on the adoption of Bayesian principles in novel situations. Recapping the applications surveyed so far, we can see that Bayesian inference has been used to accomplish a number of different goals, with different features emphasized in these different use cases. Some of the ways that Bayesian inference has been used in assessment include as:   

• an updating mechanism (e.g., Owen, 1969; Reye, 2004);
• a tool to rein in or regularize estimates (e.g., Martin & McDonald, 1975; Mislevy, 1986);
• a hierarchical approach to model construction and inference that borrows strength for inferences about people and parameters (e.g., Hambleton & Novick, 1973; Kelley, 1923; Mislevy, 1986) and allows us to work at different levels of multi-leveled systems as in education (e.g., Mislevy et al., 1992);
• a way to express beliefs about people and situations probabilistically (e.g., Lazarsfeld & Henry, 1968; Mislevy, 1994);
• an integral part of decision making (e.g., Owen, 1969; Swaminathan et al., 1975);
• a way to manage uncertainty due to measurement error and other sources (e.g., Novick et al., 1971; Rubin, 1977); and
• a flexible approach to modeling that enables the analyst to integrate measurement models in larger models that capture additional features of the situation (e.g., Johnson & Jenkins, 2005) and better represent substantive beliefs through constraints and specifications, permitting models that may be unidentified or intractable from other perspectives (e.g., Muthén & Asparouhov, 2012).
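As a minimal illustrative sketch, not taken from any of the works cited, the first three roles in the list above can be seen together in the simplest normal-normal true score model. With a normal prior for the true score and normally distributed measurement error, the posterior mean is a reliability-weighted average of the observed score and the prior mean, which has the same form as Kelley's formula. The function name, the numeric values, and the assumption that both variances are known are hypothetical choices made only for illustration.

```python
def update_true_score(x, prior_mean, prior_var, error_var):
    """Normal-normal update for a true score after one observed score x.

    Assumes (for illustration only) a normal prior on the true score and
    normal measurement error with known variance.
    """
    # The weight on the observed score plays the role of reliability.
    weight = prior_var / (prior_var + error_var)
    # Posterior mean shrinks the observed score toward the prior mean.
    post_mean = weight * x + (1.0 - weight) * prior_mean
    # Posterior variance is smaller than both the prior and error variances.
    post_var = prior_var * error_var / (prior_var + error_var)
    return post_mean, post_var


# Hypothetical values: population mean 100, true score variance 180,
# error variance 45, so the weight on the observed score is 0.8.
mean_1, var_1 = update_true_score(130.0, prior_mean=100.0,
                                  prior_var=180.0, error_var=45.0)
print(mean_1, var_1)  # about 124 and 36

# Updating mechanism: the posterior becomes the prior for the next score.
mean_2, var_2 = update_true_score(120.0, prior_mean=mean_1,
                                  prior_var=var_1, error_var=45.0)
print(mean_2, var_2)  # about 122.2 and 20
```

The first call pulls an observed score of 130 back toward the population mean (regularization by borrowing strength from the group), and the second call shows the sequential updating that underlies adaptive testing procedures. Real applications replace the known variances with estimated or hierarchically modeled quantities.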

Looking Ahead to Where We Might Go

With apologies to Lindley (see Note 8), the variety of settings and uses of Bayesian inference prompts us to say that Bayesian inference is not just a branch of measurement modeling and inference; it is a way of looking at all of measurement modeling and inference. More generally, the rise of Bayesian inference reflects and facilitates a larger shift in thinking about assessment and what it is all about. We can more properly see assessment as an exercise in evidentiary reasoning, reflecting a particular approach to reasoning about problems in the real world, and Bayes’ theorem is machinery for conducting that reasoning as we update our beliefs about what is unknown in light of what becomes known. As a consequence, our view is that Bayesian inference is well-suited to tackle the challenges and opportunities that arise as assessment moves from familiar to more innovative settings. We note a few areas where work has already begun.

Whereas assessment activities have historically been separated from learning activities, stealth assessment seeks to integrate them, such that the assessment need not be noticed as a distinct activity (Shute, 2011). Along these lines, we can formally incorporate the effects of activities such as instruction that have hitherto been largely outside our measurement model (Arieli-Attali, Ward, Thomas, Deonovic, & von Davier, 2019). These and other related efforts echo the themes of modeling item families previously discussed, which trade on a synthesis of cognitive theory, task design, and response modeling with familiar psychometric models. Such an integration is furthered still when we consider new, richer forms of data and analytic methods to identify and synthesize evidence, which too can be carried out within a Bayesian perspective (Mislevy, 2018, Ch. 16).

An important component in assessment involves laying out evidence identification rules, which may be loosely thought of in terms of what aspects of
performance are going to be discerned, and what in those aspects would constitute better and worse performance, or, perhaps, what in those aspects holds qualitatively different evidence to shape feedback for learning. Historically, those judgments only had to deal with the limited data available and could be largely specified ahead of time, the classic example being correctness of response in selected-response tasks. With digital recording capabilities in ever-richer environments, we can conceivably capture every keystroke, click, or other action (e.g., eye movements). Elements of evidence, x, need not be restricted to predefined encapsulated item responses or ratings, but can extend to characterizations of evidentiary patterns in performances that are continuous, interactive, serially dependent, multimodal, or extended over time. Bayesian approaches to modeling performance in the absence of predetermined evidence identification rules, and the discovery of the rules themselves, have been developed for certain types of response data (Karabatsos & Batchelder, 2003) and could be further leveraged. Similarly, Bayesian approaches to marshalling and synthesizing the evidentiary impact of data arriving from varying contexts (a type of multiple measures) have been developed and could be further pursued (e.g., Bergner, Walker, & Ogan, 2017).

As a final example, historically most assessments have been designed to be administered individually. Assessments that allow for and/or focus on collaboration pose challenges in terms of teasing apart the role of the individual from the group or larger context. Here again, Bayesian approaches to organizing and processing the evidentiary information in data have proven useful in early efforts (e.g., Andrews et al., 2017). This last example is also reflective of a larger shift in the psychological theory assumed (often implicitly) to underlie assessment arguments.
This is a shift from conceiving of the examinee as an individual, seemingly in isolation, to recognizing and indeed seeking an understanding of the person’s behavior as situated in a context, where context is taken to mean not only the features of the situation in which they are acting, but also their history and their perceptions of the situation (Mislevy, 2018). Once again, Bayesian approaches may be useful in helping us structure our thinking and conduct our reasoning, recognizing the inter- and intra-personal variation and relevant contextual variables to be used for conditioning our inference.

Conclusion

In this chapter we have attempted to tell a history of Bayesian inference in educational measurement. In doing so we have highlighted developments and insights we view as key to the ways Bayesian inference has (and has not) been used in educational measurement. We have seen that educational measurement as a field has not only adopted Bayesian methods, but has motivated and served as a proving ground for them as well.


Our account is incomplete for two reasons. First, we recognize that we have almost certainly omitted work from our account that other scholars would deem relevant. Second, the history of Bayesian inference in educational measurement is still unfolding, as its capabilities and uses keep expanding, and the reciprocal relationship between Bayesian inference and educational measurement strengthens and expands. We look forward to what the future will bring, and what scholars will have to say about the next eras of Bayesian inference in educational measurement.

Notes

1 The authors wish to thank Daniel Bolt, Matthew Johnson, Sooyeon Kim, Sandip Sinharay, and the editors for their reviews and helpful comments on an earlier draft of this chapter.
2 A related account can be found in Sinharay (2006).
3 We say this dates “at least” to Bayes’ 1764 paper, because there is some debate about whether, or the extent to which, Bayes was the first to lay out the theorem that has come to bear his name (Fienberg, 2006).
4 Each of those models may be extended in various ways, accommodating multiple θs, or additional measurement parameters that govern the distribution of the xs given θ. The graph would then expand accordingly (Levy & Mislevy, 2016), but this basic structure is enough to draw the link between de Finetti’s exchangeability theorem and what goes on in many measurement models.
5 In other writing, Novick (Novick & Jackson, 1970; Novick, Jackson, & Thayer, 1971) credits Lindley with this development. In particular, Lindley (1969a) derived the posterior mean for a true score from a Bayesian analysis, but in that writing did not connect it with Kelley’s formula.
6 However, Bayesian estimates of θ are generally biased. This problem is largely rectified in the weighted likelihood estimator proposed by Warm (1989). Though derived from a different perspective, this estimator is equivalent or nearly so to the MAP estimate under Jeffreys’ (1946) “transformation invariant” vague prior distribution for commonly used IRT models (Magis & Raîche, 2012).
7 The inclusion of measurement model parameters and parameters that govern the distributions of the examinee and measurement model parameters in a Bayesian model may be seen as extending the basic structure of de Finetti’s exchangeability theorem (Levy & Mislevy, 2016).
8 McGrayne (2011, p. 107) quoted Lindley as saying “Bayesian statistics is not a branch of statistics. It is a way of looking at the whole of statistics.”

References

Albert, J. H. (1992). Bayesian estimation of normal ogive item response curves using Gibbs sampling. Journal of Educational and Behavioral Statistics, 17, 251–269. Albert, J. H., & Chib, S. (1993). Bayesian analysis of binary and polychotomous response data. Journal of the American Statistical Association, 88, 669. Alkhatlan, A., & Kalita, J. (2019). Intelligent tutoring systems: A comprehensive historical survey with recent developments. International Journal of Computer Applications, 181, 1–20. Almond, R. G., Mislevy, R. J., Steinberg, L. S., Yan, D., & Williamson, D. M. (2015). Bayesian networks in educational assessment. New York: Springer.


Andrews, J. J., Kerr, D., Mislevy, R. J., von Davier, A., Hao, J., & Liu, L. (2017). Modeling collaborative interaction patterns in a simulation-based task. Journal of Educational Measurement, 54, 54–69. Arieli-Attali, M., Ward, S., Thomas, J., Deonovic, B., & von Davier, A. A. (2019). The expanded evidence-centered design (e-ECD) for learning and assessment systems: A framework for incorporating learning goals and processes within assessment design. Frontiers in Psychology, 10, 853. Arminger, G., & Muthén, B. O. (1998). A Bayesian approach to nonlinear latent variable models using the Gibbs sampler and the Metropolis-Hastings algorithm. Psychometrika, 63, 271–300. Bartholomew, D. J. (1981). Posterior analysis of the factor model. British Journal of Mathematical and Statistical Psychology, 34, 93–99. Bayes, T. (1764). An essay towards solving a problem in the doctrine of chances. Philosophical Transactions of the Royal Society of London, 53, 370–418. Bechtel, G. G. (1966). Classical and Bayesian inference in certain latency processes. Psychometrika, 31, 491–504. Bergner, Y., Walker, E., & Ogan, A. (2017). Dynamic Bayesian network models for peer tutoring interactions. In A. A. von Davier, M. Zhu, & P. C. Kyllonen (Eds.), Innovative assessment of collaboration (pp. 249–268). Cham: Springer. Bernardo, J. M., & Smith, A. F. M. (2000). Bayesian theory. Chichester, UK: Wiley. Birnbaum, A. (1969). Statistical theory for logistic mental test models with a prior distribution of ability. Journal of Mathematical Psychology, 6, 258–276. Birnbaum, A., & Maxwell, A. E. (1960). Classification procedures based on Bayes’s formula. Applied Statistics, 9, 152–169. Reprinted in L. J. Cronbach and G. Gleser (Eds.), Psychological tests and personnel decisions (2nd ed.) (pp. 234–253). Urbana, IL: University of Illinois Press, 1965. Bock, R. D., & Aitkin, M. (1981).
Marginal maximum likelihood estimation of item parameters: Application of an EM algorithm. Psychometrika, 46, 443–459. Box, G. E. P., & Tiao, G. C. (1973). Bayesian inference in statistical analysis. Reading, MA: Addison-Wesley. Calandra, A. (1941). Scoring formulas and probability considerations. Psychometrika, 6, 1–9. Corbett, A. T., & Anderson, J. R. (1995). Knowledge tracing: Modeling the acquisition of procedural knowledge. User Modeling and User-Adapted Interaction, 4, 253–278. Cronbach, L. J., & Gleser, G. C. (1965). Psychological tests and personnel decisions (2nd ed.). Urbana, IL: University of Illinois Press. Culbertson, M. J. (2016). Bayesian networks in educational assessment: The state of the field. Applied Psychological Measurement, 40, 13–21. de Finetti, B. (1931). Funzione caratteristica di un fenomeno aleatorio. Atti Della R. Accademia Nazionale Dei Lincei, Serie 6. Memorie, Classe Di Scienze Fisiche, Mathematice e Naturale, 4, 251–299. de Finetti, B. (1937). La prévision: Ses lois logiques, ses sources subjectives. In Annales de l’Institut Henri Poincaré 7 (pp. 1–68). Translated by Kyburg and Smokler (Eds.) (1964). Studies in subjective probability (pp. 93–158). New York: Wiley. de Finetti, B. (1974). Theory of probability (Vol. 1). New York: Wiley. Edwards, W., Lindman, H., & Savage, L. J. (1963). Bayesian statistical inference for psychological research. Psychological Review, 70, 193–242. Embretson, S. (1984). A general latent trait model for response processes. Psychometrika, 49 (2), 175–186.


Embretson, S. E. (Ed.) (2013). Test design: Developments in psychology and psychometrics. Orlando, FL: Academic Press. Fienberg, S. E. (2006). When did Bayesian inference become “Bayesian”? Bayesian Analysis, 1, 1–40. Fisher, R. A. (1925). Statistical methods for research workers. Edinburgh, UK: Oliver and Boyd. Fox, J.-P. (2010). Bayesian item response modeling: Theory and applications. New York: Springer. Geerlings, H., Glas, C. A., & van der Linden, W. J. (2011). Modeling rule-based item generation. Psychometrika, 76, 337–359. Gelfand, A. E., & Smith, A. F. (1990). Sampling-based approaches to calculating marginal densities. Journal of the American Statistical Association, 85, 398–409. Gelman, A., Carlin, J. B., Stern, H. S., Dunson, D. B., Vehtari, A., & Rubin, D. B. (2013). Bayesian data analysis (3rd ed.). Boca Raton, FL: Chapman and Hall/CRC. Geweke, J., & Zhou, G. (1996). Measuring the pricing error of the arbitrage pricing theory. The Review of Financial Studies, 9, 557–587. Gierl, M. J., & Lai, H. (2012). The role of item models in automatic item generation. International Journal of Testing, 12, 273–298. Glas, C. A.W., & van der Linden, W. J. (2001). Modeling variability in item parameters in item response models. Research Report 01-11. Enschede: University of Twente. Good, I. J. (1950). Probability and the weighing of evidence. London: Charles Griffin. Gross, A. L. (1973). Prediction in future samples studied in terms of the gain from selection. Psychometrika, 38, 151–172. Gulliksen, H. (1961). Measurement of learning and mental abilities. Psychometrika, 26, 93– 107. Hambleton, R. K., & Novick, M. R. (1973). Toward an integration of theory and method for criterion-referenced tests. Journal of Educational Measurement, 10, 159–170. Herman, L. M., & Dollinger, M. B. (1966). Predicting effectiveness of Bayesian classification systems. Psychometrika, 31, 341–349. Hoijtink, H. (1998). 
Constrained latent class analysis using the Gibbs sampler and posterior predictive p-values: Applications to educational testing. Statistica Sinica, 8, 691–711. Hoijtink, H., & Molenaar, I. W. (1997). A multidimensional item response model: Constrained latent class analysis using the Gibbs sampler and posterior predictive checks. Psychometrika, 62, 171–189. Huynh, H. (1976). On the reliability of decisions in domain-referenced testing. Journal of Educational Measurement, 13, 253–264. Jackman, S. (2009). Bayesian analysis for the social sciences. Chichester, UK: Wiley. Jackson, P. H. (1972). Simple approximations in the estimation of many parameters. British Journal of Mathematical & Statistical Psychology, 25, 213–228. Jackson, P. H., Novick, M. R., & Thayer, D. T. (1971). Estimating regressions in m groups. British Journal of Mathematical and Statistical Psychology, 24, 129–153. Jansen, M. G. H. (1986). A Bayesian version of Rasch’s multiplicative Poisson model for the number of errors of an achievement test. Journal of Educational Statistics, 11, 147–160. Jeffreys, H. (1939). Theory of probability. Oxford, UK: Clarendon Press. Jeffreys, H. (1946). An invariant form for the prior probability in estimation problems. Proceedings of the Royal Society of London. Series A. Mathematical and Physical Sciences, 186, 453–461. Jensema, C. J. (1974). The validity of Bayesian tailored testing. Educational and Psychological Measurement, 34, 757–766.


Jeon, M., Draney, K., & Wilson, M. (2015). A general saltus LLTM-R for cognitive assessments. In R. E. Millsap, D. M. Bolt, L. A. van der Ark, & W.-C. Wang (Eds.), Quantitative Psychology Research (pp. 73–90). Cham: Springer. Johnson, M. S., & Jenkins, F. (2005). A Bayesian hierarchical model for large-scale educational surveys: An application to the National Assessment of Educational Progress (ETS Research Report RR-04-38). Princeton, NJ: ETS. Johnson, M. S., & Sinharay, S. (2005). Calibration of polytomous item families using Bayesian hierarchical modeling. Applied Psychological Measurement, 29, 369–400. Kaplan, D. (2014). Bayesian statistics for the social sciences. New York: The Guilford Press. Karabatsos, G., & Batchelder, W. H. (2003). Markov chain estimation for test theory without an answer key. Psychometrika, 68, 373–389. Kaufman, G. M., & Press, S. J. (1973). Bayesian factor analysis (Report No. 7322). Chicago, IL: Center for Mathematical Studies in Business and Economics, University of Chicago. Kelley, T. L. (1923). Statistical method. New York: Macmillan. Laplace, P. S. (1774/1986). Mémoire sur la probabilité des causes par les évènements. Mémoires de Mathématique et de Physique Présentés à l’Académie Royale Des Sciences, Pars Divers Savans, & Lûs Dans Ses Assemblies, 6, 621–656. (Reprinted in Laplace’s Oeuvres Complètes 827–65). Translated in Stigler, S. M. (1986). Laplace’s 1774 memoir on inverse probability. Statistical Science, 1(3), 359–378. Laplace, P. S. (1812). Théorie analytique des probabilités. Paris: Courcier. Lazarsfeld, P. F. (1950). The interpretation and computation of some latent structures. In S. A. Stouffer, L. Guttman, E. A. Suchman, P. F. Lazarsfeld, S. A. Star, & J. A. Clausen (Eds.), Studies in social psychology in World War II. Vol IV. Measurement and prediction (pp. 413–472). Princeton, NJ: Princeton University Press. Lazarsfeld, P. F., & Henry, N. W. (1968). Latent structure analysis. Boston, MA: Houghton Mifflin. Lee, S.-Y. 
(1981). A Bayesian approach to confirmatory factor analysis. Psychometrika, 46, 153–160. Lee, S.-Y. (2007). Structural equation modeling: A Bayesian approach. Chichester, UK: Wiley. Levy, R. (2009). The rise of Markov chain Monte Carlo estimation for psychometric modeling. Journal of Probability and Statistics, 2009, 1–18. Levy, R., & Mislevy, R. J. (2016). Bayesian psychometric modeling. Boca Raton, FL: Chapman and Hall/CRC. Lindley, D. V. (1969a). A Bayesian estimate of true score that incorporates prior information (Research Bulletin No. 69-75). Princeton, NJ: Educational Testing Service. Lindley, D. V. (1969b). A Bayesian solution for some educational prediction problems (Research Bulletin No. 69-57). Princeton, NJ: Educational Testing Service. Lindley, D. V. (1969c). A Bayesian solution for some educational prediction problems, II (Research Bulletin No. 69-91). Princeton, NJ: Educational Testing Service. Lindley, D. V. (1970). A Bayesian solution for some educational prediction problems, III (Research Bulletin No. 70-33). Princeton, NJ: Educational Testing Service. Lindley, D. V., & Smith, A. F. M. (1972). Bayes estimates for the linear model. Journal of the Royal Statistical Society. Series B, 34, 1–41. Lord, F. M. (1986). Maximum likelihood and Bayesian parameter estimation in item response theory. Journal of Educational Measurement, 23, 157–162. Lunn, D., Spiegelhalter, D., Thomas, A., & Best, N. (2009). The BUGS project: Evolution, critique and future directions. Statistics in Medicine, 28, 3049–3067. Magis, D., & Raîche, G. (2012). On the relationships between Jeffreys modal and weighted likelihood estimation of ability under logistic irt models. Psychometrika, 77, 163–169.


Martin, J. K., & McDonald, R. P. (1975). Bayesian estimation in unrestricted factor analysis: A treatment for Heywood cases. Psychometrika, 40, 505–517. McGrayne, S. B. (2011). The theory that would not die: How Bayes’ rule cracked the enigma code, hunted down Russian submarines, and emerged triumphant from two centuries of controversy. New Haven, CT: Yale University Press. Meyer, D. L. (1964). A Bayesian school superintendent. American Educational Research Journal, 1, 219–228. Meyer, D. L. (1966). Chapter II: Bayesian Statistics. Review of Educational Research, 36, 503–516. Mislevy, R. J. (1986). Bayes modal estimation in item response models. Psychometrika, 51, 177–195. Mislevy, R. J. (1994). Evidence and inference in educational assessment. Psychometrika, 59, 439–483. Mislevy, R. J. (2018). Sociocognitive foundations of educational measurement. New York: Routledge. Mislevy, R. J., Beaton, A. E., Kaplan, B., & Sheehan, K. M. (1992). Estimating population characteristics from sparse matrix samples of item responses. Journal of Educational Measurement, 29, 133–161. Muthén, B., & Asparouhov, T. (2012). Bayesian structural equation modeling: A more flexible representation of substantive theory. Psychological Methods, 17, 313–335. Novick, M. R. (1969). Multiparameter Bayesian indifference procedures. Journal of the Royal Statistical Society. Series B, 31, 29–64. Novick, M. R. (1970). Bayesian considerations in educational information systems (ACT Research Report No. 38). The American College Testing Program. Novick, M. R., & Jackson, P. H. (1970). Bayesian guidance technology. Review of Educational Research, 40, 459–494. Novick, M. R., & Jackson, P.H. (1974). Statistical methods for educational and psychological research. New York: McGraw-Hill. Novick, M. R., Jackson, P. H., & Thayer, D. T. (1971). Bayesian inference and the classical test theory model: Reliability and true scores. Psychometrika, 36, 261–288. Novick, M. R., & Lewis, C. (1974). 
Prescribing test length for criterion-referenced measurement. In C. W. Harris, M. C. Alkin, & W. J. Popham (Eds.), Problems in criterion-referenced measurement (pp. 139–158). Los Angeles, CA: Center for the Study of Evaluation, University of California, Los Angeles. Novick, M. R., Lewis, C., & Jackson, P. H. (1973). The estimation of proportions in m groups. Psychometrika, 38(1), 19–46. Oranje, A., & Kolstad, A. (2019). Research on psychometric modeling, analysis, and reporting of the National Assessment of Educational Progress. Journal of Educational and Behavioral Statistics, 44, 648–670. Owen, R. J. (1969). Tailored testing (Research Bulletin No. 69-92). Princeton, NJ: Educational Testing Service. Owen, R. J. (1975). A Bayesian sequential procedure for quantal response in the context of adaptive mental testing. Journal of the American Statistical Association, 70, 351–356. Patz, R. J., & Junker, B. W. (1999a). A straightforward approach to Markov chain Monte Carlo methods for item response models. Journal of Educational and Behavioral Statistics, 24, 146–178. Patz, R. J., & Junker, B. W. (1999b). Applications and extensions of MCMC in IRT: Multiple item types, missing data, and rated responses. Journal of Educational and Behavioral Statistics, 24, 342–366.


Pearl, J. (1988). Probabilistic reasoning in intelligent systems: Networks of plausible inference. San Francisco, CA: Morgan Kaufmann. Pearl, J. (2018). A personal journey into Bayesian networks (Technical Report No. R-476). Los Angeles, CA: UCLA Cognitive Systems Laboratory. Pearson, K. (1941). The laws of chance, in relation to thought and conduct. Biometrika, 32, 89–100. Raiffa, H. (1961). Statistical decision theory approach to item selection for dichotomous test and criterion variables. In H. Solomon (Ed.), Studies in item analysis and prediction (pp. 221–232). Stanford, CA: Stanford University Press. Raiffa, H., & Schlaifer, R. (1961). Applied statistical decision theory. Boston, MA: Harvard Business School. Reichenberg, R. (2018). Dynamic Bayesian networks in educational measurement: Reviewing and advancing the state of the field. Applied Measurement in Education, 31, 335–350. Reye, J. (2004). Student modelling based on belief networks. International Journal of Artificial Intelligence in Education, 14, 63–96. Robbins, H. (1956). An empirical Bayes approach to statistics. Proc. Third Berkeley Symp. Math. Statist. Probab., 1, 157–163. Berkeley, CA: University of California Press. Robbins, H. (1960). A statistical screening problem. In I. Olkin, S. G. Ghurye, W. Hoeffding, W. G. Madow, & H. B. Mann (Eds.), Contributions to probability and statistics: Essays in honor of Harold Hotelling (pp. 352–357). Stanford, CA: Stanford University Press. Robert, C., & Casella, G. (2011). A short history of Markov chain Monte Carlo: Subjective recollections from incomplete data. Statistical Science, 26, 102–115. Rubin, D. B. (1976). Inference and missing data. Biometrika, 63, 581–592. Rubin, D. B. (1977). Formalizing subjective notions about the effect of nonrespondents in sample surveys. Journal of the American Statistical Association, 72, 538–543. Rubin, D. B. (1978). Multiple imputations in sample surveys: a phenomenological Bayesian approach to nonresponse. 
Proceedings of the Survey Research Methods Section of the American Statistical Association, 1, 20–34. Rubin, D. B. (1981). Estimation in parallel randomized experiments. Journal of Educational Statistics, 6, 377–401. Rupp, A. A., Templin, J., & Henson, R. A. (2010). Diagnostic measurement: Theory, methods, and applications. New York: The Guilford Press. Samejima, F. (1969). Estimating of latent ability using a response pattern of graded scores. Psychometrika Monograph Supplement, No. 17. Savage, L. J. (1954). The foundations of statistics. New York: Dover Publications. Scheines, R., Hoijtink, H., & Boomsma, A. (1999). Bayesian estimation and testing of structural equation models. Psychometrika, 64, 37–52. Shi, J.-Q., & Lee, S.-Y. (1998). Bayesian sampling-based approach for factor analysis models with continuous and polytomous data. British Journal of Mathematical and Statistical Psychology, 51, 233–252. Shute, V. J. (2011). Stealth assessment in computer-based games to support learning. In S. Tobias & J. D. Fletcher (Eds.), Computer games and instruction (pp. 503–524). Charlotte, NC: Information Age Publishing. Sinharay, S. (2006). Bayesian statistics in educational measurement. In S. K. Upadhyay, U. Singh, & D. K. Dey (Eds.), Bayesian statistics and its applications (pp. 422–437). New Delhi: Anamaya Publishers.


Sinharay, S. (2016). Bayesian model fit and model comparison. In W. J. van der Linden (Ed.), Handbook of Item Response Theory, Volume 2 (pp. 379–394). Boca Raton, FL: Chapman and Hall/CRC. Stigler, S. M. (1986). The history of statistics: The measurement of uncertainty before 1900. Cambridge, MA: Belknap Press. Swaminathan, H., Hambleton, R. K., & Algina, J. (1975). A Bayesian decision-theoretic procedure for use with criterion-referenced tests. Journal of Educational Measurement, 12, 87–98. Swaminathan, H., & Gifford, J. A. (1982). Bayesian estimation in the Rasch model. Journal of Educational Statistics, 7, 175–191. Swaminathan, H., & Gifford, J. A. (1985). Bayesian estimation in the two-parameter logistic model. Psychometrika, 50, 349–364. Swaminathan, H., & Gifford, J. A. (1986). Bayesian estimation in the three-parameter logistic model. Psychometrika, 51, 589–601. Thurstone, L. L. (1935). The vectors of mind: Multiple-factor analysis for the isolation of primary traits. Chicago, IL: University of Chicago Press. Tsutakawa, R. K., & Lin, H. Y. (1986). Bayesian estimation of item response curves. Psychometrika, 51, 251–267. Turčhin, V. F. (1971). On the computation of multidimensional integrals by the Monte-Carlo method. Theory of Probability & Its Applications, 16, 720–724. Urry, V. W. (1977). Tailored testing: A successful application of latent trait theory. Journal of Educational Measurement, 14, 181–196. Villano, M. (1992). Probabilistic student models: Bayesian Belief Networks and Knowledge Space Theory. In C. Frasson, G. Gauthier, & G. I. McCalla (Eds.), Intelligent Tutoring Systems (Vol. 608, pp. 491–498). Berlin Heidelberg: Springer. Warm, T. A. (1989). Weighted likelihood estimation of ability in item response theory. Psychometrika, 54, 427–450. Weiss, D. J., & McBride, J. R. (1984). Bias and information of Bayesian adaptive testing. Applied Psychological Measurement, 8, 273–285. Wolfe, J. H. (1981).
Optimal item difficulty for the three-parameter normal ogive response model. Psychometrika, 46, 461–464.

14 HISTORY OF TEST EQUATING METHODS AND PRACTICES THROUGH 1985
Michael J. Kolen1

In this chapter, I focus on the historical development of test equating concepts, methods, and practices. In addition to addressing data collection designs and statistical methods for equating, I consider, from an historical perspective, how test equating is associated with the processes of test development and test use. I limit this treatment to work through 1985, which seems fitting because it was the year in which the Test Standards (AERA, APA, & NCME, 1985) first provided standards for test equating. In this chapter, I use the terms “equating” and “test equating” interchangeably. Although the focus is through 1985, in a few instances I chose to reference and discuss work completed after 1985 that provided terminology that clarifies equating data collection designs and terminology used previously. In addition, I cite later references that follow from work that was conducted in 1985 or before.

According to Kolen and Brennan (2014), “Equating is a statistical process that is used to adjust scores on test forms so that scores on the forms can be used interchangeably. Equating adjusts for differences in difficulty among forms that are built to be similar in difficulty and content” (p. 2). This definition of equating explicitly states the following: equating requires a test development process that produces test forms that are similar in difficulty and content. This definition also states that the intent of equating is to use test development procedures in combination with statistical methods to produce scores on test forms that can be used interchangeably. Equating requires that test developers and psychometricians work together to produce test forms that are similar to one another in content and statistical
characteristics and to apply statistical methods that lead to score interchangeability. When implementing equating, the test forms are built to a common set of content and statistical specifications to support the interchangeability of scores across the forms. Situations in which equating is not possible by this definition include statistically relating scores on two tests that 1) are not built to a common set of specifications (e.g., two different college admissions tests), 2) are intended to differ in difficulty (e.g., grade level achievement tests that are intended to be administered in two different grades), or 3) are intended to differ in reliability (e.g., two tests that differ substantially in length). See Holland (2007) and Holland and Dorans (2006) for treatments of the history of such linking methods.

Terminology used in treatments of equating and other linking methods has evolved over time. For example, Flanagan (1939, 1951) appears to have used the term “equating” to refer to score adjustment for any type of linking. So, for example, he might have referred to the process of “equating” scores on intelligence tests from different publishers. Still, he was very careful to distinguish what we now refer to as equating from other linking processes. He did this by clearly indicating that when scores are to be used interchangeably, he was conducting a score transformation for alternate forms that were built to the same content and statistical specifications. Thus, it was clear from his presentation when he was discussing what we now refer to as equating. In the present chapter, I use current terminology.

I have organized this chapter, for the most part, chronologically. I begin with a consideration of some forerunners of equating, and follow with a discussion of Flanagan’s (1939) work that is an early example that considers most of the major conceptual issues associated with equating. I then move on to the first major general treatment of equating by Flanagan (1951), followed by Lord’s (1950) statistical treatment of equating and the second major general treatment by Angoff (1971). I then provide a personal account of smoothed equipercentile equating, a consideration of item response theory (IRT) equating, and a discussion of the importance of the Holland and Rubin (1982) Test Equating edited book. I conclude with a discussion of the 1985 Test Standards (AERA, APA, & NCME, 1985) and a brief discussion of some important post-1985 publications on equating.

Forerunners of Equating

In an early application in the use of alternate test forms, Yoakum and Yerkes (1920) described the development of the Army Alpha group-administered intelligence test for use in military selection during World War I. The Army Alpha consisted of ten tests, each of which was short, homogeneous, and could be administered in 5 minutes or less. Yoakum and Yerkes (1920) described how test developers constructed alternate forms of the test “in order to prevent coaching and cheating” (p. 4). Test

320 Michael J. Kolen

developers began with enough items for 10 alternate forms of each of the tests and randomly assigned items to the 10 alternate forms to produce test forms that were of “approximately equal difficulty” (p. 4). The test forms were administered in what they referred to as the “official army trial in the fall of 1917” (p. 7). Five alternate forms were judged to be satisfactory measures of intelligence (p. 7). After the forms were administered and the results analyzed, the test developers concluded that “the differences in forms were so slight as to indicate the success of the random method of selecting items” (p. 8), although one form was more difficult than the other forms.

The developers of the Army Alpha did not use statistical equating adjustments. Instead, they used random assignment of items to alternate forms within tests to develop forms that were sufficiently similar in difficulty. In this way they used the scores from the different forms as if they were interchangeable.

In the years surrounding the development of the Army Alpha, investigators studied the linking of scores on different tests in areas including handwriting (e.g., Kelley, 1914), spelling (e.g., Otis, 1916), and intelligence (e.g., Otis, 1922; Thorndike, 1922). Because these investigators were linking scores on different tests rather than scores on alternate forms of the same test, the process being followed is not equating and the resulting scores on the tests could not be used interchangeably. Still, some of the statistical methods used in these investigations were later adapted to equating contexts. For example, Kelley (1914) introduced the linear linking function, which treats as equivalent the scores on two tests that lie the same number of standard deviations from their respective means. Define x as a score on Test X and y as a score on Test Y.
For a population of examinees, define $\mu(X)$ as the mean on Test X, $\mu(Y)$ as the mean on Test Y, $\sigma(X)$ as the standard deviation on Test X, and $\sigma(Y)$ as the standard deviation on Test Y. The following standardized deviation scores are set equal as defined by Kelley (1914):

\[ \frac{x - \mu(X)}{\sigma(X)} = \frac{y - \mu(Y)}{\sigma(Y)}. \tag{1} \]

Solving for y, it can be shown that the linear function for linking scores on Test X to the scale of Test Y, $l_Y(x)$, in slope-intercept form is

\[ l_Y(x) = y = \frac{\sigma(Y)}{\sigma(X)}\, x + \left[ \mu(Y) - \frac{\sigma(Y)}{\sigma(X)}\, \mu(X) \right]. \tag{2} \]

Otis (1922) contrasted this linear linking function with the linear regression function,

\[ \hat{y} = \rho(X, Y)\, \frac{\sigma(Y)}{\sigma(X)}\, x + \left[ \mu(Y) - \rho(X, Y)\, \frac{\sigma(Y)}{\sigma(X)}\, \mu(X) \right], \tag{3} \]

where $\rho(X, Y)$ is the correlation between scores on Test X and Test Y. The difference between these two functions is that the slope of the linear regression function is equal to the slope of the linear linking function multiplied by the correlation. Otis (1922) also pointed out that the linear regression equation is not symmetric. Thorndike (1922) made a similar point, although he focused on non-linear regression methods. Both Otis (1922) and Thorndike (1922) concluded that, due to the symmetry property, the linear linking function was more appropriate for linking scores on different tests for the situations in which they were interested.

If the terms in the linear linking function are redefined to be from scores on alternate forms of a test, the resulting function is the linear equating function, which is one of the central equating functions (e.g., see Kolen & Brennan, 2014, p. 31). In addition, concerns outlined by Thorndike (1922) and Otis (1922) led to the notion that equating functions should be symmetric. Thus, a basic property of equating functions is that they are not regression functions, which are, by definition, not symmetric (see Kolen & Brennan, 2014, p. 9).

In addition to providing further details regarding the use of linear linking functions, Kelley (1923, pp. 109–122) described what he referred to as “the equivalence of successive percentiles method” for test linking. Kelley (1923) even discussed the use of smoothing. When this method is applied to scores from alternate forms of tests, it can be viewed as a precursor to equipercentile equating as described by Kolen and Brennan (2014, p. 36).
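
The symmetry contrast between Equations 2 and 3 can be illustrated with a short Python sketch. The summary statistics and function names below are hypothetical and chosen only for illustration; they do not come from Kelley, Otis, or Thorndike. The sketch converts a score from Test X to Test Y and back again with each function.

```python
def linear_linking(mu_from, sd_from, mu_to, sd_to):
    """Slope and intercept of the linear linking function (Equation 2)."""
    slope = sd_to / sd_from
    return slope, mu_to - slope * mu_from


def linear_regression(mu_from, sd_from, mu_to, sd_to, rho):
    """Slope and intercept of the linear regression function (Equation 3)."""
    slope = rho * sd_to / sd_from
    return slope, mu_to - slope * mu_from


def evaluate(line, score):
    """Apply a (slope, intercept) pair to a score."""
    slope, intercept = line
    return slope * score + intercept


# Hypothetical population values for Test X and Test Y.
mu_x, sd_x, mu_y, sd_y, rho = 50.0, 10.0, 60.0, 8.0, 0.8
x = 65.0

# Linking: X -> Y, then Y -> X with the roles of the tests exchanged.
y = evaluate(linear_linking(mu_x, sd_x, mu_y, sd_y), x)
x_back = evaluate(linear_linking(mu_y, sd_y, mu_x, sd_x), y)

# Regression: predict Y from X, then predict X from that predicted Y.
y_hat = evaluate(linear_regression(mu_x, sd_x, mu_y, sd_y, rho), x)
x_hat_back = evaluate(linear_regression(mu_y, sd_y, mu_x, sd_x, rho), y_hat)

print(x, y, x_back)          # the linking round trip returns the starting score
print(x, y_hat, x_hat_back)  # the regression round trip does not
```

With these illustrative values the linking function returns the starting score after the round trip, whereas the two regression functions do not undo each other unless the correlation is 1.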

An Early Equating Example

Flanagan (1939) provided a general discussion of equating as part of technical documentation for the Cooperative Achievement Tests. He discussed conceptual issues as well as issues associated with test development, data collection designs, and statistical equating methods.

Conceptual Issues

Flanagan (1939) stated, “a fundamental necessity in any test of which more than one form is published is comparability between the scores obtained from several forms” (p. 10). He indicated that “all statements of comparability necessarily refer to a particular group [of examinees]” (p. 10). Among a variety of linking situations, at times he focused on the situation where scores on alternate test forms could be used interchangeably, which is now referred to as test equating. He also stressed that linking functions, including equating functions, are group dependent.

Test Development

Flanagan (1939) emphasized the importance of test development procedures to equating by stating the following:


    As a practical matter comparability is not only desirable but can be approximated quite closely if proper methods are used in the construction and equating of the examinations. Such methods of construction include a detailed analysis of the content of the various forms with respect to various types of classification, such as topics included, operations required in responding, and the amount of time necessary for reading and answering the item. (p. 10)

Thus, Flanagan (1939) emphasized the importance of systematic test development procedures and test specifications in the development of alternate test forms. Flanagan (1939) went on to describe what he referred to as “parallel construction of comparable forms” (p. 11). This method was implemented by administering a large number of items in a preliminary form of the test and selecting groups of items that were similar in difficulty to construct alternate forms. In this way, the raw scores on the forms would be “at least approximately comparable” (Flanagan, 1939, p. 11).

Data Collection Designs

Flanagan (1939) discussed the use of data collection designs in equating scores on alternate forms. He considered a design in which examinees were administered the two forms to be equated with the order counterbalanced. In current terminology, this design is referred to as the single group design with counterbalancing (Kolen & Brennan, 2014, p. 14). He discussed the use of matching to assign pseudo-equivalent groups of examinees to test forms, although he provided little information on how the matching was accomplished. In addition, he discussed randomly assigning examinees to test forms. In current terminology, this design is referred to as the random groups design (Kolen & Brennan, 2014, p. 13). Finally, he discussed the use of regression methods to predict scores on an external variable from scores on each of the two forms. Such a process could be viewed as a precursor to the common-item nonequivalent groups design (CINEG; Kolen & Brennan, 2014, p. 18). This latter design is also referred to in the literature as the nonequivalent anchor test design (NEAT; Holland & Dorans, 2006).

Statistical Equating Methods

Flanagan (1939, p. 11) also briefly discussed statistical methods for conducting equating. One method involved adjusting scores so that the means of the two forms were equal, which in current terminology is referred to as mean equating (Kolen & Brennan, 2014, p. 30). Flanagan (1939, p. 12) discussed what he referred to as the standard score method, which in current terminology is referred to as linear equating (Kolen & Brennan, 2014, p. 31). Flanagan (1939, p. 12)


followed discussions by Otis (1922) and Thorndike (1922) that were presented earlier in the present chapter regarding why regression functions, due to their lack of symmetry, are not appropriate as equating functions. And he provided a description of what in current terminology is referred to as equipercentile equating (Kolen & Brennan, 2014, p. 36).

Summary

Flanagan (1939) considered many of the concepts in equating that are viewed as important today. These concepts included the purposes for equating, designs for data collection, and the statistical methods for conducting equating. Interestingly, his presentation did not include any equations, but instead focused on concepts.

Flanagan (1951): The First Major General Treatment of Equating

Flanagan (1951) wrote the first major general treatment of equating as a chapter in the first edition of Educational Measurement that was edited by E. F. Lindquist (1951). Flanagan (1951) was an expansion and updating of Flanagan (1939). In addition to equating, which was only a small portion of the chapter, he discussed what, in current terminology, would be referred to as test scores, score scales, and linking. In this section, I focus on equating considerations that were new to Flanagan (1951).

Conceptual Issues

Referring to linking scores on tests, Flanagan (1951) went beyond Flanagan (1939) and emphasized that “comparability which would hold for all types of groups [of examinees] … is strictly and logically impossible” (p. 748). He stated that “as a practical matter, comparability of scores from forms of the same test can be approximated quite closely if proper methods are used in the construction and equating of the examinations” (p. 748). This discussion of comparability and group dependence is in line with recent discussions of population invariance (e.g., Kolen & Brennan, 2014, p. 12).

Test Development

In addition to Flanagan’s (1951) chapter, two additional chapters in the Lindquist (1951) volume provided a discussion of development of alternate forms. Vaughn (1951), in his chapter Planning the Objective Test, stated that alternate forms “intended to be comparable should be based on the same outline of content. The more detailed the breakdown of subject matter in the outline, the more it facilitates construction of comparable forms” (p. 183). Thorndike (1951), in the chapter Reliability, stated the following:


    The best guarantee of equivalence for two test forms would seem to be that a complete and detailed set of specifications for the test be prepared in advance of any final test construction. The set of specifications should indicate item types, difficulty level of items, procedures and standards for item selection and refinement, and distribution of items with regard to the content to be covered, specified in as much detail as seems feasible. If each test form is then built to conform to the outline, while at the same time care is taken to avoid identity or detailed overlapping of content, the two resulting test forms should be truly equivalent. That is, each test must be built to the specifications, but within the limits set by complete specifications each test should present a random sampling of items. In terms of the practical operations of test construction, it will often be efficient to assemble two equivalent test forms from a single pool of items which have been given preliminary tryout. Within the total test if the test is homogeneous in the character of its content, or within parallel homogenous sections of a heterogeneous test, items from the pool should be assigned to the two forms in such a way as to give the same distribution of item difficulties and the same distribution of item-test correlations in each form. (pp. 575–576)

Taken together, Flanagan (1951), Thorndike (1951), and Vaughn (1951) provided a description of the concepts involved in constructing alternate forms that are fundamental components of current test development procedures for constructing alternate forms. The major issue discussed is the necessity of having detailed content specifications, item tryouts, and detailed statistical specifications that are used to drive the development of alternate forms of tests.

Data Collection Designs and Statistical Equating Methods

Flanagan (1951, p. 750) considered the same data collection designs that were described by Flanagan (1939). Flanagan (1951, p. 750) also considered the same statistical methods that were described in Flanagan (1939). In addition, Flanagan (1951, pp. 752–758) provided an in-depth description of equipercentile equating (he referred to this method as the equipercentile method). He stated that “it appears that the most satisfactory method of obtaining ‘comparable’ scores for the various forms of a given test …” (p. 752) is the equipercentile method. Flanagan (1951) discussed the linking of scores by relating the estimated distributions of true scores on two tests using equipercentile methods. (See Kolen & Brennan, 2014, for a discussion of equating of true scores.) He further stated that when “scores on forms … have similar reliability coefficients … obtained scores can be used” (p. 753). That is, he indicated that observed scores, rather than true scores, can be used when scores on alternate forms are equated using equipercentile methods.


Flanagan (1951, pp. 753–756) provided a detailed description with an example for conducting equipercentile equating. He assigned two alternate forms of a 90-item science test to matched samples of examinees (but he provided no details on how the matching was done). He provided grouped frequency distributions of scores on these alternate forms. He found the score on Form B that corresponded to scores at each score interval on Form A. He plotted these corresponding scores on Form A and Form B, and then hand-smoothed these score correspondences. Flanagan (1951) pointed out the following:

    Experience in preparing a number of sets of “equivalent” scores in this way soon reveals, when subsequent checks are available, that smoothing can result in worse, as well as better, tables of “equivalents” than would be obtained from the distributions without any smoothing. The novice is likely to pay too much attention to minor fluctuations in the positions of the points at the center of the curve rather than maintain a relatively straight line or smooth curve. Lack of experience is also likely to lead to paying too much attention to scattered points at the ends of the distribution which are based on only a few cases … The best type of training for this type of smoothing is to divide the equating sample into halves or to obtain two independent samples. After the plotting and smoothing operations have been completely carried out for one sample, the points from the second sample can be plotted on the same chart. This will tend to provide immediate corrections for systematic errors made in smoothing. The best estimate for the correct position of the curve will, of course, be somewhere between the two sets of points. (pp. 754–755)

Thus, Flanagan (1951) emphasized the importance of having a skilled and experienced psychometrician conduct hand smoothing in equipercentile equating.
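
To make the mechanics concrete, the following is a minimal Python sketch of unsmoothed equipercentile equating for integer raw scores. It is not Flanagan's (1951) procedure, which relied on grouped frequency distributions, graphs, and hand smoothing. The sketch assumes a convention in which each integer score stands for the interval from half a point below to half a point above it, and the frequency counts and function names are hypothetical.

```python
from itertools import accumulate


def cumulative(freqs):
    """Relative cumulative frequencies for scores 0, 1, ..., from counts per score."""
    total = sum(freqs)
    return [count / total for count in accumulate(freqs)]


def percentile_rank(freqs, x):
    """Proportion of examinees below the midpoint of integer score x."""
    cum = cumulative(freqs)
    below = cum[x - 1] if x > 0 else 0.0
    return below + 0.5 * (cum[x] - below)


def equipercentile(freqs_x, freqs_y, x):
    """Form Y score that has the same percentile rank as score x on Form X."""
    p = percentile_rank(freqs_x, x)
    cum_y = cumulative(freqs_y)
    for y, upper in enumerate(cum_y):
        if upper > p:
            lower = cum_y[y - 1] if y > 0 else 0.0
            # Interpolate linearly within the interval [y - 0.5, y + 0.5).
            return (y - 0.5) + (p - lower) / (upper - lower)
    return len(freqs_y) - 0.5  # p equals 1: top of the Form Y score scale


# Hypothetical counts of examinees at raw scores 0 through 4 on each form.
form_x = [1, 2, 4, 2, 1]
form_y = [2, 4, 2, 1, 1]

conversion = [equipercentile(form_x, form_y, x) for x in range(len(form_x))]
print(conversion)
```

In this illustration Form Y is the harder form, so a Form X score of 2 corresponds to a lower Form Y score. With small samples the unsmoothed conversion can be irregular, which is the situation the smoothing discussed above is meant to address.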

Summary

Flanagan (1951) stressed the potential population dependence of equating relationships. Along with Vaughn (1951) and Thorndike (1951), he emphasized the importance of test specifications to equating. Flanagan (1951) focused on a random groups design and equipercentile equating methods. He discussed many of the practical issues in the hand smoothing of equipercentile relationships.

Lord (1950): Statistical Methods of Equating

Whereas Flanagan (1939, 1951) emphasized the use of the random groups equating design (with matched samples) and equipercentile methods, equating of the Scholastic Aptitude Test (SAT) around 1950 was conducted using the CINEG design and linear methods. The first SAT Verbal test equating was conducted in June 1941. According to Donlon and Angoff (1971), a set of items


administered in April 1941 was carried over and administered as part of the June 1941 form of the test. Differences in means and standard deviations on the “carried over” (or “common”) items were … used to adjust the statistics on the entire test forms for those groups, … thereby providing a linear conversion from raw to scaled scores for the new form. (p. 32) The methods used were, apparently, early versions of linear methods for the common-item nonequivalent groups design. These methods were later applied to subsequent forms of the SAT Verbal and Mathematics examinations.

Frederic Lord and Linear Equating

Lord (1950) provided a description of linear equating methods for different data collection designs that were in use at ETS at the time he wrote this report. According to Lord (1950), “The methods of obtaining comparable scores discussed in the present paper are in most cases methods in actual use at ETS, the more advanced methods having been introduced by Dr. [Ledyard] Tucker” (p. 4). In this report, Lord (1950) described six linear equating procedures (names are based on current terminology) as follows:

1. single group design without counterbalancing,
2. random groups design,
3. single group design with counterbalancing,
4. common-item random groups using chained methods,
5. common-item random groups using maximum likelihood estimates,
6. CINEG with assumptions attributed to Tucker.

In addition to providing equations for each of these procedures, Lord (1950) derived and provided equations that could be used to calculate standard errors for all but the last method listed. Through the use of an empirical example, Lord (1950) illustrated how to estimate the relative precision of different procedures. For example, he demonstrated that the addition of a set of common items (Procedure 5 above) that had a strong correlation with total score substantially reduced the standard error of equating as compared to using the random groups design with no common items (Procedure 2 above).
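
Lord's (1950) expressions are not reproduced in this chapter. As a rough illustration of what a standard error of linear equating looks like for the random groups design (Procedure 2), the Python sketch below uses a familiar large-sample approximation obtained with the delta method under the assumption that scores on both forms are normally distributed. This approximation is an assumption introduced here for illustration and should not be read as Lord's formula; the function name and numerical values are hypothetical.

```python
import math


def se_linear_random_groups(x, mean_x, sd_x, sd_y, n_x, n_y):
    """Approximate standard error of the linear equating function at score x.

    Assumes independent random groups and normally distributed scores, so the
    sampling variance is about
        sd_y**2 / 2 * (1 / n_x + 1 / n_y) * (2 + z**2),
    where z is the standardized distance of x from the Form X mean.
    """
    z = (x - mean_x) / sd_x
    variance = (sd_y ** 2 / 2.0) * (1.0 / n_x + 1.0 / n_y) * (2.0 + z ** 2)
    return math.sqrt(variance)


# Hypothetical forms: Form X mean 50 and SD 10, Form Y SD 8, 1,000 examinees each.
se_at_mean = se_linear_random_groups(50, 50.0, 10.0, 8.0, 1000, 1000)
se_in_tail = se_linear_random_groups(70, 50.0, 10.0, 8.0, 1000, 1000)
se_larger_n = se_linear_random_groups(50, 50.0, 10.0, 8.0, 4000, 4000)

print(se_at_mean, se_in_tail, se_larger_n)
```

Under these assumptions the error is smallest near the mean and grows toward the tails, and quadrupling both sample sizes halves it. Comparisons of this kind across designs are what allow the relative precision of procedures to be judged.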

Tucker Linear Equating Method

Procedure 6 above has a particularly interesting history. Lord (1950), Angoff (1971), and Gulliksen (1950) all attributed this method to Ledyard Tucker, who was doing quite a bit of work with ETS at the time. Apparently, Tucker never wrote up the method for distribution. It can be shown that the equations for this


design are the same as the equations for the linear method for the common-item random groups design (Procedure 5). This method is currently referred to as the Tucker method. Gulliksen (1950, pp. 299–301) also provided a detailed description of this method, and also attributed it to Tucker.

Equating with common items using the CINEG design requires making strong assumptions. In this design, examinees from Population 1 are administered Form X and common item set V. Examinees from Population 2 are administered Form Y and common item set V. To derive the Tucker method, it is assumed that the slope and intercept of the linear regression functions (see Equation 3) of X on V and of Y on V are the same for scores from the two populations. Using a subscript to indicate population, $\alpha$ to represent the slope of a linear regression function, and $\beta$ the intercept, the derivation of the Tucker method assumes that

\[ \alpha_1(X|V) = \alpha_2(X|V), \tag{4} \]

\[ \beta_1(X|V) = \beta_2(X|V), \tag{5} \]

\[ \alpha_1(Y|V) = \alpha_2(Y|V), \text{ and} \tag{6} \]

\[ \beta_1(Y|V) = \beta_2(Y|V). \tag{7} \]

In addition, the derivation of the Tucker method assumes that the conditional standard deviations of the linear regressions are equal for the two populations so that

\[ \sigma_1(X|V) = \sigma_2(X|V), \text{ and} \tag{8} \]

\[ \sigma_1(Y|V) = \sigma_2(Y|V). \tag{9} \]

As Kolen and Brennan (2014, p. 105) indicate, these are the assumptions of univariate selection theory as described by Gulliksen (1950, pp. 131–132). So even though linear regression methods are not directly appropriate for equating, linear regression assumptions were used by Tucker to develop one of the first linear equating methods for the CINEG design.
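
The way the regression assumptions in Equations 4 through 9 lead to an equating function can be sketched in Python. The sketch follows the synthetic-population formulation of the Tucker method as it is commonly presented in the later literature (e.g., Kolen & Brennan, 2014), not Tucker's own unpublished derivation. The class, the function name, the weight w1, and all numerical values are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class GroupStats:
    """Summary statistics for one population: its total score and anchor V."""
    mean_total: float
    var_total: float
    mean_v: float
    var_v: float
    cov_total_v: float


def tucker_linear(pop1, pop2, w1=0.5):
    """Return (slope, intercept) placing Form X scores on the Form Y scale.

    pop1 took Form X and V; pop2 took Form Y and V. w1 is the weight given to
    Population 1 in the synthetic population (an assumption of this sketch).
    """
    w2 = 1.0 - w1
    gamma1 = pop1.cov_total_v / pop1.var_v  # regression slope of X on V
    gamma2 = pop2.cov_total_v / pop2.var_v  # regression slope of Y on V
    d_mean = pop1.mean_v - pop2.mean_v
    d_var = pop1.var_v - pop2.var_v

    # Synthetic-population moments implied by equal regressions across groups.
    mean_x = pop1.mean_total - w2 * gamma1 * d_mean
    mean_y = pop2.mean_total + w1 * gamma2 * d_mean
    var_x = (pop1.var_total - w2 * gamma1 ** 2 * d_var
             + w1 * w2 * gamma1 ** 2 * d_mean ** 2)
    var_y = (pop2.var_total + w1 * gamma2 ** 2 * d_var
             + w1 * w2 * gamma2 ** 2 * d_mean ** 2)

    slope = math.sqrt(var_y / var_x)
    return slope, mean_y - slope * mean_x


# Hypothetical groups that differ by one point on the anchor.
group_x = GroupStats(50.0, 100.0, 10.0, 9.0, 24.0)
group_y = GroupStats(60.0, 64.0, 11.0, 9.0, 20.0)
slope, intercept = tucker_linear(group_x, group_y)
print(slope, intercept)

# When the groups do not differ on V, the result is ordinary linear equating.
same_v = GroupStats(60.0, 64.0, 10.0, 9.0, 20.0)
slope_eq, intercept_eq = tucker_linear(group_x, same_v)
```

The adjustment for group differences enters only through the anchor statistics: if the two groups have the same mean and variance on V, the function reduces to the linear function of Equation 2 applied to the two forms.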

Summary

In the 1940s, ETS developed and began using linear equating methods with the CINEG design. As shown, these methods, including the Tucker linear method, required the use of strong statistical assumptions. In addition, Lord (1950) derived standard errors for many of the linear equating procedures.


Angoff (1971): The Second Major Treatment of Equating

Angoff (1971) provided a comprehensive description of equating in the Second Edition of Educational Measurement (Thorndike, 1971). Like Flanagan (1951), he provided this treatment along with discussions of score scales and linking methods.

Conceptual Issues

Angoff (1971) claimed that scores on test forms that are equated must “measure, within acceptable limits, the same psychological function” (p. 563). That is, forms that are equated must measure the same thing. In addition, he stated that the equating “conversion must be unique … [and] should be independent of the individuals from whom the data were drawn to develop the conversion and should be freely applicable to all situations” (p. 563). Note that this second consideration differs from Flanagan (1951), who expected equating to be group dependent, though even he suggested that, from a practical perspective, when equating alternate forms the equating relationships should not depend heavily on examinee group. Citing Flanagan (1951) and Lord (1950), Angoff (1971) stated the following:

    A commonly accepted definition of equivalent scores is: Two scores, one on Form X and the other on Form Y (where X and Y measure the same function with the same degree of reliability), may be considered equivalent if their corresponding percentile ranks in any given group are equal. (p. 563)

Thus, according to this statement, after equating is conducted, the score distributions for the two forms should be the same for any particular group of examinees.

As discussed earlier, Flanagan (1939, 1951) drew upon work by Otis (1922) and Thorndike (1922) to describe why regression functions are not appropriate for use as equating functions. Based on such considerations, Angoff (1971) stated,

    the problem of regression and prediction and the problem of transforming scores [as is done in equating] are different problems. The latter [equating] is highly restrictive with respect to the types of characteristics under consideration. (p. 563)

Angoff (1971) contrasted linear and equipercentile equating methods. He stated:

    There is little doubt that the only way to ensure equivalent scores when the distribution shapes are different is to equate by curvilinear (equipercentile)


    methods. Under such circumstances, the equivalency is established by stretching and compressing the raw score scale of one of the forms so that its distribution will conform to the shape given by the other form. (p. 564)

Angoff (1971) went on to say, however, that linear equating is a very close approximation to equipercentile equating when the shapes of the raw score distributions are similar (p. 564). In addition, he stated that linear equating methods are often preferred to equipercentile methods because a linear method is analytical and verifiable and is free from any errors of smoothing, which can produce serious errors in the score range in which data are scant and/or erratic (p. 564).

Data Collection Designs and Statistical Equating Methods

Angoff (1971, pp. 568–586) provided a comprehensive description of equating data collection designs, presenting different statistical methods for each of these designs. These statistical methods included linear and curvilinear statistical methods. He began with the random groups design. He used this design to introduce the linear equating method, discussed how to convert raw scores to scale scores, and introduced expressions for standard errors of linear equating that Lord (1950) originally presented. He introduced an equipercentile method through a graphical example that used hand smoothing and that had similarities to the example Flanagan (1951, pp. 753–758) provided. He also described two analytic methods for smoothing score distributions: the Cureton and Tukey (1951) method and the Keats and Lord (1962) strong true score method that uses the negative hypergeometric distribution.

Angoff (1971) described the single group design with counterbalancing (where examinees take two forms with order counterbalanced). He provided equations for a linear method under this design, and he provided an expression for the standard error of equating, which he attributed to Lord (1950). He showed how to use standard errors of equating to compare linear equating precision of the random groups and single group with counterbalancing designs. In addition, he briefly outlined an equipercentile method for this design.

Angoff (1971) described a linear method for a common-item random groups design that Lord (1950) first presented. In addition to equations for this method, he provided standard errors of equating that Lord (1950) first developed. Using the standard error expressions, he demonstrated how equating precision improves as the correlation between scores on the test forms and the common item set increases.


He described the Tucker method (Angoff, 1971, p. 580) and pointed out that the computational procedures for the Tucker method are the same as those for the linear method for the common-item random groups design Lord (1950) described. Angoff (1971) further stated that this method is appropriate for “groups not widely different in ability” (p. 579). Angoff (1971) pointed out that the derivation of this method relies on strong statistical assumptions by stating the following:

    The general caution that statistical methods should not be used unless the assumptions that are basic to their derivation can be fulfilled is seldom as clear as it is here … [These assumptions] are applicable only when it may be assumed that the regression systems for [the two groups] would have been the same had the groups taken precisely the same tests. This is not an unreasonable assumption when groups are similar in all relevant respects. (p. 580)

Angoff (1971) stressed that the CINEG design is quite flexible. He stated that the common items can be administered in addition to and separate from Form X and Form Y, or as part of Form X and Form Y …. It may be a separately timed section for Forms X and Y, or it may be a set of discrete items interspersed through the two forms but capable of yielding a total score. (p. 580) Angoff (1971) went on to say that the common item set “is constructed and administered to represent psychologically the same task for both groups” (p. 580).

Angoff (1971, pp. 581–582) described an equipercentile method for the CINEG design that Braun and Holland (1982) later referred to as frequency estimation. The basic assumptions of this method are that the distribution of X given V is the same for both groups and the distribution of Y given V is the same for both groups. Angoff indicated that this method is appropriate “for groups not widely different in ability.”

Angoff (1971) described a linear method that he indicated is appropriate “for samples of different ability” (p. 582). He attributed this method to Levine (1955). (Also see Kolen & Brennan, 2014, pp. 109–116.) This method makes assumptions about true scores that are similar to the Tucker assumptions made for observed scores. These assumptions are that true scores on Form X and V correlate 1.0 for both populations and true scores on Form Y and V correlate 1.0 for both populations. In addition, the linear regression of true scores on Form X on true scores on V is the same for both populations, and the linear regression of true scores on Form Y on true scores on V is the same for both populations. Finally, it is assumed that the measurement error variance for Form X is the same for both populations. The


same assumption is made for Form Y and V. In addition, the classical congeneric model (Feldt & Brennan, 1989) is assumed to hold for X, Y, and V. Note that essentially the same model assumption was described by Angoff (1953). In contrasting the Tucker and Levine methods, it is important to recognize that the Tucker method makes assumptions about regression of observed scores whereas the Levine method makes assumptions about true scores, including the assumption that true scores correlate 1.0, implying that Forms X, Y, and common items V all measure the same thing. Angoff (1971, p. 583) discussed the chained linear procedure Lord (1950) described and that was discussed earlier in this chapter. He also provided the Lord (1950) standard errors of equating for this procedure. In addition, he briefly described a chained equipercentile method.

Equating Error Concepts

Angoff (1971, pp. 586–587) ended his equating presentation with a discussion of equating error. He pointed out that error in equating “can loom quite large in relation to the error in a mean and can seriously affect comparisons of group performance” (p. 587). In the same paragraph he stated, “moreover, in any large testing program where many forms of the same test are produced and equated, the error of equating can become quite considerable, if left unchecked” (p. 587). He stated that such error can cumulate and be substantial over a chain of equatings. He also discussed the possibility of equating strains developing in chains of equating and how this concern can be attended to by equating to multiple old forms and averaging the results.
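
A rough Python sketch can convey why random equating error grows along a chain and why averaging helps. It rests on simplifying assumptions that are not Angoff's: the errors of the individual equatings are independent, they are expressed on a common score scale, and the conversion slopes are close to 1, so that error variances simply add. The function names and values are hypothetical, and the sketch does not address systematic equating strains.

```python
import math


def chain_se(link_ses):
    """Standard error after a chain of equatings with independent errors."""
    return math.sqrt(sum(se ** 2 for se in link_ses))


def averaged_se(se_a, se_b):
    """Standard error of the simple average of two independent equatings."""
    return 0.5 * math.sqrt(se_a ** 2 + se_b ** 2)


# Hypothetical standard errors, in scale score points, for individual links.
two_links = chain_se([0.3, 0.4])
five_links = chain_se([0.3] * 5)
two_old_forms = averaged_se(0.5, 0.5)

print(two_links, five_links, two_old_forms)
```

Under these assumptions a chain of five equal links carries a bit more than twice the random error of a single link, and averaging two independent equatings of equal precision reduces the error by a factor of about 0.71.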

Angoff (1971) and Flanagan (1951) Compared

From a conceptual point of view, Angoff (1971) and Flanagan (1951) had apparent differences. Angoff (1971) stressed that equating should be population invariant, whereas Flanagan (1951) stressed that equating is necessarily population dependent. Angoff (1971) tended to focus on the CINEG design whereas Flanagan (1951) focused on the random groups design. Angoff (1971) appeared to prefer linear methods because they were analytic, and they gave similar results to equipercentile equating in many situations. Flanagan (1951) preferred equipercentile methods with hand smoothing, and he described steps that could be taken to address the judgmental aspects of hand smoothing.

Smoothed Equipercentile Equating: A Personal Account

In this section, I relay some of my personal experiences in developing and implementing smoothed equipercentile equating methods. In addition, I relate this experience to historical developments.


Hand Smoothing

In 1981, I was hired by Bob Brennan to be a psychometrician at the American College Testing Program, now known as ACT. Brennan had recently started a new department in the Research Division. One of the tasks of this department was to take over the equating of the ACT Assessment college admissions examination. Since the early 1960s, the ACT Assessment had been equated with data collected using the random groups design and hand-smoothed equipercentile equating procedures. Under Brennan’s supervision, I was assigned this equating work as my first major responsibility. I began working at ACT at the beginning of August, with the first equating planned for October.

I attended a meeting with Brennan and Rolland Ray in early August 1981. Ray, who was approaching retirement, had been trained by E. F. Lindquist in the early 1960s to use hand-smoothed equipercentile methods to equate the ACT Assessment. Ray started with relative cumulative frequency distributions of raw scores. He had clerical staff plot the points for these distributions. Then, with the aid of a manual drawing tool, he hand-smoothed the distributions, which, for the old and new form, were plotted on the same very large sheet of graph paper. Equipercentile equivalents were found graphically and plotted on another sheet of graph paper, which also contained a raw-to-scale score curve for the old form. The outcome of this process was a conversion table of raw-to-scale score equivalents for the new form. This process was conducted for each new form. My recollection is that ACT equated around eight new forms each year. Ray explained many of the intricacies of hand smoothing, which were generally in line with some of Flanagan’s (1951) comments on hand smoothing. Following the meeting with Ray, Brennan and I came up with a plan.
I (with the help of my graduate assistant Dean Colton) decided to replicate the equatings that had been conducted in 1980 to see if we could put together a process that would produce equating conversion tables that were similar to those used operationally in the past. To conduct this replication, I relied on some of the work I had done earlier. My dissertation was an empirical equating study that I had worked on under Leonard Feldt at The University of Iowa. A major portion of this study involved comparing equipercentile and item response theory (IRT) equating methods using a random groups design. This work is summarized in Kolen (1981). As part of my dissertation work, I had developed computer code for equipercentile (unsmoothed) equating methods along with IRT true and observed score methods. Using the computer code from my dissertation, I was able to conduct equipercentile equating (with no smoothing). Beginning with the relative cumulative frequency distributions and a raw-to-scale score conversion table for the old form, we used the computer code to produce unsmoothed raw-to-raw conversions and raw-to-scale score conversions for the new form. I recall that these unsmoothed

History of Test Equating Methods 333

procedures produced results that were very similar to the operational conversion tables Ray provided. However, there were some differences in conversions, especially in regions of the raw score distribution where few examinees scored.

We then created plots using a large-format plotter that was connected to The University of Iowa’s mainframe computer. On one sheet of paper we plotted the relative cumulative frequency distributions. On a second sheet of paper we plotted the unsmoothed raw-to-raw score conversions of raw scores on the new form to raw scores on the old form. On a third sheet of paper we plotted the unsmoothed raw-to-scale score conversions for the new form.

To complete our work, we took over one of the small conference rooms in the Lindquist Building at ACT. We attached the very large graphs to the wall and also used some tabletops. Starting with the raw-to-raw score plot, we used a manual drawing tool to smooth this relationship. Whenever we changed points by smoothing, we modified the plotted points on the raw-to-scale score plot and the raw-to-scale score conversion table. My recollection is that the final conversions using smoothing tended to be closer to the operational conversions than the no-smoothing conversions were. Based on this finding, we adopted this new hand-smoothing procedure for equating the ACT Assessment in 1981.
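The unsmoothed computation described in this account can be illustrated with a short Python sketch. It is not the code used at ACT or in the dissertation work. It assumes one common convention, in which each integer raw score is treated as spread uniformly over half a point on either side, and all function names and the example frequencies are hypothetical.

```python
import math

def percentile_rank(freqs, x):
    """Percentile rank (0-100) of integer raw score x.

    freqs[k] is the number of examinees with raw score k. Each integer
    score is treated as spread uniformly over [k - 0.5, k + 0.5].
    """
    n = sum(freqs)
    below = sum(freqs[:x])
    return 100.0 * (below + 0.5 * freqs[x]) / n

def score_at_percentile(freqs, p):
    """Raw score (possibly non-integer) whose percentile rank is p."""
    target = p / 100.0 * sum(freqs)
    cum = 0.0
    for y, f in enumerate(freqs):
        if f > 0 and cum + f >= target:
            # Interpolate within the interval [y - 0.5, y + 0.5].
            return (y - 0.5) + (target - cum) / f
        cum += f
    return len(freqs) - 0.5

def equipercentile(freqs_new, freqs_old):
    """Unsmoothed raw-to-raw conversion: new-form score -> old-form equivalent."""
    return [score_at_percentile(freqs_old, percentile_rank(freqs_new, x))
            for x in range(len(freqs_new))]

def to_scale(raw_equiv, old_raw_to_scale):
    """Linearly interpolate the old form's raw-to-scale table at raw_equiv."""
    top = len(old_raw_to_scale) - 1
    r = min(max(raw_equiv, 0.0), float(top))   # clamp to the table's range
    lo = min(int(math.floor(r)), top - 1)
    frac = r - lo
    return old_raw_to_scale[lo] + frac * (old_raw_to_scale[lo + 1] - old_raw_to_scale[lo])

# Hypothetical frequencies: the old form's distribution is the new form's
# distribution shifted up by one raw score point.
new_form = [1, 2, 3, 2, 1, 0]
old_form = [0, 1, 2, 3, 2, 1]
raw_to_raw = equipercentile(new_form, old_form)
```

In this toy example, every new-form score that examinees actually earned converts to an old-form score one point higher, as expected. With real data the conversions are irregular where few examinees score, which is the irregularity that smoothing addresses.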

Analytic Smoothing

For me, this hand-smoothing procedure was both tedious and subjective. So, with Brennan’s support, I spent as much time as possible during the next year developing a procedure that avoided hand smoothing. I searched the literature for smoothing methods that might be appropriate, and I consulted Angoff (1971), who discussed a method for smoothing score distributions described by Cureton and Tukey (1951). In this method, a smoothed frequency is a weighted average of nearby frequencies. I found that the resulting frequency distributions were still bumpy. In addition, the resulting frequency distributions tended to be systematically less skewed than the unsmoothed distribution.

I also studied a method Lindsay and Prichard (1971) described that smoothed the raw-to-raw score equating relationship using a polynomial function. Because the raw-to-raw score relationship is smoothed, this method is referred to as a postsmoothing method. This method seemed promising, although the regression was not symmetric, which seemed problematic. I was concerned, too, that polynomials might not be flexible enough to fit all equating relationships that I would encounter.

The smoothing method I developed used cubic splines, rather than polynomials. In addition, to deal with symmetry issues, one spline was found for linking scores on Form X to Form Y, a second spline was found for linking scores on Form Y to Form X, and these two spline functions were averaged. I also conducted research on this method, determined that it could handle a wide variety of equating relationships, and became convinced that the resulting equating was more precise than the unsmoothed method. We began using this cubic spline postsmoothing method operationally in 1982, with the supporting research published soon after (Kolen, 1984).

After developing the cubic spline method, I researched presmoothing methods, where the relative frequency distributions for Form X and Form Y are smoothed separately. For presmoothing methods, the Form Y equipercentile equivalents of the Form X scores are found using the smoothed relative frequency distributions. In Kolen (1991), I reported on research, conducted with or by colleagues, that compared methods for estimating score distributions, including the Cureton and Tukey (1951) method referenced by Angoff (1971), a binomial kernel method I developed, a strong true score method Lord (1965) developed, and a log-linear method that Holland and Thayer (1987) recommended. Lord’s strong true score method and the log-linear method seemed most promising for smoothing test score distributions.

Subsequently, my colleagues Hanson, Zeng, and Colton (1994) conducted research that suggested that both presmoothing and postsmoothing methods had the capacity to improve equating compared to no smoothing. The cubic spline postsmoothing and log-linear presmoothing methods appeared most promising for smoothing in test score equating. The cubic spline postsmoothing method has been used operationally with the ACT Assessment (ACT, 2019), and both the cubic spline postsmoothing method and the polynomial log-linear methods are used operationally with the SAT (College Board, 2017).

In general, analytic smoothing techniques such as cubic spline postsmoothing and log-linear presmoothing are much less subjective than hand-smoothing methods. In addition, these methods have been shown to be flexible and more accurate than no smoothing and many other smoothing methods.
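The simplest presmoothing idea mentioned in this section, in which a smoothed frequency is a weighted average of nearby frequencies, can be sketched as follows. This is an illustration only. The weights (1, 4, 6, 4, 1), the handling of the end points, and the rescaling step are assumptions chosen for the example, not the weights or rules of the Cureton and Tukey (1951) method, and the function name is hypothetical.

```python
def smooth_frequencies(freqs, weights=(1, 4, 6, 4, 1)):
    """Rolling weighted-average smoothing of a raw score frequency distribution.

    freqs[k] is the observed frequency of raw score k. The weights are
    hypothetical. Near the ends of the score range, only the weights that
    fall inside the range are used and they are renormalized.
    """
    half = len(weights) // 2
    smoothed = []
    for i in range(len(freqs)):
        num = 0.0
        den = 0.0
        for k, w in enumerate(weights):
            j = i + k - half          # index of the neighboring score
            if 0 <= j < len(freqs):
                num += w * freqs[j]
                den += w
        smoothed.append(num / den)
    # Rescale so that the smoothed frequencies sum to the original sample size.
    scale = sum(freqs) / sum(smoothed)
    return [s * scale for s in smoothed]

# Hypothetical bumpy distribution for a 10-item test (scores 0 to 10).
observed = [2, 5, 4, 11, 9, 16, 12, 15, 8, 9, 3]
presmoothed = smooth_frequencies(observed)
```

The smoothed frequencies would then replace the observed frequencies when equipercentile equivalents are computed. A local average of this kind preserves the sample size but not, in general, the mean, variance, or skewness of the observed distribution, which is one reason moment-preserving approaches such as log-linear presmoothing are attractive.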

IRT Equating

Practical applications of IRT were provided in a 1977 special issue of the Journal of Educational Measurement. Lord (1977) discussed a few practical applications of IRT, including equating. Lord (1980) expanded his 1977 discussion of equating. Wright (1977) and Wright and Stone (1979) described equating procedures using the Rasch model.

Scale Transformation

A step in IRT equating that is often necessary is to linearly transform IRT parameter estimates to place them on the same scale. Such a transformation is often used when data are collected using the CINEG design. Marco (1977) described methods for linearly transforming IRT scales using means and standard deviations of item parameter estimates. Stocking and Lord (1983) and Haebara (1980) provided more sophisticated scale transformation methods that are often referred to as characteristic curve methods.
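As an illustration of a moment-based linear scale transformation, the sketch below computes the slope and intercept from the means and standard deviations of the common items’ difficulty estimates, a variant usually called the mean/sigma method. It is a minimal sketch under stated assumptions: the function names and the example parameter values are hypothetical, and only difficulty estimates are used to find the constants.

```python
from statistics import mean, pstdev

def mean_sigma_constants(b_new, b_old):
    """Slope A and intercept B that place the new scale on the old scale.

    b_new and b_old are difficulty estimates for the same common items,
    obtained from the new-form and old-form calibrations respectively.
    """
    A = pstdev(b_old) / pstdev(b_new)
    B = mean(b_old) - A * mean(b_new)
    return A, B

def transform_item(a, b, A, B):
    """Discrimination and difficulty re-expressed on the old scale."""
    return a / A, A * b + B

def transform_theta(theta, A, B):
    """Ability re-expressed on the old scale."""
    return A * theta + B

# Hypothetical common-item difficulty estimates from two calibrations.
b_new = [-1.0, 0.0, 1.0]
b_old = [-1.5, 0.5, 2.5]
A, B = mean_sigma_constants(b_new, b_old)
```

Because discrimination is divided by A while ability and difficulty are both multiplied by A and shifted by B, the quantity a(θ − b) is unchanged by the transformation, so the model’s response probabilities are preserved. Characteristic curve methods choose A and B differently, by minimizing differences between item or test characteristic curves rather than by matching moments.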


IRT True and Observed Score Equating

After parameter estimates are placed on the same scale, Lord (1980) illustrated how raw scores on the new form could be equated to raw scores on an old form using IRT true score or IRT observed score equating. According to Lord (1980), IRT true score equating involves finding true score equivalents on the two forms. Lord (1980) stated “that there is no really appropriate way to make use of the true-score equating obtained. We do not know an examinee’s true score” (p. 202). Still, he suggested using this true score relationship with observed scores.

Because of these issues with true score equating, Lord (1980) presented an IRT observed score equating method where the IRT model is used to provide smoothed score distributions for Form X and Form Y. These smoothed distributions are then equated using equipercentile methods. Lord (1980) provided an illustration of both of these equating methods. Lord (1980) asked, is using this observed score method “better than applying … true-score equating … to observed scores?” (p. 203). He then stated that there is no good criterion to help make such a choice.
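The logic of IRT true score equating can be sketched in a few lines. The example below is illustrative rather than a reproduction of Lord’s (1980) procedure: it assumes a two-parameter logistic model, item parameters already placed on a common scale, and a simple bisection search. All names and parameter values are hypothetical.

```python
import math

def prob_correct(theta, a, b):
    """Two-parameter logistic item response function (an assumed model)."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def true_score(theta, items):
    """Test characteristic curve: expected number-correct score at theta."""
    return sum(prob_correct(theta, a, b) for a, b in items)

def theta_for_true_score(tau, items, lo=-10.0, hi=10.0):
    """Find theta whose true score equals tau by bisection.

    Requires 0 < tau < number of items; true scores of exactly 0 or the
    maximum correspond to no finite theta under this model.
    """
    for _ in range(200):
        mid = (lo + hi) / 2.0
        if true_score(mid, items) < tau:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

def true_score_equate(tau_x, items_x, items_y):
    """Form Y true score equivalent of a Form X true score."""
    theta = theta_for_true_score(tau_x, items_x)
    return true_score(theta, items_y)

# Hypothetical (a, b) parameters; Form Y is uniformly harder than Form X.
items_x = [(1.0, -1.0), (1.2, 0.0), (0.8, 1.0)]
items_y = [(a, b + 0.5) for a, b in items_x]
equivalent = true_score_equate(1.5, items_x, items_y)
```

Because Form Y is harder in this example, the Form Y equivalent of a Form X true score of 1.5 is lower than 1.5. In practice the resulting conversion is applied to observed scores, which is exactly the step Lord questioned.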

IRT-Based Equating Criteria

In addition to presenting equating methods, Lord (1980) developed criteria for equating. Following Angoff (1971), he stated that equating should be invariant across groups and that equating functions should be symmetric. He also added a criterion that he referred to as equity. He defined equity as follows:

If an equating of tests x and y is to be equitable to each applicant, it must be a matter of indifference to applicants at every given ability level p(θ) whether they are to take test x or test y. (p. 195)

He went on to say that equity implies that, for a given p(θ), the distribution of observed scores on Form X converted to the Form Y scale and the distribution of observed scores on Form Y must be the same. He concluded by stating the following:

Under realistic regularity conditions scores x and y on two tests cannot be equated unless either (1) both scores are perfectly reliable or (2) the two tests are strictly parallel. (p. 198)

Thus, equity is impossible under any realistic conditions. Near the end of his discussion of equating, Lord (1980) stated, “what is really needed is a criterion for evaluating approximate procedures, so as to be able to choose among them” (p. 207).

Although Lord (1980) defined equity conditional on IRT p(θ), he could have defined equity conditional on true score due to the perfect functional relationship between true score and theta when an IRT model holds. See Kolen and Brennan (2014, pp. 320–325) for further discussions of equity based on IRT and on classical congeneric models, Morris (1982) for a consideration of equity as conditional on true score, and Hanson (1991) for the conditions under which a true score version of Levine (1955) equating led to what Hanson (1991) referred to as first-order equity.

Holland and Rubin (1982) Test Equating Volume

In 1978, Educational Testing Service (ETS) initiated a project to investigate test equating methodology. This project culminated with the publication of the Holland and Rubin (1982) Test Equating volume. According to the Preface, the project was initiated for the following reason:

With the advent of open testing and test disclosure legislation, the technical problems associated with test equating have received renewed interest and have been the subject of vigorous debate. The older methods of equating tests assumed various degrees of test security that are not necessarily compatible with current legislation. Thus test equating techniques, once viewed as obscure and specialized statistical methods of interest only to testing organizations, have been thrust into the public arena for scrutiny and sharp analysis. (p. xiii)

According to Holland and Rubin (1982), “the papers in this book focus exclusively on the statistical aspects of test equating rather than on the problems of test construction” (p. 2). This volume included a summary of linear and equipercentile equating methods by Angoff (1982) and a summary of IRT equating methods by Lord (1982). Among the rich work reflected in the volume is an investigation of a new equating design, section pre-equating, by Holland and Wightman (1982), which they developed to address some of the issues associated with test disclosure legislation. In addition, Petersen, Marco, and Stewart (1982) provided an extensive empirical study of linear equating methods.

Braun and Holland’s Mathematical Analysis

The chapter by Braun and Holland (1982) had a strong influence on my thinking about equating. In this chapter, Braun and Holland (1982) provided what they referred to as “a mathematical analysis of ETS equating procedures” that focused on linear and equipercentile equating methods for the random groups and CINEG designs. Their mathematical analysis led to notation that helped clarify the assumptions and derivations of the methods. Although I will not cover the entire chapter in detail, I will focus on a few developments that their mathematical analysis enabled.

Braun and Holland (1982, p. 16) provided notation and an equation for equipercentile equating functions. They showed how the linear equating function is a special case of the equipercentile function when the equipercentile function is a straight line.

Braun and Holland (1982, p. 20) introduced the concept of the synthetic population with the CINEG design. They began with the idea that examinees from one population took one form and that examinees from another population took the other form. They defined the synthetic population as a weighted combination of these two populations. The investigator defines these weights. Synthetic weights can be useful in reconciling differences in equations for the Tucker method that are presented in different sources. For example, Gulliksen’s (1950) version of the Tucker method assumes that the synthetic population is the population that was administered the new test form. Angoff’s (1971) presentation effectively uses synthetic weights proportionally equal to the sample sizes from each of the populations. Braun and Holland’s (1982) notion of a synthetic population also made clear that equipercentile and linear methods are necessarily population dependent.

Braun and Holland (1982) provided a detailed description of the frequency estimation equipercentile method that Angoff (1971) originally described and that is used for equipercentile equating for the CINEG design, complete with a set of equations. Their derivation enabled further developments in equipercentile methods for the CINEG design (e.g., Jarjoura & Kolen, 1985; Kolen & Jarjoura, 1987).

Braun and Holland (1982) also derived standard errors of linear equating without making a normality assumption. Previous standard error formulas, such as those Lord (1950) derived, assumed that test scores were normally distributed. This presentation of standard errors that did not assume normality enabled further developments in this area (e.g., Kolen, 1985).
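The role of the synthetic weights in linear equating can be made concrete with a small sketch. It assumes that the mean and variance of each form are already available for both populations. Under the CINEG design the moments for the form a population did not take are not observed and must be estimated (for example, under the Tucker assumptions), a step omitted here. The function names and numeric values are hypothetical.

```python
import math

def synthetic_moments(w1, mean1, var1, mean2, var2):
    """Mean and variance of a score in the synthetic population.

    w1 is the weight given to Population 1 (here, the group that took the
    new form); Population 2 receives weight 1 - w1.
    """
    w2 = 1.0 - w1
    mean_s = w1 * mean1 + w2 * mean2
    var_s = w1 * var1 + w2 * var2 + w1 * w2 * (mean1 - mean2) ** 2
    return mean_s, var_s

def linear_equate(x, mean_x, var_x, mean_y, var_y):
    """Linear equating: match standardized deviation scores on the two forms."""
    return math.sqrt(var_y / var_x) * (x - mean_x) + mean_y

# Hypothetical Form X moments in each population.
mean_x1, var_x1 = 10.0, 4.0   # Population 1
mean_x2, var_x2 = 14.0, 4.0   # Population 2

# Two weighting choices described in the text.
new_form_population = synthetic_moments(1.0, mean_x1, var_x1, mean_x2, var_x2)
n1, n2 = 1500, 1500
proportional = synthetic_moments(n1 / (n1 + n2), mean_x1, var_x1, mean_x2, var_x2)
```

With all of the weight on Population 1, the synthetic moments are simply that population’s moments. With weights proportional to sample size, the synthetic mean lies between the two population means and the synthetic variance also reflects the difference between them. Because the equating function depends on these moments, it depends on the weights, which is one way to see why linear methods are population dependent.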

Test Standards

In editions prior to the 1985 Test Standards (AERA, APA, NCME, 1985), the topic of equating was not even mentioned. In the 1985 Test Standards, equating was covered in the chapter titled Scaling, Norming, Score Comparability, and Equating. The introduction of this chapter in the 1985 Test Standards made it clear that the term equating is for the situation where “alternate forms of a test are interchangeable in use” (p. 31). The introduction to this chapter went on to say that tests to be equated should not differ substantially in reliability, in the characteristics measured, in their content specifications, or in difficulty. Standard 4.6 of AERA, APA, and NCME (1985) indicated that details of the equating should be available,

including specific information about the method of equating: the administrative design and statistical procedures used, the characteristics of the anchor test, if any, and of the sampling procedures; information on the sample; and sample size. Periodic checks on the adequacy of the equating should be reported. (p. 34)

Standard 4.8 of AERA, APA, and NCME (1985) focused on the use of the set of common items or anchor test.

When using an anchor test, the characteristics of the anchor test should be described, particularly in its relation to the forms being equated. Content specifications for the anchor test and statistical information regarding the relationships between the anchor test and each form should be provided separately for the sample of people taking each form. (p. 34)

Standards 4.6 and 4.8 reinforce many of the ideas that come from the references cited in this chapter. Subsequent editions of the Test Standards have an expanded scope compared to 1985. In particular, the 2014 Standards (AERA, APA, and NCME, 2014) contained a standard on the use of model-based methods in equating.

Post-1985

In this chapter, I present a history of equating that goes through 1985. For further information, Kolen and Brennan (2014) give a comprehensive treatment of equating (as well as linking and scaling), including detailed descriptions of many equating procedures. In particular, Kolen and Brennan (2014) provide descriptions of equating data collection designs that use item pools and IRT, references to standard errors of equating for a variety of equating methods, a discussion of how to choose smoothing parameters in equipercentile equating with analytic smoothing, references to a wide variety of research on equating, and an extensive discussion of practical issues in equating.

The introduction to the present chapter stressed the importance of test development processes, including the use of detailed content and statistical specifications, in the development of alternate forms that are equated. As already noted, the importance of such test development procedures for test form equating was discussed by Flanagan (1939, 1951), Thorndike (1951), and Vaughn (1951). Although subsequent equating references acknowledge the importance of test content and statistical specifications for equating (e.g., Kolen & Brennan, 2014), specific procedures for developing content and statistical specifications and for assembling alternate forms to be equated have not received very much emphasis in the equating literature. Such subsequent references do not ignore test development processes as much as they assume that the appropriate test development processes were properly established.

When conducting equating, many complex choices need to be made. These choices can include which examinee groups are used to collect data for equating, which statistical methods are used (such as the smoothing method and IRT model), which items are used as common items, how to best minimize the effects of violations of assumptions, and how to document the equating. These sorts of issues need attention if the scores that result from application of equating methodology are to be comparable over multiple forms.

The information in the present chapter can be supplemented by chapters in the third edition (Petersen, Kolen, & Hoover, 1989) and fourth edition (Holland & Dorans, 2006) of Educational Measurement, which update the general treatments by Angoff (1971) and Flanagan (1951). The monograph by von Davier, Holland, and Thayer (2004) presents kernel equating. In addition, the Holland and Dorans (2006) chapter and the Dorans, Pommerich, and Holland (2007) edited volume cover many issues in linking. Holland (2007) provides a history of linking. See Holland and Dorans (2006), Linn (1993), Mislevy (1992), and Kolen and Brennan (2014, pp. 487–536) for discussions of linking. von Davier (2011) focuses on statistical aspects of equating, scaling, and linking. Dorans and Puhan (2017) review research on linking, including equating, that has been conducted at ETS.

When I entered the field of educational measurement in the 1970s, there were few resources for learning about equating other than Flanagan (1951), Angoff (1971), and a few ETS research reports. At that time, there were no available graduate courses or training workshops on equating and no generally available computer programs that could be used to conduct equating. Furthermore, the computing hardware that was available was primitive compared to what we have today.

By the early 1980s the importance of equating began to receive considerable attention (Kolen & Brennan, 2014) due to an increase in the numbers and variety of testing programs that used multiple test forms, the need to address testing critics, and the accountability movement in education. Because of this increased importance, measurement professionals conducted a considerable amount of research on equating. In addition, Robert Brennan and I began presenting internal equating training sessions at ACT in the early 1980s and at the Annual Meetings of the National Council on Measurement in Education beginning around 1990. We developed a graduate seminar on equating at the University of Iowa in the 1990s, and we published the first edition of the Kolen and Brennan (2014) text in 1995. As computers became more powerful, we made available to the public computer programs for conducting equating analyses (see Kolen & Brennan, 2014, pp. 559–560), and Brennan, Wang, Kim, and Seol (2009) provide C computer code for most popular equating methods.


Advancements in equating procedures include research, training opportunities, and computing resources. Because of these advancements, measurement professionals now have substantial resources for learning about and implementing equating methods and practices.

Note

1 The author thanks Brian Clauser, Deborah Harris, and Timothy Moses for their reviews of an earlier version of this chapter.

References

ACT (2019). ACT technical manual. Iowa City, IA: ACT.
American Educational Research Association, American Psychological Association, National Council on Measurement in Education (AERA, APA, NCME) (1985). Standards for educational and psychological testing. Washington, DC: American Educational Research Association, American Psychological Association, National Council on Measurement in Education.
American Educational Research Association, American Psychological Association, National Council on Measurement in Education (AERA, APA, NCME) (2014). Standards for educational and psychological testing. Washington, DC: American Educational Research Association, American Psychological Association, National Council on Measurement in Education.
Angoff, W.H. (1953). Test reliability and effective test length. Psychometrika, 18, 1–14.
Angoff, W.H. (1971). Scales, norms, and equivalent scores. In R.L. Thorndike (Ed.), Educational measurement (2nd ed., pp. 508–600). Washington, DC: American Council on Education.
Angoff, W.H. (1982). Summary and derivation of equating methods used at ETS. In P.W. Holland & D.B. Rubin (Eds.), Test equating (pp. 55–69). New York: Academic.
Braun, H.I., & Holland, P.W. (1982). Observed-score test equating: A mathematical analysis of some ETS equating procedures. In P.W. Holland & D.B. Rubin (Eds.), Test equating (pp. 9–49). New York: Academic.
Brennan, R.L., Wang, T., Kim, S., & Seol, J. (2009). Equating recipes. Iowa City, IA: The University of Iowa.
College Board (2017). SAT suite of assessments technical manual. New York: The College Board.
Cureton, E.F., & Tukey, J.W. (1951). Smoothing frequency distributions, equating tests, and preparing norms. American Psychologist, 6, 404.
Donlon, T.F., & Angoff, W.H. (1971). The Scholastic Aptitude Test. In W.H. Angoff (Ed.), The College Board Admissions Testing Program: A technical report on research and development activities relating to the Scholastic Aptitude Test and Achievement Tests (pp. 15–45). New York: The College Board.
Dorans, N.J., Pommerich, M., & Holland, P.W. (Eds.) (2007). Linking and aligning scores and scales. New York: Springer.
Dorans, N.J., & Puhan, G. (2017). Contributions to score linking theory and practice. In R.E. Bennett & M. von Davier (Eds.), Advancing human assessment. Princeton, NJ: Educational Testing Service.
Feldt, L.S., & Brennan, R.L. (1989). Reliability. In R.L. Linn (Ed.), Educational measurement (3rd ed., pp. 105–146). New York: Macmillan.
Flanagan, J.C. (1939). The cooperative achievement tests. A bulletin reporting the basic principles and procedures used in the development of their system of scaled scores. New York: American Council on Education Cooperative Test Service.
Flanagan, J.C. (1951). Units, scores, and norms. In E.F. Lindquist (Ed.), Educational measurement (pp. 695–763). Washington, DC: American Council on Education.
Gulliksen, H. (1950). Theory of mental tests. New York: Wiley.
Haebara, T. (1980). Equating logistic ability scales by a weighted least squares method. Japanese Psychological Research, 22, 144–149.
Hanson, B.A. (1991). A note on Levine’s formula for equating unequally reliable tests using data from the common item nonequivalent groups design. Journal of Educational Statistics, 16, 93–100.
Hanson, B.A., Zeng, L., & Colton, D. (1994). A comparison of presmoothing and postsmoothing methods in equipercentile equating (ACT Research Report 94-94). Iowa City, IA: American College Testing.
Holland, P.W. (2007). A framework and history for score linking. In N.J. Dorans, M. Pommerich, & P.W. Holland (Eds.), Linking and aligning scores and scales (pp. 5–30). New York: Springer.
Holland, P.W., & Dorans, N.J. (2006). Linking and equating. In R.L. Brennan (Ed.), Educational measurement (4th ed., pp. 187–220). Westport, CT: American Council on Education and Praeger.
Holland, P.W., & Rubin, D.B. (Eds.) (1982). Test equating. New York: Academic.
Holland, P.W., & Thayer, D.T. (1987). Notes on the use of log-linear models for fitting discrete probability distributions (Technical Report 87-79). Princeton, NJ: Educational Testing Service.
Holland, P.W., & Wightman, L.E. (1982). Section pre-equating: A preliminary investigation. In P.W. Holland & D.B. Rubin (Eds.), Test equating (pp. 271–297). New York: Academic.
Jarjoura, D., & Kolen, M.J. (1985). Standard errors of equipercentile equating for the common item nonequivalent populations design. Journal of Educational Statistics, 10, 143–160.
Keats, J.A., & Lord, F.M. (1962). A theoretical distribution for mental test scores. Psychometrika, 27, 59–72.
Kelley, T.L. (1914). Comparable measures. Journal of Educational Psychology, 5, 589–595.
Kelley, T.L. (1923). Statistical methods. New York: Macmillan.
Kolen, M.J. (1981). Comparison of traditional and item response theory methods for equating tests. Journal of Educational Measurement, 18, 1–11.
Kolen, M.J. (1984). Effectiveness of analytic smoothing in equipercentile equating. Journal of Educational Statistics, 9, 25–44.
Kolen, M.J. (1985). Standard errors of Tucker equating. Applied Psychological Measurement, 9, 209–223.
Kolen, M.J. (1991). Smoothing methods for estimating test score distributions. Journal of Educational Measurement, 28, 257–282.
Kolen, M.J., & Brennan, R.L. (2014). Test equating, scaling, and linking: Methods and practices (3rd ed.). New York: Springer.
Kolen, M.J., & Jarjoura, D. (1987). Analytic smoothing for equipercentile equating under the common item nonequivalent populations design. Psychometrika, 52, 43–59.
Levine, R. (1955). Equating the score scales of alternate forms administered to samples of different ability (Research Bulletin 55-23). Princeton, NJ: Educational Testing Service.
Lindquist, E.F. (Ed.) (1951). Educational measurement. Washington, DC: American Council on Education.
Lindsay, C.A., & Prichard, M.A. (1971). An analytic procedure for the equipercentile method of equating tests. Journal of Educational Measurement, 8, 203–207.
Linn, R.L. (1993). Linking results of distinct assessments. Applied Measurement in Education, 6, 83–102.
Lord, F.M. (1950). Notes on comparable scales for test scores (Research Bulletin 50-48). Princeton, NJ: Educational Testing Service.
Lord, F.M. (1965). A strong true score theory with applications. Psychometrika, 30, 239–270.
Lord, F.M. (1977). Practical applications of item characteristic curve theory. Journal of Educational Measurement, 14, 117–138.
Lord, F.M. (1980). Applications of item response theory to practical testing problems. Hillsdale, NJ: Erlbaum.
Lord, F.M. (1982). Item response theory and equating—A technical summary. In P.W. Holland & D.B. Rubin (Eds.), Test equating (pp. 141–149). New York: Academic.
Marco, G.L. (1977). Item characteristic curve solutions to three intractable testing problems. Journal of Educational Measurement, 14, 139–160.
Mislevy, R.J. (1992). Linking educational assessments: Concepts, issues, methods, and prospects. Princeton, NJ: ETS Policy Information Center.
Morris, C.N. (1982). On the foundations of test equating. In P.W. Holland & D.B. Rubin (Eds.), Test equating (pp. 169–191). New York: Academic.
Otis, A.S. (1916). The reliability of spelling scales, including a ‘deviation formula’ for correlation. School and Society, 4, 96–99.
Otis, A.S. (1922). The method for finding the correspondence between scores in two tests. Journal of Educational Psychology, 13, 529–545.
Petersen, N.S., Kolen, M.J., & Hoover, H.D. (1989). Scaling, norming, and equating. In R.L. Linn (Ed.), Educational measurement (3rd ed., pp. 221–262). New York: Macmillan.
Petersen, N.S., Marco, G.L., & Stewart, E.E. (1982). A test of the adequacy of linear score equating models. In P.W. Holland & D.B. Rubin (Eds.), Test equating (pp. 71–135). New York: Academic.
Stocking, M.L., & Lord, F.M. (1983). Developing a common metric in item response theory. Applied Psychological Measurement, 7, 201–210.
Thorndike, E.L. (1922). On finding equivalent scores in tests of intelligence. Journal of Applied Psychology, 6, 29–33.
Thorndike, R.L. (1951). Reliability. In E.F. Lindquist (Ed.), Educational measurement (pp. 560–620). Washington, DC: American Council on Education.
Thorndike, R.L. (Ed.) (1971). Educational measurement (2nd ed.). Washington, DC: American Council on Education.
Vaughn, K.W. (1951). Planning the objective test. In E.F. Lindquist (Ed.), Educational measurement (pp. 159–184). Washington, DC: American Council on Education.
von Davier, A.A. (Ed.) (2011). Statistical models for test equating, scaling, and linking. New York: Springer.
von Davier, A.A., Holland, P.W., & Thayer, D.T. (2004). The kernel method of test equating. New York: Springer.
Wright, B.D. (1977). Solving measurement problems with the Rasch model. Journal of Educational Measurement, 14, 97–116.
Wright, B.D., & Stone, M.H. (1979). Best test design. Chicago, IL: MESA.
Yoakum, C.S., & Yerkes, R.M. (1920). Army mental tests. New York: Holt.

15 A HISTORY OF RASCH MEASUREMENT THEORY

George Engelhard Jr. and Stefanie A. Wind¹

The history of science is the history of measurement (Cattell, 1893, p. 316)

In 1960, Rasch published his classic text, Probabilistic Models for Some Intelligence and Attainment Tests. In this text, he presented his measurement theory, which has been called “a truly new approach to psychometric problems … [that yields] nonarbitrary measures” (Loevinger, 1965, p. 151) that “embody the essential principles of measurement itself, the principles on which objectivity and reproducibility, indeed all scientific knowledge, are based” (Wright, 1980, p. xix). van der Linden (2016) has suggested that the first chapter of Rasch’s book should be required reading for anyone seeking to understand the transition from classical test theory (CTT) to item response theory (IRT). In his words, “One of the best introductions to this change of paradigm is Rasch (1960, Chapter 1), which is mandatory reading for anyone with an interest in the subject” (van der Linden, 2016, p. xvii). Our chapter adds to the narrative of why Rasch measurement theory has received these accolades. There has been a steady increase in the use of Rasch measurement theory, as evidenced by citations summarized in Web of Science (Engelhard & Wang, 2020). Figure 15.1 shows the steady increase in citations over time. Between January 1, 1990 and September 9, 2019, there were 847 results from a Web of Science search using the topic phrase “Rasch Measurement Theory.” The top five application areas are Psychology (28%), Health Care Sciences and Services (15%), Educational Research (14%), Rehabilitation Sciences (9%), and Environmental and Occupational Health (9%). Other methods of summarizing the use of Rasch measurement theory across fields highlight a similar growth in Rasch-based applications. For example, Aryadoust, Tan, and Ng (2019) provide a detailed review of Rasch measurement theory in areas related to psychology, medicine,

344 George Engelhard, Jr. and Stefanie A. Wind

FIGURE 15.1 Frequency of citations with theme of Rasch measurement theory (Web of Science, September 2019). [Plot: number of citations per year (N = 847); horizontal axis: Year, 1985 to 2020; vertical axis: Frequency, 0 to 100.]

and education. Since the principles of Rasch measurement theory have multidisciplinary relevance, they have proliferated across a variety of fields. The purpose of this chapter is to describe the principles of Rasch measurement theory, while highlighting the historical development of this approach. We discuss the key philosophical and historical aspects of Rasch measurement theory with an emphasis on specific objectivity and invariant measurement. The chapter is organized around the following guiding questions:  

• What is Rasch measurement theory?
• What are the major models and extensions of Rasch measurement theory?

We also include a third section that offers a few biographical considerations related to advancing Rasch measurement theory. Our main thesis is that Rasch measurement theory is not simply defined by a particular family of Rasch models or models that are labeled with his name. Rasch measurement theory is a framework for measurement that is defined by essential scientific principles based on the concepts of specific objectivity and invariance applied to models of measurement. Further, many of the research studies that cite Rasch measurement theory reflect continuing scientific progress that incorporates a careful consideration of measurement, statistical issues, and substantive issues in the social, behavioral and health sciences.

What is Rasch measurement theory?

Science is impossible without an evolving network of stable measures (Wright, 1997, p. 33)

In this section, we describe how Rasch measurement theory is situated within the broader historical and philosophical framework of theories of measurement during the 20th century. Next, we describe the key concepts that have been identified as distinctive in Rasch’s philosophy of measurement.

Where does Rasch fit in relationship to other theories of measurement?

It is helpful to consider three broad traditions in measurement (test-score, scaling, and structural traditions) in order to gain a perspective on Rasch’s contributions to the history of ideas regarding measurement during the 20th century. Figure 15.2 shows a concept map reflecting these three traditions. One of the oldest traditions in measurement theory is based on the simple sum score. Classical test theory is illustrative of this tradition (Gulliksen, 1950). Classical test theory defines an observed score (sum score) as being composed of two components: a true score and an error score. With a few simple assumptions, classical test theory can be used to obtain several useful indices of the psychometric quality of a set of scores related to the consistency, reliability, and precision of test scores (Traub, 1997). Measurement theories embedded within the test-score tradition have a total score focus. They are based on linear models that can be considered random effects models for estimating variance components (sources of error variance). The overarching goal of psychometric analyses based on this approach is to reduce noise and error variance in test scores. Classical test theory and other models within this tradition are tautological (Brennan, 1997), and empirical data cannot be used to falsify these models. In contrast to the scaling tradition, there is no underlying line or continuum reflecting a latent variable or construct. It

FIGURE 15.2 Three traditions of measurement: Test-Score, Scaling, and Structural Traditions

[Concept map: Measurement Theories branch into three traditions.]
Test-Score Tradition: test-score focus; linear models; variance; goal: reducing noise; no continuum. Example: Classical Test Theory.
Scaling Tradition: item-person response focus; non-linear models; invariance; goal: increasing signal; continuum (line). Example: Rasch Measurement Theory.
Structural Tradition: covariance focus; linear/non-linear models; invariance; goal: explore relationships among latent variables. Examples: Factor Analysis, Path Analysis, Structural Equation Models.


should be noted that because the sum score is a sufficient statistic for estimating person locations in the Rasch model, there is a one-to-one correspondence between the ordering of persons using classical test theory and Rasch measurement theory.

The next tradition is the scaling tradition. Rasch models are examples of measurement models within this tradition. The focus of measurement models in the scaling tradition is the development of probabilistic models for individual person responses to each item included in a measurement instrument. The models are non-linear (they use a logistic link function), and their major benefit is that they facilitate the development of an invariant scale to represent a latent variable or construct. Scaling models stress the goal of defining a line (linear scale) on which locations for items, persons, and other variables are estimated, and which has the potential to remain stable over a variety of conditions (e.g., different items, different persons, and different contexts). In contrast to the focus on noise (i.e., measurement error) in the test-score tradition, a major goal in the scaling tradition is to increase the signal regarding the locations of persons and items on the underlying continuum. (See Engelhard (2013) for more detailed discussion of models classified into the test-score and scaling traditions.) Rasch measurement theory has a close relationship to item response theory. The creation of a continuum to represent a latent variable or construct is the defining characteristic of scaling models, and the Rasch model provides an approach that simultaneously locates both persons and items on the line (see Briggs, Chapter 12, this volume, for a more detailed discussion of scaling).

Measurement models within the structural tradition have a different focus compared to the other two traditions in measurement theory. First of all, structural models focus on reproducing covariance or correlation matrices (Bollen, 1989). In this sense, the estimation of person locations on a line is not a direct goal. Next, measurement models within the structural tradition have their genesis in traditional factor analysis and structural theories (Mulaik, 1972). Theories in the structural tradition can be traced back to Spearman (1904a, 1904b, 1907, 1910), and Spearman’s idea to disattenuate correlations for measurement errors using correlations between two parallel forms of a test (Traub, 1997). Measurement models within a structural tradition were also considered by Thurstone (1931). Jöreskog (1974) is a key example of the use of models in the structural tradition to address measurement issues. Research on validity has a long history of including principles from within the structural tradition (Loevinger, 1965; Messick, 1995). For example, the Test Standards (AERA, APA and NCME, 2014) include structural evidence with regard to providing validity evidence to support intended uses of test scores. Recent advances in the structural tradition include developments in factor analysis that include the use of non-linear models for dichotomous and polytomous data that blend the distinctions between the test-score and scaling


traditions. Latent variables defined by a variety of measurement models can be used to explore structural relationships among those latent variables. As is well known, structural equation models can be viewed as path analysis combined with latent variables. It is our perspective that the use of Rasch measurement theory within a structural equation modeling approach is an exciting area for future developments in extending Rasch measurement theory.

What makes Rasch measurement theory a distinctive philosophy of measurement?

Present day statistical methods are entirely group-centered, so that there is a real need for developing individual-centered statistics. (Rasch, 1961, p. 321)

Rasch was motivated by a concern with the development of individual-centered statistics. His solution to the problem of only focusing on group-centered statistics led him to propose a set of requirements for specific objectivity in individual-centered measurement:

• The comparison between two stimuli should be independent of which particular individuals were instrumental for the comparison;
• and it should also be independent of which stimuli within the considered class were or might also have been compared.
• Symmetrically, a comparison between two individuals should be independent of which particular stimuli within the class considered were instrumental for the comparison;
• and it should also be independent of which other individuals were also compared on the same or on some other occasion (Rasch, 1961, pp. 331–332)

The first two requirements in this list suggest that item calibrations (stimuli) should be invariant over the persons that are used to obtain the comparisons: person-invariant calibration of items. The last two requirements suggest that person measurement should be invariant over the particular items (stimuli) that are used to obtain the comparisons: item-invariant measurement of persons. Wright (1968) stressed the idea of objective measurement as being a key aspect of Rasch measurement theory. In his words,

Progress will continue to be slow until we find a way to work with measurements which are objective, measurements which remain a property of the person measured regardless of the items he answers or the company he keeps. (p. 101)


Wright (1968) stated his version of Rasch’s requirements for objective measurement as follows:

First, the calibration of measuring instruments must be independent of those objects that happen to be used for calibration. Second, the measurement of objects must be independent of the instrument that happens to be used for the measuring. (p. 87)

Engelhard (2013) stressed the close connections between objectivity and invariance. Based on Nozick (2001), objectivity includes several key features: accessibility, intersubjectivity, independence, and invariance. Objective statements are accessible from different angles, implying that they can be repeated by different observers and at different times. Intersubjectivity implies that there is agreement among observers about a scientific fact. Next, objective statements are independent of the particular observers. Finally, invariance implies the first three features of accessibility, intersubjectivity, and independence. Box 15.1 lists five requirements for invariant measurement (Engelhard, 2013).

BOX 15.1 FIVE REQUIREMENTS OF INVARIANT MEASUREMENT (ENGELHARD, 2013)

Item-invariant measurement of persons
1. The measurement of persons must be independent of the particular items that happen to be used for the measuring.
2. A more able person must always have a better chance of success on an item than a less able person (Non-crossing person response functions).

Person-invariant calibration of test items
3. The calibration of the items must be independent of the particular persons used for calibration.
4. Any person must have a better chance of success on an easy item than on a more difficult item (Non-crossing item response functions).

Unidimensional scale
5. Items and persons must be simultaneously located on a single underlying latent variable (Wright Map).
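
The first four requirements can be illustrated numerically for the dichotomous Rasch model. The following is a minimal sketch, not taken from the chapter: the function names and all person and item locations are hypothetical values chosen for illustration. It shows that the log-odds comparison of two persons is the same whichever item is used, and that the comparison of two items is the same whichever person is used.

```python
import math


def rasch_probability(theta: float, delta: float) -> float:
    """Probability of success for a person at theta on an item at delta (logits)."""
    return 1.0 / (1.0 + math.exp(-(theta - delta)))


def log_odds(p: float) -> float:
    """Natural log of the odds p / (1 - p)."""
    return math.log(p / (1.0 - p))


# Hypothetical locations in logits, chosen only for illustration.
theta_high, theta_low = 1.5, -0.5
delta_easy, delta_hard = -1.0, 2.0
other_items = [-2.0, 0.0, 1.0, 3.0]
other_persons = [-3.0, -1.0, 0.5, 2.5]

# Item-invariant comparison of two persons: one value per item used.
person_comparisons = [
    log_odds(rasch_probability(theta_high, d)) - log_odds(rasch_probability(theta_low, d))
    for d in other_items
]

# Person-invariant comparison of two items: one value per person used.
item_comparisons = [
    log_odds(rasch_probability(t, delta_easy)) - log_odds(rasch_probability(t, delta_hard))
    for t in other_persons
]

print(person_comparisons)  # each entry is close to theta_high - theta_low
print(item_comparisons)    # each entry is close to delta_hard - delta_easy
```

Because the person and item parameters enter the model only through their difference, the more able person also has the higher success probability on every item, which is the non-crossing property named in the box.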


These requirements reflect the perspective that characterizes Rasch measurement theory. Invariance has been sought by many of the measurement theorists of the 20th century, such as Thurstone (1925, 1926), Guttman (1944, 1950) and Mokken (1971). Engelhard (2008) discussed the history of invariant measurement in a focus article presented with commentaries and reactions in a special issue of Measurement: Interdisciplinary Research and Perspectives (2008, Volume 6, Number 3). Rasch (1964) considered objectivity as one of the two primary concepts for measurement (the other is comparisons). In his words,

Looking then for concepts [of measurement] that could possibly be taken as primary it seems worth-while to concentrate upon two essential characteristics of “scientific statements”: 1. they are concerned with “comparisons”; 2. the statements are claimed to be “objective.” (Rasch, 1964, p. 2)

In considering objectivity, Rasch (1977) used the term specific objectivity. He meant that the invariance properties of the Rasch model should be examined in detail (model-data fit) to identify situations where the data fit the requirements of invariant measurement. The adjective “specific” refers to delineation of specific situations where the invariance is achieved or not achieved. This idea is also pointed out by Nozick (2001):

What is objective about something, I have claimed, is what is invariant from different angles, across different perspectives, under different transformations. Yet often what is variant is what is especially interesting. We can take different perspectives on a thing (the more angles the better), and notice which of its features are objective and invariant, and also notice which of its features are subjective and variant. (p. 102)

Rasch measurement theory provides a framework for examining the hypothesis that invariant measurement has been achieved. Rasch measurement theory provides the requirements of an ideal model that must be confirmed with fallible data collected to support the creation of a Rasch scale. Engelhard and Wang (2020) summarized Rasch’s philosophy as being the application of basic principles of science to measurement. Andrich (2018) has stressed that Rasch measurement theory reflects a distinctive epistemology. In an interview with Rasch, Andrich (2002) states that Rasch “believed his insight went beyond the matter of social science measurement, he believed the ingredient of an invariant comparison was an integral part of the possibility of knowledge itself” (Olsen, 2003, p. 151). Andrich (2018) argued that it is Rasch’s position on epistemology that is the basis of a case for Rasch measurement theory (p. 72). Invariant measurement includes several related concepts. Wright and Masters (1982) described invariant measurement using the concepts of objectivity,


sufficiency, separability and additivity. Objectivity implies item-invariant measurement and person-invariant item calibration on a unidimensional scale. Sufficiency generally refers to the use of the simple sum score that represents all of the information needed for locating a person or item on a latent continuum. Sufficient statistics also allow for conditional maximum likelihood estimation that is not available for other IRT models (Andersen, 1977). Separability suggests that person and item parameters can be isolated and implies noncrossing item response functions and person response functions. This property of separability is akin to additive linear models with two factors (persons and items) that do not include an interaction term. Additivity implies that the measurement model connects the person and item parameters by addition or subtraction. This is sometimes called scale invariance. Another interesting set of concepts related to Rasch measurement theory can be connected to additive conjoint measurement (Luce and Tukey, 1964; Perline, Wright and Wainer, 1979). Rost (2001) explicitly considered the concepts that are essential for extensions of the Rasch model to be considered Rasch measurement theory. Here is his definition: a Rasch model is an item response model aimed at measuring one or more quantitative latent variables on a metric level of measurement, and that has the properties of sufficiency, separability, specific objectivity, and latent additivity. (p. 27) Von Davier and Carstensen (2007) considered the monotonicity properties, local independence, and sufficiency of total scores as key to understanding Rasch measurement theory. In summary, the concepts that define Rasch measurement theory can be generally viewed as related to invariant measurement including the application of scientific principles to measurement in the social, behavioral and health sciences. 
Many of the specific requirements of Rasch measurement theory, such as unidimensionality, are relaxed in some extended Rasch models (discussed later in this chapter). We believe that the basic epistemology of Rasch measurement theory reflected in invariant measurement forms the basis for considering the strengths and weaknesses of the extended Rasch models.
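
The sufficiency property described above can be checked directly for a short test. The sketch below is an illustration under assumed values (three hypothetical item locations and two hypothetical person locations) and is not an estimation routine: it computes the probability of each response pattern given the sum score and shows that this conditional distribution does not depend on the person location, which is the fact that conditional maximum likelihood estimation relies on.

```python
import math
from itertools import product


def pattern_probability(pattern, theta, deltas):
    """Joint probability of a 0/1 response pattern under the dichotomous Rasch model."""
    prob = 1.0
    for x, delta in zip(pattern, deltas):
        p = 1.0 / (1.0 + math.exp(-(theta - delta)))
        prob *= p if x == 1 else 1.0 - p
    return prob


def conditional_pattern_probabilities(theta, deltas, score):
    """Probability of each pattern with the given sum score, conditional on that score."""
    patterns = [p for p in product((0, 1), repeat=len(deltas)) if sum(p) == score]
    joint = {p: pattern_probability(p, theta, deltas) for p in patterns}
    total = sum(joint.values())
    return {p: value / total for p, value in joint.items()}


# Hypothetical item locations (logits) and two very different person locations.
deltas = [-1.0, 0.0, 1.5]
low_person = conditional_pattern_probabilities(-2.0, deltas, score=2)
high_person = conditional_pattern_probabilities(2.0, deltas, score=2)

for pattern in low_person:
    # The two columns agree up to floating-point error.
    print(pattern, low_person[pattern], high_person[pattern])
```

Once the sum score is known, the particular pattern carries information about the items only, so item locations can be estimated without reference to the person locations.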

What are the major models and extensions of Rasch measurement theory?

One answer to the question of “what is Rasch measurement theory?” is to list the family of Rasch models including various extensions of Rasch measurement theory. In this section, we briefly describe several Rasch models, as well as several extensions to Rasch measurement theory. Figure 15.3 provides a concept map for Rasch


FIGURE 15.3 Concept map for Rasch measurement theory

[Concept map: Rasch Measurement Theory branches into two groups.]
Family of Rasch Models: Dichotomous, Partial Credit, Rating Scale, Binomial, Poisson, Facets Model.
Extensions of Rasch Models: Mixed, Multilevel, Multidimensional.

measurement theory. The family of Rasch models developed by Wright and his graduate students at The University of Chicago contributed substantially to our understanding of Rasch measurement theory. Wright and Masters (1982) describe the dichotomous, partial credit, rating scale, binomial, and Poisson models. The Facets model is described by Linacre (1989). These Rasch models are all unidimensional with probabilities modeled across adjacent categories. The operating characteristic function for these models is shown in Table 15.1. It is also helpful to view these models in their log-odds forms to highlight the relationships among these models. The log-odds forms are shown in Table 15.2.

Table 15.3 lists a selection of major books that reflect developments in Rasch measurement theory over the past six decades. This timeline starts with the publication of Rasch’s book in 1960 and continues through books on Rasch measurement theory by Andrich and Marais (2019) and Engelhard and Wang (2021). In the early decades, research concentrated on conceptual and theoretical developments, as well as developing estimation methods for these models. Next, the focus turned to demonstrations of Rasch models for solving practical measurement problems. For example, the Objective Measurement book series was designed to publish exemplary research using Rasch measurement theory to solve important measurement problems (Engelhard and Wilson, 1996; Wilson, 1992, 1994; Wilson, Engelhard and Draney, 1997; Wilson & Engelhard, 2000).

Turning now to extensions of Rasch measurement theory, Figure 15.3 highlights three major approaches for extending Rasch measurement theory. These are based on mixed, multilevel, and multidimensional models. Examples of


TABLE 15.1 General form of the operating characteristic function for defining a family of unidimensional Rasch measurement models

Operating characteristic function (probability of moving across adjacent categories k-1 to k; k = 1 to K):

$$\frac{P_{njik}}{P_{nji(k-1)} + P_{njik}} = \frac{\exp(\theta_n - \delta_{jik})}{1 + \exp(\theta_n - \delta_{jik})}$$

Rasch Models | $\delta_{jik}$ defined as
Dichotomous | $\delta_i$
Binomial Trials | $\delta_i + \ln\left(\frac{k}{m-k+1}\right)$
Poisson Counts | $\delta_i + \ln(k)$
Partial Credit | $\delta_{ik}$
Rating Scale | $\delta_i + \tau_k$
Many Facet Partial Credit | $\lambda_j + \delta_{ik}$
Many Facet Rating Scale | $\lambda_j + \delta_i + \tau_k$

Note: $\theta_n$ is the location of person n on the latent variable, $\delta_i$ is the location of item i on the latent variable, $\tau_k$ is the category coefficient, $\lambda_j$ is the location of rater j on the latent variable, and m is the number of independent attempts in the binomial trials model.

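TABLE 15.2 Log-odds format for family of Rasch models

Rasch Model | Log-odds format
Dichotomous | $\ln\left(\frac{P_{ni1}}{P_{ni0}}\right) = \theta_n - \delta_i$
Partial Credit | $\ln\left(\frac{P_{nik}}{P_{ni(k-1)}}\right) = \theta_n - \delta_{ik}$
Rating Scale | $\ln\left(\frac{P_{nik}}{P_{ni(k-1)}}\right) = \theta_n - \delta_i - \tau_k$
Binomial Trials | $\ln\left(\frac{P_{nik}}{P_{ni(k-1)}}\right) = \theta_n - \delta_i - \ln\left(\frac{k}{m-k+1}\right)$
Poisson Counts | $\ln\left(\frac{P_{nik}}{P_{ni(k-1)}}\right) = \theta_n - \delta_i - \ln(k)$
Facets Model | $\ln\left(\frac{P_{njik}}{P_{nji(k-1)}}\right) = \theta_n - \lambda_j - \delta_i - \tau_k$

Note: $\theta_n$ is the location of person n on the latent variable, $\delta_i$ is the location of item i on the latent variable, $\tau_k$ is the category coefficient, $\lambda_j$ is the location of rater j on the latent variable, and m is the number of independent attempts in the binomial trials model.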
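
The adjacent-category log-odds in Table 15.2 determine the full set of category probabilities for an item. The following is a minimal sketch, assuming hypothetical threshold values and function names that do not come from the chapter: it accumulates the adjacent-category log-odds, normalizes them into probabilities, and then recovers the log-odds and the operating characteristic function of Table 15.1 from the result.

```python
import math


def category_probabilities(theta, thresholds):
    """Return [P_0, ..., P_K] given ln(P_k / P_(k-1)) = theta - thresholds[k-1]."""
    cumulative = [0.0]  # log-numerator of category 0 is fixed at zero
    for delta_k in thresholds:
        cumulative.append(cumulative[-1] + (theta - delta_k))
    numerators = [math.exp(c) for c in cumulative]
    total = sum(numerators)
    return [n / total for n in numerators]


def rating_scale_thresholds(delta_i, taus):
    """Rating Scale model: each threshold is the item location plus a category coefficient."""
    return [delta_i + tau for tau in taus]


# Hypothetical person location and Partial Credit thresholds (logits).
theta = 0.5
pc_thresholds = [-1.0, 0.2, 1.3]
probs = category_probabilities(theta, pc_thresholds)

# Recover the adjacent-category log-odds from the probabilities.
adjacent_log_odds = [math.log(probs[k] / probs[k - 1]) for k in range(1, len(probs))]

# Operating characteristic function: P_k / (P_(k-1) + P_k) for each k.
operating_characteristics = [probs[k] / (probs[k - 1] + probs[k]) for k in range(1, len(probs))]

print(probs)
print(adjacent_log_odds)  # close to theta minus each threshold
```

With a single threshold the same function reduces to the dichotomous Rasch model, and passing the output of `rating_scale_thresholds` gives Rating Scale probabilities, which is one way to see why the table treats these models as one family.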

models that reflect these general approaches for extending the Rasch model are described below. It should be noted that the structure of extended Rasch models is not easily combined under one general approach like the earlier family of Rasch models. Rost (1990) proposed a mixed Rasch model for combining latent class models with Rasch measurement theory. The basic premise of these mixed models is that


TABLE 15.3 Selection of key books on Rasch measurement theory (1960 through 2020)

Dates | Authors | Titles
1960 | Rasch (1960) | Probabilistic models for some intelligence and attainment tests
1970 | Wright & Stone (1979) | Best Test Design: Rasch Measurement
1980 | Wright (1980) | Rasch’s book republished with foreword and afterword by Wright
1980 | Wright & Masters (1982) | Rating Scale Analysis: Rasch Measurement
1980 | Andrich (1988) | Rasch Models for Measurement
1980 | Linacre (1989) | Many-facet Rasch Measurement
1990 | Wilson (1992, 1994) (Ed.) | Objective Measurement: Theory into Practice (Volumes 1–2)
1990 | Fischer & Molenaar (1995) (Eds.) | Rasch models: Foundations, recent developments, and applications
1990 | Engelhard & Wilson (1996) (Eds.) | Objective Measurement: Theory into Practice (Volume 3)
1990 | Wilson, Engelhard, & Draney (1997) (Eds.) | Objective Measurement: Theory into Practice (Volume 4)
2000 | Wilson & Engelhard (2000) (Eds.) | Objective Measurement: Theory into Practice (Volume 5)
2000 | Bond & Fox (2001) | Applying the Rasch Model: Fundamental measurement in the human sciences
2000 | Wilson (2005) | Constructing Measures: An item response modeling perspective
2000 | Von Davier & Carstensen (2007) (Eds.) | Multivariate and mixture distribution Rasch models
2010 | Garner, Engelhard, Fisher, & Wilson (2010) (Eds.) | Advances in Rasch Measurement (Volume 1)
2010 | Brown, Duckor, Draney, & Wilson (2011) (Eds.) | Advances in Rasch Measurement (Volume 2)
2010 | Engelhard (2013) | Invariant measurement: Using Rasch models in the social, behavioral, and health sciences
2010 | Engelhard & Wind (2018) | Invariant measurement with raters and rating scales: Rasch models for rater-mediated assessments
2010 | Smith & Wind (2018) | Rasch measurement models: Interpreting Winsteps and Facets Output
2010 | Andrich & Marais (2019) | A course in Rasch measurement theory: Measuring in the educational, social and health sciences
2020 | Engelhard & Wang (2021) | Rasch models for solving measurement problems: Invariant measurement in the social sciences

Note: This list should not be considered exhaustive – it is reflective of our personal journeys in understanding Rasch measurement theory.


Rasch measurement theory can be successfully applied within particular latent classes of persons. The parameters of the Rasch models can vary across latent classes, but good model-data fit is sought within each latent class. Rost (2001) suggested several extensions based on combining various Rasch models with latent class analyses.

Adams and Wilson (1996) proposed a random coefficients multinomial logit (RCML) model that can be used as a general framework for estimating the Rasch models shown in Table 15.1. The RCML model provides item design matrices for a variety of Rasch models. It is also possible to estimate other models by using alternative item design matrices. Some of the models that can be estimated within this framework include the dichotomous Rasch model (Rasch, 1960/1980), the Partial Credit model (Masters, 1982), the Rating Scale model (Andrich, 1978), the linear logistic test model (Fischer, 1973), and the Facets model (Linacre, 1989). The RCML model can also be used to estimate multilevel models (Adams, Wilson, and Wu, 1997). Some examples of multilevel Rasch models have been provided by Kamata (2001), and Van den Noortgate, De Boeck, and Meulders (2003). Extensions of multilevel Rasch measurement models can be based on viewing items, persons, or both persons and items as random effects (De Boeck, 2008).

The RCML model has been extended to create a multidimensional model: the multidimensional random coefficients multinomial logit (MRCML) model (Adams, Wilson and Wang, 1997). The MRCML model provides the opportunity to estimate two or more dimensions based on a design matrix for persons, as well as a design matrix for items. The MRCML model offers flexible design matrices for item difficulties and the inclusion of multiple latent variables. This model has been used in international assessments, such as the Programme for International Student Assessment (PISA, 2016) administered by the Organisation for Economic Co-operation and Development (OECD).

In this section, we identified several unidimensional extensions of Rasch measurement theory including RCML (Adams and Wilson, 1996), and several multilevel models (Adams, Wilson, and Wu, 1997; Kamata, 2001; Van den Noortgate, De Boeck, and Meulders, 2003). We also considered the mixed dichotomous Rasch model (Rost, 1990) as combining unidimensional Rasch models with latent class analysis. We also identified multidimensional extensions based on the MRCML model (Adams, Wilson, and Wang, 1997). An edited volume by Von Davier and Carstensen (2007) provides a description of additional extensions of the Rasch model.
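
As a rough illustration of the mixed Rasch model idea described above, the marginal success probability can be written as a weighted sum of class-specific dichotomous Rasch models. The notation here is an assumption made for this sketch and may differ from Rost's (1990) own symbols: $G$ is the number of latent classes, $\pi_g$ is the proportion of persons in class $g$, and $\theta_{ng}$ and $\delta_{ig}$ are the person and item locations within class $g$.

```latex
P(X_{ni} = 1) = \sum_{g=1}^{G} \pi_g \,
  \frac{\exp(\theta_{ng} - \delta_{ig})}{1 + \exp(\theta_{ng} - \delta_{ig})},
\qquad \sum_{g=1}^{G} \pi_g = 1, \quad \pi_g > 0 .
```

When $G = 1$ this reduces to the dichotomous model in Table 15.1, which is the sense in which the Rasch model holds within each class while item locations are free to differ between classes.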

Biographical Considerations

Sokal (1984) pointed out that there are multiple approaches to the history of measurement (Engelhard, 1997). This chapter has focused primarily on the history of ideas and philosophy related to Rasch measurement theory. Another approach is based on biographical considerations of key measurement theorists. In this section, we suggest biographical resources for readers who may have an interest in Rasch from a more personal perspective. We also highlight the contributions of Ben


Wright who played an important role in the development and advancement of the basic principles of Rasch measurement theory.

Georg Rasch

Georg Rasch was a Danish mathematician and statistician. Rasch developed a theory of measurement based on the principles of specific objectivity. His research was enhanced by a strong scholarly relationship with Professor Benjamin D. Wright at The University of Chicago. Many biographical details of Rasch’s life are available in Wright’s foreword to the reissued version of Rasch’s book in 1980 (Wright, 1980). Another important resource is an interview conducted with Rasch by Andrich (1995). Rasch’s obituary by Andersen (1982) offers additional details of Rasch’s life. Olsen (2003) wrote a dissertation on Rasch that describes professional and personal details of his contributions to statistics.

Benjamin D. Wright

Benjamin Wright at The University of Chicago introduced many of us in the measurement community to Rasch measurement theory. Wright was a major proponent of and contributor to the development of Rasch measurement theory. Wright worked with numerous students and colleagues all over the world during his career. Current computer software including Winsteps and Facets (Smith & Wind, 2018) can trace its roots back to Wright’s early algorithms for estimating parameters for various members of the family of Rasch models. Many of the personal details of Wright’s life are considered in Wilson and Fisher (2017). Recently, Smith (2019) presented a timeline that highlights key players in developing Rasch measurement theory. The biographical considerations in this section focus on developments in the United States because of our personal biographies (see Andersen [1982] for details on international developments). Readers who are interested in scholarly family trees related to Rasch measurement theory can refer to Smith (2019) and Wijsen, Borsboom, Cabaco and Heiser (2019).

Andrich (2004) has pointed out that controversies over Rasch measurement theory can be viewed from the perspective of incompatible paradigms. Paradigm shifts inevitably include conflicting perspectives and disagreements among scholars (Kuhn, 1962). An example of this conflict is illustrated by the so-called “Rasch Wars” in language testing (McNamara & Knoch, 2012). There are other examples of resistance to and eventual acceptance of Rasch measurement theory within a variety of fields. There are clearly multiple histories of measurement, and these various perspectives form a mosaic that can contribute not only to our knowledge of progress in measurement but also to our knowledge of progress in the social sciences more generally (Engelhard, 1997). This chapter on the history of Rasch measurement theory is meant to add to this mosaic.


Summary and Discussion

The concept of “objectivity” raises fundamental problems in all sciences. For a statement to be scientific, “objectivity” is required. (Rasch, 1964, p. 1)

We opened this chapter with a quote from Cattell (1893) who argued that the history of science is the history of measurement. In many ways, the continuing influences of Rasch measurement theory reflect the recognition that measurement is a fundamental aspect of the social sciences, and that increased attention is needed in developing psychometrically sound indicators of the key constructs in our substantive theories. Rasch measurement theory offers a strong measurement base for researchers in the social, behavioral, and health sciences who care about building substantive theories that inform both theory and practice. Rasch measurement theory can be productively viewed as a paradigm shift (Andrich, 2004), but it can also be conceptualized as reflecting systemic progress in the application of scientific principles to problems in measurement. In many ways, the coalescence of the test score and scaling traditions within the structural tradition represents the merging of measurement and statistical methods with substantive research. Maul, Mari, Torres Irribarra, and Wilson (2018) discuss evaluating measurement from a structural perspective. On the one hand, Rasch measurement theory can be viewed as a comprehensive family of statistical models that are united by the requirements of specific objectivity and the quest for invariance. On the other hand, Rasch measurement theory can be viewed as applying the basic principles that underlie scientific methodology in general to measurement (Rasch, 1977). There are multiple approaches to history; accordingly, it should be highlighted that there is not one definitive history – much as in the classic movie Rashomon, historians integrate their own narratives into personal histories. Rasch (1960/1980) pointed out “the model is not true … no models are – not even Newtonian laws. Models should not be true, but it is important that they are applicable” (pp. 37–38). 
This is similar to the aphorism by Box (1976) that “Since all models are wrong the scientist must be alert to what is importantly wrong. It is inappropriate to be concerned about mice when there are tigers abroad” (p. 792). We can borrow this point and say that all histories are wrong – we are developing a story that we hope has the potential to guide our progress in measurement.

Note

1 The authors would like to thank Jue Wang, William Fisher, and Richard Luecht for their helpful comments on an earlier draft of this chapter.


References

Adams, R. J., & Wilson, M. R. (1996). Formulating the Rasch model as a mixed coefficients multinomial model. In G. Engelhard Jr. & M. R. Wilson (Eds.), Objective measurement: Theory into practice (Vol. 3, pp. 143–166). Norwood, NJ: Ablex.
Adams, R. J., & Wu, M. L. (2007). The Mixed-Coefficients Multinomial Logit Model: A Generalized Form of the Rasch Model. In M. von Davier & C. H. Carstensen (Eds.), Multivariate and mixture distribution Rasch models: Extensions and applications (pp. 57–75). New York: Springer.
Adams, R. J., Wilson, M., & Wang, W. (1997). The Multidimensional Random Coefficients Multinomial Logit Model. Applied Psychological Measurement, 21(1), 1–23.
Adams, R. J., Wilson, M. R., & Wu, M. L. (1997). Multilevel item response modelling: An approach to errors in variables regression. Journal of Educational and Behavioral Statistics, 22, 47–76.
American Educational Research Association, American Psychological Association, and National Council on Measurement in Education (2014). Standards for educational and psychological testing. Washington, DC: AERA.
Andersen, E. B. (1977). Sufficient statistics and latent trait models. Psychometrika, 42, 69–81.
Andersen, E. B. (1982). Georg Rasch (1901–1980). Psychometrika, 47, 375–376.
Andrich, D. A. (2018). A Rasch measurement theory. In F. Guillemin, A. Leplege, S. Briancon, E. Spitz, & J. Coste (Eds.), Perceived health and adaptation in chronic disease (pp. 66–91). New York: Routledge.
Andrich, D. (2004). Controversy and the Rasch model: A characteristic of incompatible paradigms? Medical Care, 42(1), I7–I16.
Andrich, D. (1995). Rasch and Wright: The early years (transcript of a 1981 interview with Ben Wright). In J. M. Linacre (Ed.), Rasch Measurement Transactions, Part 1 (pp. 1–4 [http://www.rasch.org/rmt/rmt0.htm]). Chicago, IL: MESA Press.
Andrich, D. A. (1988). Rasch models for measurement. Newbury Park, CA: Sage.
Andrich, D. A. (1978). A rating formulation for ordered response categories. Psychometrika, 43, 561–573.
Andrich, D., & Marais, I. (2019). A course in Rasch measurement theory: Measuring in the educational, social and health sciences. Singapore: Springer.
Aryadoust, V., Tan, H. A. H., & Ng, L. Y. (2019). A Scientometric Review of Rasch Measurement in Psychology, Medicine, and Education: The Rise and Progress of a Specialty. Frontiers in Psychology, 10, 2197. doi:10.3389/fpsyg.2019.02197.
Baker, F. B., & Kim, S. (2004). Item response theory: Parameter estimation techniques. Second edition, Revised and expanded. New York: Marcel Dekker.
Bollen, K. A. (1989). Structural equations with latent variables. New York: Wiley.
Bond, T. G., & Fox, C. M. (2001). Applying the Rasch model: Fundamental measurement in the human sciences. Mahwah, NJ: Erlbaum.
Box, G. E. P. (1976). Science and statistics. Journal of the American Statistical Association, 71(356), 791–799.
Brennan, R. L. (1997). A perspective on the history of generalizability theory. Educational Measurement: Issues and Practice, 16(4), 14–20.
Brown, N. J. S., Duckor, B., Draney, K., & Wilson, M. (2011). Advances in Rasch Measurement, Volume 2. Maple Grove, MN: JAM Press.
Cattell, J. M. (1893). Mental measurement. Philosophical Review, 2, 316–332.
De Boeck, P. (2008). Random IRT models. Psychometrika, 73(4), 533–559.


Engelhard, G. (Ed.) (1997). Introduction to special issue on history of measurement theory. Educational Measurement: Issues and Practice, Summer, 5–7.
Engelhard, G. (2008). Historical perspectives on invariant measurement: Guttman, Rasch, and Mokken [Focus article]. Measurement: Interdisciplinary Research and Perspectives, 6, 1–35.
Engelhard, G. (2012). Rasch measurement theory and factor analysis. Rasch Measurement Transactions, 26, 1375.
Engelhard, G. (2013). Invariant measurement: Using Rasch models in the social, behavioral, and health sciences. New York: Routledge.
Engelhard, G., & Wang, J. (2020). Developing a concept map for Rasch measurement theory. In M. Wiberg, S. Culpepper, R. Janssen, J. González, & D. Molenaar (Eds.), Quantitative Psychology: The 84th Annual Meeting of the Psychometric Society (pp. 19–29). New York: Springer.
Engelhard, G., & Wang, J. (2021). Rasch models for solving measurement problems: Invariant measurement in the social sciences. New York: Sage.
Engelhard, G., & Wilson, M. (Eds.). (1996). Objective Measurement: Theory into Practice, Volume 3. Norwood, NJ: Ablex.
Engelhard, G., Wilson, M., & Draney, K. (Eds.) (1997). Objective Measurement: Theory into Practice, Volume 4. Norwood, NJ: Ablex.
Engelhard, G., & Wind, S. A. (2018). Invariant measurement with raters and rating scales: Rasch models for rater-mediated assessments. New York: Routledge.
Fischer, G. (1973). The linear logistic test model as an instrument in educational research. Acta Psychologica, 37, 359–374.
Fischer, G., & Molenaar, I. W. (Eds.) (1995). Rasch models: Foundations, recent developments, and applications. New York: Springer-Verlag.
Garner, M., Engelhard, G., Wilson, M., & Fisher, W. (Eds.) (2010). Advances in Rasch Measurement, Volume 1. Maple Grove, MN: JAM Press.
Gulliksen, H. (1950). Theory of mental tests. New York: Wiley.
Guttman, L. (1944). A basis for scaling qualitative data. American Sociological Review, 9(2), 139–150.
Guttman, L. (1950). The basis for scalogram analysis. In S. A. Stouffer, L. Guttman, E. A. Suchman, P. F. Lazarsfeld, S. A. Star, and J. A. Clausen (Eds.), Measurement and Prediction (Volume IV, pp. 60–90). Princeton, NJ: Princeton University Press.
Joreskog, K. G. (1974). Analyzing psychological data by structural analysis of covariance matrices. In D. H. Krantz, R. C. Atkinson, R. D. Luce, & P. Suppes (Eds.), Contemporary developments in mathematical psychology (Vol. 2, pp. 1–56). San Francisco: W. H. Freeman.
Kamata, A. (2001). Item analysis by the hierarchical generalized linear model. Journal of Educational Measurement, 38(1), 79–93.
Kuhn, T. S. (1962). The Structure of Scientific Revolutions. Chicago: University of Chicago Press.
Laudan, L. (1977). Progress and its problems: Toward a theory of scientific change. Berkeley, CA: University of California Press.
Linacre, J. M. (1989). Many-facet Rasch measurement. Chicago: MESA Press.
Loevinger, J. (1965). Person and population as psychometric concepts. Psychological Review, 72, 143–155.
Luce, R. D., & Tukey, J. W. (1964). Simultaneous conjoint measurement: A new type of fundamental measurement. Journal of Mathematical Psychology, 1, 1–27.
Masters, G. N. (1982). A Rasch model for partial credit scoring. Psychometrika, 47, 149–174.


Maul, A., Mari, L., Torres Irribarra, D., & Wilson, M. (2018). The quality of measurement results from a structural perspective. Measurement, 116, 611–620.
McNamara, T., & Knoch, U. (2012). The Rasch wars: The emergence of Rasch measurement in language testing. Language Testing, 29(4), 555–576.
Messick, S. (1995). Validity of psychological assessment: Validation of inferences from persons’ responses and performances as scientific inquiry into score meaning. American Psychologist, 50(9), 741–749.
Mokken, R. J. (1971). A theory and procedure of scale analysis. The Hague: Mouton / Berlin: De Gruyter.
Mulaik, S. A. (1972). The foundations of factor analysis. New York: McGraw Hill.
Nozick, R. (2001). Invariances: The structure of the objective world. Cambridge, MA: The Belknap Press of Harvard University Press.
Olsen, L. W. (2003). Essays on Georg Rasch and his contributions to statistics. Unpublished PhD thesis at the Institute of Economics, University of Copenhagen.
Perline, R., Wright, B. D., & Wainer, H. (1979). The Rasch Model as Additive Conjoint Measurement. Applied Psychological Measurement, 3(2), 237–255.
PISA (2016). PISA 2015 Results (Volume I): Excellence and Equity in Education. Paris: OECD Publishing.
Rasch, G. (1960/1980). Probabilistic models for some intelligence and attainment tests. Copenhagen: Danish Institute for Educational Research. (Expanded edition, Chicago: University of Chicago Press, 1980).
Rasch, G. (1961). On general laws and meaning of measurement in psychology. In J. Neyman (Ed.), Proceedings of the fourth Berkeley Symposium on mathematical statistics and probability (pp. 321–333). Berkeley: University of California Press.
Rasch, G. (1964). On Objectivity and Models for Measuring. Lecture notes edited by Jon Stene. https://www.rasch.org/memo196z.pdf.
Rasch, G. (1977). On specific objectivity: An attempt at formalizing the request for generality and validity of scientific statements. Danish Yearbook of Philosophy, 14, 58–94.
Rost, J. (2001). The growing family of Rasch models. In A. Boomsma, M. A. J. van Duijn, and T. A. B. Snijders (Eds.), Essays on item response theory (pp. 25–42). New York: Springer.
Rost, J. (1990). Rasch models in latent class analysis: An integration of two approaches to item analysis. Applied Psychological Measurement, 14, 271–282.
Smith, R. M., & Wind, S. A. (2018). Rasch measurement models: Interpreting Winsteps and Facets Output. Maple Grove, MN: JAM Press.
Smith, R. M. (2019). The Ties that Bind. Rasch Measurement Transactions, 32(1), 27–29.
Sokal, M. M. (1984). Approaches to the history of psychological testing. History of Education Quarterly, 24(3), 419–430.
Spearman, C. (1904a). “General intelligence,” objectively determined and measured. American Journal of Psychology, 15, 201–293.
Spearman, C. (1904b). The proof and measurement of association between two things. American Journal of Psychology, 15, 72–101.
Spearman, C. (1907). Demonstration of formulae for true measurement of correlation. American Journal of Psychology, 18, 160–169.
Spearman, C. (1910). Correlation calculated from faulty data. British Journal of Psychology, 3, 271–295.
Thurstone, L. L. (1925). A method of scaling psychological and educational tests. Journal of Educational Psychology, 16, 433–451.


Thurstone, L. L. (1926). The scoring of individual performance. Journal of Educational Psychology, 17, 446–457.
Thurstone, L. L. (1931). The reliability and validity of tests. Ann Arbor, MI: Edwards.
Traub, R. (1997). Classical test theory in historical perspective. Educational Measurement: Issues and Practice, 16(10), 8–13.
Van den Noortgate, W., De Boeck, P., & Meulders, M. (2003). Cross-classification multilevel logistic models in psychometrics. Journal of Educational and Behavioral Statistics, 28, 369–386.
van der Linden, W. J. (Ed.) (2016). Preface. Handbook of Item Response Theory, Volume Two: Models (pp. xviii–xix). Boca Raton, FL: CRC Press.
van der Linden, W. J., & Hambleton, R. K. (1997). Item Response Theory: Brief History, Common Models, and Extensions. In W. J. van der Linden and R. K. Hambleton (Eds.), Handbook of Modern Item Response Theory (pp. 1–28). New York: Springer.
Von Davier, M., & Carstensen, C. H. (Eds.) (2007). Multivariate and mixture distribution Rasch models: Extensions and applications. New York: Springer.
Wijsen, L. D., Borsboom, D., Cabaco, T., & Heiser, W. J. (2019). An academic genealogy of Psychometric Society presidents. Psychometrika, 84(2), 562–588.
Wilson, M. (2005). Constructing measures: An item response modeling approach (2nd edition). Mahwah, NJ: Erlbaum.
Wilson, M. (Ed.) (1992). Objective Measurement: Theory into Practice, Volume 1. Norwood, NJ: Ablex.
Wilson, M. (Ed.) (1994). Objective Measurement: Theory into Practice, Volume 2. Norwood, NJ: Ablex.
Wilson, M., & Engelhard, G. (Eds.). (2000). Objective Measurement: Theory into Practice, Volume 5. Stamford, CT: Ablex.
Wilson, M., & Fisher Jr, W. P. (Eds.) (2017). Psychological and Social Measurement: The Career and Contributions of Benjamin D. Wright. New York: Springer.
Wolfe, E. W. (2013). A bootstrap approach to evaluating person and item fit to the Rasch model. Journal of Applied Measurement, 14(1), 1–9.
Wolfe, E. W., & Smith, E. V. (2007). Instrument development tools and activities for measure validation using Rasch models: Part II–validation activities. Journal of Applied Measurement, 8(2), 204–234.
Wright, B. D. (1968). Sample-free test calibration and person measurement. In Proceedings of the 1967 Invitational Conference on Testing Problems (pp. 85–101). Princeton, NJ: Educational Testing Service.
Wright, B. D. (1980). Foreword and Afterword. In G. Rasch (1960/1980), Probabilistic models for some intelligence and attainment tests (pp. ix–xix, pp. 185–194). Copenhagen: Danish Institute for Educational Research. (Expanded edition, Chicago: University of Chicago Press, 1980).
Wright, B. D. (1997). A history of social science measurement. Educational Measurement: Issues and Practice, 16(4), 33–45.
Wright, B. D., & Masters, G. N. (1982). Rating scale analysis: Rasch measurement. Chicago: MESA Press.
Wright, B. D., & Stone, M. H. (1979). Best test design: Rasch measurement. Chicago: MESA Press.

INDEX

Abelson, A. R. 171 absolute error 208, 210, 213–14, 216, 221 absolute error variance 210, 216, 221 accommodations, 34–6 accountability, 31–4 accountability testing 136–7, 147–150 achievement level descriptor (ALD) 80 ACT 79, 142, 332 ACT, Inc. 142 adaptive testing 50, 57, 77, 247–8, 252–4, 303, 308 adequate yearly progress (AYP) 74 admissions testing 46, 142, 144–5, 147, 150 Algina, James 54–5, 218, 303 alignment 34, 56, 74, 76, 95, 104 American Council on Education (ACE) 47, 195 American Educational Research Association (AERA) 112, 114–16, 120, 150, 217 American Institutes for Research (AIR) 79 American Psychological Association (APA) 66, 150, 185 American Psychologist 52 Americans with Disabilities Act (ADA) 35 An Essay Towards Solving a Problem in the Doctrine of Chances 294 Anabaptists 3 analysis of variance (ANOVA) 162, 164–5, 207 Anastasi, Anne 191, 193 anchored calibration 249 Andrich, David 240, 349, 351, 355

Angoff, William 50, 58–9, 80, 182, 190, 278, 285, 319, 325–6, 328–37 annual measurable achievement outcome (AMAO) 94 anthropometric laboratory 160–1 Applications of Item Response Theory to Practical Testing Problems 247 Applied Measurement in Education 75 Army Alpha 44–46, 65–66, 70, 78, 123, 129, 137, 139, 319–20 Army Beta 65–66 ARPANET 3 Association of Black Psychologists 126 automated test assembly 247, 252 balanced incomplete block (BIB) design 79 Bateson, William 166 Bayes, Thomas 57, 173, 200, 239, 241, 243–6, 254, 292 Bayes’ theorem 293 belief networks 305 Bell Curve, The 136, 141–2 bias 31 Bible, The 3–5 BICAL 245 Big Test: The Secret History of the American Meritocracy, The 145 BIGSTEPS 245 bilingual 69–71, 87, 92–101 Bilingual Education Act 87, 92–4 bilingual/translated assessments 99, 101 BILOG 241


Binet, Alfred 18–9, 43–4, 66, 88, 121–3, 126–7, 137, 176 Binet-Simon scale 19, 88, 121 Bingham, Walter Van Dyke 66–7 Biometrika 170 Birnbaum, Alan xxiv, 235, 237, 243, 247, 249, 280 Board of Regents of the State of New York 16, 19 Bock, Darrell 233, 237, 239–41, 243–5, 254, 300 Boer War 169 Bootstrap 222 Boston Latin School (Latin Grammar School) 4 Brennan, Robert 55, 206, 216, 218, 222–3 Brigham, Carl 46–7, 138–9, 141–2 Brillinger, David 297 British Journal of Psychology 170–1 Broca, Paul 18 Brown v. Board of Education 92 Brown, William 170 Buck v. Bell 123 BUGS 306 Burt, Cyril 212 Bush, George H. W. 96 California Achievement Tests (CAT) 45 California Test Bureau (CTB) 45–6 Cambridge, Massachusetts 4 Cambridge, England 16, 43 Cambridge University 158, 165 Campbell, Donald 187 Campbell, Norman 268 Carnegie Foundation 47 Case Against the SAT, The 144–5 Cattell, James M. 67 Chauncey, Henry 47, 150 Civil Rights Act of 1964 92, 140 classical test theory (CTT) 157–80 Clauser, Brian 219, 222, 223 coaching 30, 143, 319 coefficient alpha (ɑ) 49, 157, 170, 175, 214 coefficient kappa (κ) 55 College Board 23–4, 26, 29–36, 47, 139, 142, 143, 145, 334 college entrance assessments (SAT and ACT), 22–34 Colton, Dean 219, 332, 334 Columbia University 43–4, 167 Committee on Scaling Theory and Methods 265

Committee on the Examination of Recruits 65, 67 Common Core State Standards 75 computer adaptive tests (CAT) 50, 77 computer-based accommodations 102 Conant, James Bryant 47 conditional standard error of measurement (CSEM) 213 construct equivalence 117–18, 120 construct underrepresentation 113–14, 192 Construct Validity in Psychological Tests 185 Construction and Use of Achievement Examinations, The 283 construct-irrelevant variance 98, 113–14, 116, 119, 192, 196 Cooperative Achievement Tests 321 corona virus pandemic 147 correlation xxiii, 44, 49–50, 53, 56, 161–2, 164–74, 183, 187, 191; and equating 321; and generalizability 216, 218; and individual differences 286; intraclass 212; item-test 324; matrices 346; partial 144; product-moment 233–4; residualized 246; and scaling 277 Council of Chief State School Officers (CCSSO) 32, 75 craniology 18 Crick, Joe 217 criterion referenced test (CRT) xxiii, 42–60, 73, 216, 276, 283, 303 criterion-referenced scale interpretation 45, 276 Cronbach, Lee xxiv, 50, 136, 139, 157, 172, 174–5, 182–201, 206–223, 299 Crossroads in the Mind of Man 173 Cureton, Edward 182–5, 200, 329, 333–4 D constant 237–8 D study 209–11 Darwin, Charles 18, 66, 123, 157–160 Darwin, George 159 Data Recognition Corporation (DRC) 45–6 data-model fit 245–6, 255 De Finetti, Bruno 295 De Finetti’s theorem 296–7, 300–1 Debra P. v. Turlington 74, 116 declamation (public declamation) 8


Demonstration of Formulae for True Measurement of Correlation 170 dependability 206, 216–17 derived measurement procedure 268 Dewar, James 161 Diana v. State Board of Education 126 Dickson, J. Hamilton 161 Die Normaltäuschungen in der Lagewahrnehmung 169 differential item functioning (DIF) 31, 117, 120, 246 differential test functioning 117 Doll, Edgar A. 66 Drapers’ Company Research Memoir 170 DuBois, W. E. B. 137 dunce caps 7 Dunster, Henry 4 Dynamic Learning Maps (DLM) 75 Ebel, Robert 51 Economic Opportunity Act of 1964 27 Edgeworth, Francis 162–5 educable mentally retarded (EMR) 127, 140 Education Commission of the States (ECS) 78, 145 Education for All Handicapped Children Act 71 Educational and Psychological Measurement 174, 233 Educational Consolidation and Improvement Act (ECIA) 72 Educational Measurement (Editions 1–4) 51, 81, 182, 187, 193, 195, 218, 281, 283, 285, 323, 328, 339 Educational Testing Service 29, 35, 47, 79, 142, 282, 336 efficiency movement xxii, 66 Eisenhower, Dwight D. 67–8 Elementary and Secondary Education Act (ESEA) 32, 69, 92, 147 Elements of Generalizability Theory 217 Embretson, Susan 190, 304 English as a second language (ESL) 93 English learners (EL) 34–6, 87–105 Enhanced Assessment Grants 75 Equal Opportunity Act 93 equating xxiv, 24, 50–1, 80, 220, 247–9, 256, 281, 283, 318–40 equating error 331 error/tolerance 210, 221 error variance 210, 213, 216, 221, 247, 251, 253, 330, 345

eugenics 122–3, 142, 158, 165–7, 171, 173, 176, 277 Eugenics Committee of the United States of America 173 Every Student Succeeds Act (ESSA) xxiii, 32, 67, 76–7, 95, 147 evidence centered design 220 exam, oral 13, 91 exam, written xxiii, 8, 13, 16, 18, 129, 151 exchangeability 296, 300–2, 307–8 expectation maximization algorithm 244 expert systems 305 extended time 34–5, 102 facets model 351, 354 factor analysis xxiv, 173, 185, 190, 246, 296, 346 fairness 26, 111–31, 143, 150–1, 181, 184, 190, 195, 201 Fechner, Gustav 176, 264, 270, 286 Fechner’s Law 270, 273 Feldt, Leonard 332 fingerprints 158 Fisher, R. A. xxiv, 164, 166 fixed facets 208 Flanagan, J. C. 51 frequentist 293, 295, 298, 300–1, 304–8 fundamental measurement procedure 267–8, 280 G. I. Bill 27 G study 209–11, 223 Gao, Xiaohong 219, 222 Galton, Francis 43, 158–62, 264, 274–8, 286 “General Intelligence” Objectively Determined and Measured 169 generalizability theory 157, 162, 164, 175, 184, 206–24 Generalizability Theory: A Primer 218 generalized partial credit model 240 GENOVA 217, 220–1 Gibbs sampling 306 Glaser, Robert 51–2 Gleser, Goldine 206, 211, 299 Goals 2000 73, 96 Goddard, Henry 43, 65–7, 122–3 Golden Rule Settlement 127 graded response model 238 Graduate Management Admissions Council (GMAC) 29 Graduate Record Examination (GRE) 49–50 graphical models 305


Griggs v. Duke Power Co. 127 Guilford, J. P. 50 Gulliksen, Harold 157, 233, 265 Guttman, Louis 280 Haertel, Edward 56 Haines, Thomas H. 66 Hambleton, Ronald 56 Hammock, Joseph 51 handicapped 69–71 Handbook of Modern Item Response Theory 241 Harvard Educational Review 139, 144 Harvard University 4, 9, 47, 139, 144, 172–3, 217 Hawkins-Stafford Act 73 Henri, Victor 43 Hereditary Genius 159 Herrnstein, Richard 141 higher-order thinking skills (HOTS) 73–4, 147 Hiskey-Nebraska Test of Learning Aptitude 48 Hitler, Adolf 167 Hooker, Joseph 158, 160 hornbook 5 Horst, Paul 265 Houghton Mifflin 46 Howe, Samuel Gridley 9, 18 Huguenots 4 Human Resources Research Organization (HumRRO) 79 Husek, Theodore R. 52, 54–5 hyperparameters 301 Improving America’s Schools Act (IASA) 73 Individuals with Disabilities Education Act (IDEA) 74–6, 98 information function 247–53 Inquiries into Human Faculties 276 intelligence quotient (IQ) 43–4 intelligence testing 18, 82, 87–91, 94, 97, 103, 121–3, 127, 130, 136–9, 150 intelligent tutoring system 304–5 International Health Exhibition 160 Interpretation of Educational Measurements 173 interpretive argument 50, 198 interval scale 269, 277 invariance 118, 120, 236, 245–6, 255–6, 278–80, 284, 286, 323, 344, 348–50, 356 inverse probability 294–5, 308 Iowa Academic Meet 27 Iowa Test of Basic Skills (ITBS) 91, 282

Iowa Test of Educational Development (ITED) 27, 282 item banking 248–9 item characteristic curve (ICC) 233 item characteristic curve theory 233 item response theory (IRT) xxiv, 49, 59, 79, 157, 175, 223, 232–56, 280–1, 285, 296, 319, 332, 343, 346 jackknife 217, 222 Jarjoura, D. 219, 221, 239 Jeffreys, Harold 295 Jensen, Arthur 136, 139, 141 Johnson, Lyndon B. 69 joint maximum likelihood estimation (JMLE) 241–2 Journal of Educational Measurement 233, 334 just noticeable difference (JND) 271 Kane, Michael 50, 216 Karlin, John 265 Kelley, Truman 112–13, 298 Kelley’s formula 298–9, 301–2 Kennedy, John F. 68–9 Kennedy, Robert F. 69 Keppel, Francis 79 Kolen, Michael 221, 281 Kuder, G. Frederic xxiv, 173–5, 212 Kuder-Richardson Formula 20 (KR-20) 174 Kuder-Richardson Formula 21 (KR-21) 174 Laplace, Pierre-Simon 294, 297 Larry P. v. Riles 89, 127, 139–40 latent distance model 280 latent trait theory 233 Latin Grammar School 4 Lau remedies 93 Lau v. Nichols 69, 93 law of comparative judgment 23–4 Lawley, Derrick 233–4 Lazarsfeld, Paul xxiv, 265 Leadership Conference on Civil and Human Rights, The 148 learning disabilities (LD) 34, 71, 88, 103–4 Lee, J. Murray 45, 62 limited English proficiency 70, 74, 89, 93, 95, 107, 108 Lindquist, Everett F. 27, 40, 46, 205, 212, 229, 266, 278, 282–5, 287n14, 288, 289, 323, 332, 333, 341, 342 linguistic modification 99–102, 109


linking 79, 248, 249, 250G, 256n7, 259, 261, 281, 289, 319–21, 323, 324, 238, 333, 338, 341, 342 linking function 248, 320, 321 Linn, Robert 80, 81, 90, 117, 118, 130, 136, 145, 197, 219, 220, 339 Lippmann, Walter 138, 153, 176, 179 literacy tests 124, 125, 128, 132, 133 Livingston, Samuel 55, 57, 62, 64 local education agency (LEA) 70, 109 Locke, John 7 Lombroso, Cesare 18 Loevinger, Jane 187–9, 200, 204, 343, 346, 358 LOGIST 245, 261, 262 Lord, Frederic M. xxiv, 79, 80, 157, 174, 175, 184, 185, 213, 214, 221, 233–5, 237, 241–3, 245, 247–9, 251–4, 256n2, 256n3, 256n5, 257, 264, 265, 266, 280, 281, 284, 285, 300, 303, 319, 325–31, 334–7 magic lantern 3 Magnusson, Warren 48, 62 Mann, Horace xxiii, 8–10, 13, 18–20 Margenau, Henry 265 marginal maximum likelihood estimation 241, 243, 244, 257, 261, 312 Markov chain Monte Carlo 244, 292, 306, 314–16 Massachusetts State Board of Education 9 Masters, Geoffrey 236, 240, 242, 246, 260, 349, 351, 353, 354, 358, 360 master’s whip 7 maximum likelihood estimation 180, 234, 237, 241, 243, 257, 258, 259, 289, 300, 314, 326, 350 measurement by fiat 268–70 Meehl, Paul 50, 61, 185, 186, 188, 190, 191, 193, 194, 200, 203, 268, 290 Mémoire sur la Probabilité des Causes par les Evènements 294, 314 Mendel, Gregor 160, 166, 179 Messick, Samuel xxiv, 30, 37n7, 50, 79, 113, 141, 184, 186, 188–97, 200, 201, 205, 346 method of constant stimulus 271 mGENOVA 221, 225 Millman, Jason 20, 53, 55, 57, 63, 217 Mislevy, Robert 80, 172, 176, 176n1, 200, 201, 241, 245, 254, 293, 297, 299, 302, 303–5, 307–11, 311n4, 311n7, 339 missing data 81, 305, 315, 316

Morton, Samuel 18 Moss, Pamela 192, 193, 196, 197, 205 Mosteller, Frederick 265 MSCALE 245, 262 multidimensional random coefficients multinomial logits model 354, 357 MULTILOG 245, 261 multinomial logit model 239, 245, 257 multiple-choice model 240 multistage testing 248, 252–5, 259 multitrait-multimethod analyses 187, 202 Muraki, Eiji 223, 224, 240, 244, 257, 260 Nader, Ralph 144, 153 Nairn, Allan 144, 153 Nanda, Harinder 175,177, 203, 206, 215, 227 Nation at Risk, A 73, 86 National Academy of Sciences 135 National Advisory Council 69 National Assessment Governing Board (NAGB) xxv, 72, 73, 80, 81, 84, 85, 86, 225 National Assessment of Educational Progress (NAEP) 40, 72, 73, 75, 77–84, 87, 97, 99, 104, 107, 108, 305, 314, 315 National Association for the Advancement of Colored People (NAACP) 145, 148, 151 National Center and State Collaborative (NCSC) 75 National Center for Education Statistics (NCES) 32, 40, 80, 82, 87, 88, 99, 103, 108, 109, 144, 153, 225 National Center for Fair & Open Testing 146 National Computer Systems (NCS) 79 National Council of La Raza 106, 148 National Council of Teachers of Mathematics (NCTM) 96, 108 National Council on Measurement in Education (NCME) xxvi, xxvii, 38, 83, 112, 114, 120, 131, 150, 151, 187, 188, 191, 192, 199, 202, 217, 218, 318, 319, 337, 338, 340, 346, 357 National Defense Education Act (NDEA) 27, 40, 68, 69, 85 National Disability Rights Network 148 National Institute of Education 72 National Intelligence Tests 67, 86, 137–9, 150, 154 National Research Council (NRC) 66, 67, 86, 137, 140, 145 National Urban League 148


Nation's Report Card xxv, 78, 81, 82, 86, 97, 108 Natural Inheritance 161, 166, 178, 289 New England Primer 5, 6 New Republic, The 138, 153, 154, 179 New York Regents Exam 16–18 New York Times 10, 20, 39, 40, 139, 153 Newton-Raphson estimator 241, 242, 244 Neyman, Jerzy 243, 260, 295, 359 Nitko, Anthony 53, 54, 60, 61, 63 No Child Left Behind (NCLB) 31, 40, 45, 74, 85, 94, 99, 105, 106, 110, 147 nominal response model 239 nomothetic span 190, 203 None of the Above: Behind the Myth of Scholastic Aptitude 84, 145, 153 nonequivalent anchor test design 322 norm 42–53, 56, 60, 73, 91, 97, 285 norm referenced test (NRT) 42–4, 47, 48 normal curve 48 normal distribution 48, 51, 272–7, 286, 287n8 normal ogive 276, normal-ogive model 233, 234, 235, 237, 238, 238, 280, 311, 317 normative scale interpretations 276 nose pinchers 7 norming 46, 48, 91, 97, 281, 289, 290, 337, 342 Novick, Melvin 55, 57, 80, 175, 184, 214, 237, 241, 285, 293, 298, 299, 301–3, 309, 311n5 objective measurement 236, 247, 348, 351, 353, 357, 358, 360, observed score equating 335 Old Deluder Satan Act 4 On Statistics by Intercomparison 275 On the Origin of Species 83, 158, 176n2, 178 opportunity to learn (OTL) 74, 115, 116 opt out 37, 38,72, 148, 151 ordinal scale 48, 269, Organization for Economic Cooperation and Development (OECD) 354, 359 Otis, Arthur 44, 67, 84, 320, 321, 323, 328, 342 Oxford 13, 162, partial credit model 240, 354 Partnership for Assessment of Readiness for College and Career (PARCC) 33, 75–7 Peabody Picture Vocabulary Test 48

Pearson, Karl 50, 162, 165–73, 176, 177n5, 179, 180, 266, 297, 316 Peckham, Robert 140 Penn, William 7, 15 percentile 276, 321 percentile rank 47, 55, 328 phrenology 18, 21 Planning the Objective Test 323, 342 Popham, William James 52, 54–7, 61, 63, 197, 205, 315 predictive bias 117 presmoothing 334, 341 Price, Richard 294 Princeton University 46, 138, 233 prior distribution 141, 257, 293, 294, 296–8, 301, 311, 312 Probabilistic Models for Some Intelligence and Attainment Tests 236, 280, 343, 353 probit model 234, 241 proficiency 22, 32, 35, 40, 70, 72, 73, 80, 89, 90, 93–6, 98, 101–5, 107, 108, 117, 163, 164, 168, 248, 250, 253, 254, 291, 295, 305, 307 Programme for International Student Assessment (PISA) 354, 359 Proof and Measurement of Association between Two Things, The 169, 179, 359 Psychological Corporation 67, 86 Psychological Review 168 Psychometrika 174, 233, 234, 295 psychophysics xxiv, 178, 264, 270, 273, 274, 284, 286, 288, 290 Public Law 39–73 (PL 39–73) 78 Public Law 85–864 (PL 85–864) 68 Public Law 89–10 (PL 89–10) 65, 69 Public Law 89–750 (PL 89–750) 69, 71 Public Law 90–247 (PL 94–247) 69 Public Law 94–142 (PL 94–142) 71 Public Law 95–561 (PL 95–561) 71 Public Law 97–35 (PL 97–35) 72–73 Public Law 100–297 (PL 100–297) 73, 80 Public Law 103–227 (PL 103–227) 73 Public Law 103–382 (PL 103–382) 73 Public Law 107–110 (PL 107–110) 74, 77 Public Law 110–195 (PL 114–195) 76, 77 Public Law 114–195 (PL 114–195) 76–77 Puritans 3, 5 Quakers 3 quantitative judgments 274 Quetelet, Adolphe 275


Rajaratnam, Nageswari 175, 177, 203, 205, 215, 227, 228, 230 random coefficients multinomial logits model 354, 357 random effects 168, 208, 211, 212, 345, 354 random groups design 322, 325–7, 329–32 randomly parallel 175, 209, 214 Rasch measurement theory 343, 344, 344, 345, 346, 347, 349–51, 351, 352, 253, 354–8 Rasch model 235, 236, 240–3, 245, 246, 280, 287n9, 287n11, 334, 344, 346, 349, 350, 351, 351, 352, 352, 353, 354, 355 Rasch, Georg xxiv, 235, 236, 240, 241, 266, 280, 343, 345, 347–9, 351, 353, 354–6 rating-scale model 240, 257 ratio scale 269, 273 Raven's Coloured Progressive Matrices 90, 108 Raven's Intelligence Test 89 Ravitch, Diane 136, 148, 153 Ray, Rolland 332, 333 read-aloud 102, 110 Reagan, Ronald 72 recitation xxiii, 5, 7, 9, 17–8 Record of Family Faculties 160, 161, 178 regression 160–2, 166, 172, 179, 180, 234, 299, 302, 313, 320–3, 327, 328, 330, 331, 333, 357 Rehabilitation Act of 1973 34, 40 Reign of ETS: The Corporation that Makes Up Minds, The 144, 153 relative error 208, 210, 213, 214, 216 relative error variance 210, 216 reliability 44, 49, 50, 54, 55, 57, 60, 64, 97, 115, 117, 168–75, 177–80, 183, 185, 200, 202, 206, 210–16, 218, 224–9, 231, 247, 282, 298, 299, 301, 303, 313, 315, 319, 323, 324, 328, 337, 340–2, 345, 360 Reliability and Validity of Tests, The 64, 73, 180, 360 reliability, split-half 170, 174, 175 reliability, test-retest 55, 170, 174, 229 reproductive rights 123 Research Triangle Institute 79 response model scaling 265 Revolutionary War 4, 15 Rice, Joseph M. 44, 63 Richardson, Marion xxiv, 157, 173–5, 179, 212, 229 RMC Research Corporation 70, 84 Roid, Gale H. 56, 63

Rousseau, Jean-Jacques 7 Rubin, Donald 40, 241, 244, 258, 302, 305, 316, 319, 336, 340–2 Samejima, Fumiko 238–40, 243, 247, 251, 260, 300, 316 SAT 22–4, 25, 26–36, 38–41, 46, 47, 51, 135, 136, 139, 142–7, 151n3, 151–4, 325, 326, 234, 340 scaling xxiv, 50, 176, 233, 236, 238, 244, 250, 258, 259, 261–5, 266, 267, 269–71, 273–86, 286n2, 287n6, 287n19, 287n11, 287n17, 288–91, 337–9, 341, 342, 345, 345, 346, 356, 358, 359 school exhibitions 8–9 schoolmasters (school masters) 6–7, 9 score comparability 337 score distribution 49, 237, 333 Shavelson, Richard 217–20, 224 Simon, Theodore 19, 43, 61, 83, 132, 138, 176, 177 single group design 322, 326, 329 Smarter Balanced Assessment Consortium (SBAC) 33, 75, 105 Smith v. Regents of the University of California 146 smoothing 321, 325, 329, 331–4, 338–41 socioeconomic status (SES) 91, 97, 105, 107, 108, 143, 144 Spearman, Charles xxiii, xxiv, 157, 162, 167–73, 176, 177n7, 177n9, 178–80, 182, 205, 346, 359 Spearman-Brown formula 170, 172, 174, 178, 180 specific objectivity 236, 344, 347, 349, 350, 355, 356, 359 Sputnik I 68, 77 standard setting 61–3, 73, 80, 83, 86, 216, 222, 223, 225, 226, 231 standardized examination xxiii, 9, 13, 19, 44–5, 50–1, 68, 91, 112, 114, 117, 126, 129–30, 136–7, 139–40, 144–8, 151 Standards for Educational and Psychological Testing Stanford Achievement Tests 46 Stanford-Binet 43, 44, 46, 88, 122, 126, 137, 139, 290 Starch, Daniel 45, 63 state education agency (SEA) 36, 71 Statistical Methods for Research Workers 178, 227, 294, 313 Statistical Theories of Mental Test Scores 83, 179, 204, 229, 237, 257, 259, 288, 289


Steinberg, Lynne 200, 205, 233, 240, 245, 261, 311
Stern, William 43, 63
Stigler, Stephen 159, 162, 165, 180, 271, 277, 290, 294, 314, 317
stimulus-based scaling 265
Straight Talk About Mental Tests 139, 153
stratified alpha 215, 218
students with disabilities 22, 34, 35, 74–6, 98, 105–7, 109, 117, 150
subject-based scaling 265
Subkoviak, Michael 55, 64
sufficient statistic 243, 246
SUNY Stony Brook 215
Swaminathan, Hariharan 54, 55, 62, 64, 233, 244, 258–61, 303, 309, 317
sweet peas 160, 176
Syms, Benjamin 4, 19
Syms-Eaton Academy 4, 19
T score 51
table of content specifications 218
Taylor, Frederick W. 66, 84
Terman, Lewis 43, 44, 46, 62, 64, 66, 67, 83, 84, 88, 121–3, 134, 137, 138, 153, 154, 166, 278, 282, 290
TerraNova 46, 61
test development 31, 45, 105, 119, 130, 252, 259, 287n11, 318, 321–4, 338, 339
Test Equating, Scaling and Linking 259, 281, 289, 341
test information function 247, 251
Test of English as a Foreign Language (TOEFL) 29, 35, 202
test preparation 16, 22, 28, 30, 37, 143, 145, 148, 149
test security 28–30
testlet model 239
Texas Instruments 3
Théorie analytique des probabilités 294, 314
Theory and Methods of Scaling 266, 291
Theory of Mental Tests 173, 178, 203, 228, 289, 341, 358
Theory of Probability 295, 313
Thissen, David 233, 240, 244–6, 257, 258, 260, 261, 281, 289
Thorndike, Edward 44, 45, 64, 66, 84, 137, 168, 172, 205, 258, 277, 282, 287n5, 290, 320, 321, 323, 328, 342
Thorndike, Robert 60–2, 83, 202, 288, 323–5, 328, 338, 340, 342
Thurstone, Louis Leon 49, 50, 64, 180, 233, 258, 261, 264, 265, 255, 273, 274, 277, 278, 279, 286, 287n5, 291, 299, 317, 346, 349, 359, 360
Title I Evaluation and Reporting System (TIERS) 70, 84
Title I Technical Assistance Centers 77, 78, 82
Torgerson, Warren 264, 265, 266, 267, 268, 273, 280, 281, 285, 287n6, 287n10, 291
trial state assessments 73
true score equating 335
truth in testing 22, 28–30, 37n5, 39, 40, 145
Tucker, Ledyard 233, 261, 266, 280, 285, 291, 326, 327, 330, 331, 337, 341
Tukey, John 212, 226, 266, 268, 280, 285, 290, 297, 329, 333, 334, 340, 350, 358
Tyler, Ralph 79
U. S. Department of Education (USED) 32, 35, 78, 84, 87, 92, 94, 96, 100, 103, 109, 147, 154
U. S. Department of Health, Education, and Welfare (HEW) 67, 68, 84
US News & World Report 139
U. S. Office of Education (USOE) 67, 68, 70, 77, 78, 79, 82
unidimensionality 214, 250
universe of admissible observations 208, 209, 211, 219
universe of generalization (UG) 209, 210, 214, 218
University College London 165, 166, 169, 171, 179
University of California 139, 143, 145–7
University of Chicago 147, 212, 236, 245, 256n1, 351, 355
University of Illinois 151, 216
University of Iowa 45, 282, 332, 339
University of Leipzig 42, 270
University of North Carolina-Chapel Hill 239
Upward Bound 27
urGENOVA 221, 224, 225
validation 50, 56, 60, 62, 83, 92, 114, 118, 181, 183, 185–94, 196–205, 289, 359, 360
validity xxiv, xxv, 28, 31, 34, 36, 37, 39–43, 50, 52, 56, 61–4, 76, 87, 88, 97, 99–101, 105, 110, 113, 115–20, 132–5, 150, 152, 173, 180–206, 212, 218, 228, 229, 265, 261, 274, 282, 283, 287n15, 289, 313, 346, 359, 360

validity, argument-based framework 37, 40, 50, 181, 197–204, 287n15, 289
validity, concurrent 188
validity, consequential 63, 181, 184, 189, 194, 195–9, 201, 197, 204, 205
validity, construct 50, 61, 62, 181, 185–98, 200–3, 205
validity, content 52, 187, 188, 203, 205
validity, criterion 50, 118, 181–8, 190, 191, 193–6, 199–201, 216
validity, predictive 31, 36, 40, 60, 76, 117–20, 133, 134, 152
validity, unified model 181, 186–8, 190–5, 197, 199–201
value-added model (VAM) 149–51
van der Linden, Wim 233, 241, 247, 248, 252, 253, 257, 258, 260, 261, 304, 313, 317, 343, 360
variance components 175, 208–12, 215–17, 221, 223, 224, 227, 228, 230, 245
Varsity Blues scandal 147
Vineland Training School 122
Volkmann, John 265
Voting Rights Act 124, 134
waiver 32
Wald, Abraham 295
Wallace, Alfred 157, 158
Wallace, David 297
Webb, Noreen 217, 218, 220, 224
Weber, Ernst 264, 270, 176n2, 180
Wechsler Intelligence Scale for Children (WISC) 126
Wechsler, David 44, 64, 67, 86, 142
Wells, Frederick L. 66
Westat 79
Whipple, G. M. 44, 64, 66, 67, 86, 137, 154
whispering stick 7
WINSTEPS 243, 245, 259, 262, 353, 355, 359
Wissler, Clark 43, 64, 167, 168, 177n6, 180
Wood, Benjamin D. 45, 64, 172, 180
World Book 45, 67
Wright, Benjamin 236, 240, 242, 245, 246, 248, 260, 262, 281, 291, 334, 342–4, 347–51, 353, 355, 357, 359, 360
Wundt, Wilhelm 42, 43, 69
Yen, Wendy 81, 86, 246, 262
Yen’s Q1 246
Yen’s Q3 246
Yerkes, Robert M. 44, 64–6, 82, 84, 122, 123, 128, 129, 135, 137, 319, 342
Yule, G. Udny 168, 180