VALIDITY

Validity is a clear, substantive introduction to the two most fundamental aspects of defensible testing practice: understanding test score meaning and justifying test score use. Driven by evidence-based and consensus-grounded measurement theory, principles, and terminology, this book addresses the most common questions of applied validation, the quality of test information, and the usefulness of test results. Concise yet comprehensive, this volume’s integrated framework is ideal for graduate courses on assessment, testing, psychometrics, and research methods as well as for credentialing organizations, licensure and certification entities, education agencies, and test publishers.

Gregory J. Cizek is Guy B. Phillips Distinguished Professor of Educational Measurement and Evaluation at the University of North Carolina at Chapel Hill, USA.
VALIDITY An Integrated Approach to Test Score Meaning and Use Gregory J. Cizek
First published 2020 by Routledge 52 Vanderbilt Avenue, New York, NY 10017 and by Routledge 2 Park Square, Milton Park, Abingdon, Oxon, OX14 4RN Routledge is an imprint of the Taylor & Francis Group, an informa business © 2020 Taylor & Francis The right of Gregory J. Cizek to be identified as author of this work has been asserted by him in accordance with sections 77 and 78 of the Copyright, Designs and Patents Act 1988. All rights reserved. No part of this book may be reprinted or reproduced or utilised in any form or by any electronic, mechanical, or other means, now known or hereafter invented, including photocopying and recording, or in any information storage or retrieval system, without permission in writing from the publishers. Trademark notice: Product or corporate names may be trademarks or registered trademarks, and are used only for identification and explanation without intent to infringe. Library of Congress Cataloging-in-Publication Data A catalog record for this title has been requested ISBN: 978-0-367-26137-5 (hbk) ISBN: 978-0-367-26138-2 (pbk) ISBN: 978-0-429-29166-1 (ebk) Typeset in Baskerville by Newgen Publishing UK
TO STEPHEN FRANCIS GREGORY CIZEK (JULY 7, 1987 – OCTOBER 19, 2019). YOU ARE MY TREASURED SON.
COPYRIGHT ACKNOWLEDGMENT
Just Can’t Get Enough Words and Music by Will Adams, Allan Pineda, Jaime Gomez, Stacy Ferguson, Jabbar Stevens, Julie Frost, Thomas Brown, Joshua Alvarez, Rodney Jerkins and Stephen Shadowen Copyright © 2010 BMG Sapphire Songs, i.am.composing, llc, BMG Platinum Songs US, apl.de.ap.publishing llc, BMG Gold Songs, Headphone Junkie Publishing LLC, EMI April Music Inc., Kid Ego, Darkchild Songs, Totally Famous Music, TBHits, Tuneclique Music, Native Boy Music, The Publishing Designee Of Stephen Shadowen and Rodney Jerkins Productions, Inc. All Rights for BMG Sapphire Songs, i.am.composing, llc, BMG Platinum Songs US, apl.de.ap.publishing llc, BMG Gold Songs and Headphone Junkie Publishing LLC Administered by BMG Rights Management (US) LLC All Rights for EMI April Music Inc., Kid Ego, Darkchild Songs, Totally Famous Music and TBHits Administered by Sony/ATV Music Publishing LLC, 424 Church Street, Suite 1200, Nashville, TN 37219 All Rights Reserved. Used by Permission. Reprinted by Permission of Hal Leonard LLC
CONTENTS
Preface

1 Introduction
   Foundational Measurement Concepts Underlying Validity
   What Is a Test?
   Inference: Always Required, Sometimes Risky
   Constructs: The Objects of Social Science Research and Development
   Tests, Inferences, and Constructs
   Validity Defined
   Validity Concerns Intended Score Meaning
   Validity Is a Property
   Validity Does Not Define a Construct
   Validity of Intended Inferences and Attention to Test Score Use
   Validity and Validation
   Why Is Validity So Important?
   Summary

2 Validity: The Consensus
   The Long Tradition of Professional Standards Related to Validity
   Areas of Consensus in Modern Validity Theory
   Principle 1: Validity Pertains to Test Score Inferences
   Principle 2: Validity Is Not a Characteristic of an Instrument
   Principle 3: Validity Is a Unitary Concept
   Principle 4: Validity Is a Matter of Degree
   Principle 5: Validation Involves Gathering and Evaluating Evidence Bearing on Intended Test Score Inferences
   Principle 6: Validation Is an Ongoing Endeavor
   Sources of Evidence for the Meaning of Test Scores
   Evidence Based on Test Content
   Evidence Based on Internal Structure
   Evidence Based on Response Processes
   Evidence Based on Relationships to Other Variables
   Conclusions

3 Validity and the Controversy of Consequences
   Roots of the Problem of Consequences in Validity Theory
   What Is Consequential Validity Anyway?
   Three Conceptual Problems with Consequential Validity
   The Definitional Flaw
   The Temporal Flaw
   The Causal Flaw
   Three Practical Problems with Consequential Validity
   The Problem of Delimitation
   The Problem of Practice
   The Problem of Location
   The Most Fundamental Problem with Consequential Validity
   Can Consequences of Test Use Ever Provide Validity Evidence?
   Conclusions and the Harm in the Status Quo

4 A Comprehensive Framework for Defensible Testing Part I: Validating the Intended Meaning of Test Scores
   A Comprehensive Framework for Defensible Testing
   Foundations of Validating Intended Test Score Inferences
   Validity and the Validation Process
   Major Threats to Test Score Meaning
   Construct-Relevant Variation and Construct-Irrelevant Variation
   Construct Misspecification and Construct Underrepresentation
   Reconsidering Sources of Validity Evidence
   Relationships among Variables
   Test Development and Administration Procedures
   A Revised Menu of Sources
   Summary and Conclusions

5 A Comprehensive Framework for Defensible Testing Part II: Justifying the Intended Uses of Test Scores
   Purposes of Testing
   Justification of Test Score Use
   A Foundation for Justifications of Intended Test Score Use
   Sources of Evidence for Justifying an Intended Test Score Use
   Evidence Based on Consequences of Testing
   Evidence Based on Costs of Testing
   Evidence Based on Alternatives to Testing
   Evidence Based on Fairness in Testing
   Summary of Sources of Evidence for Justification of Test Use
   Comparing Validation and Justification
   Dimension 1: Rationale
   Dimension 2: Timing
   Dimension 3: Focus
   Dimensions 4 and 5: Traditions and Warrants
   Dimension 6: Duration
   Dimension 7: Responsibility
   Critical Commonalities
   Some Common Threats to Confident Use of Test Scores
   The Comprehensive Framework for Defensible Testing
   A Note on Consequences of Test Use as a Source of Validity Evidence
   Conclusions

6 How Much Is Enough?
   How Much Evidence is Enough to Support an Intended Test Score Meaning?
   The Purposes of Testing
   Quantity vs. Quantity
   Quantity vs. Quality
   Resources
   Burden
   Conclusions about Validation
   How Much Evidence is Enough to Support an Intended Test Use?
   Validity Evidence
   Resources and Burden
   Input
   Need
   Consequences of Use
   Commonalities and Conclusions

7 Conclusions and Future Directions
   A Comprehensive Approach to Defensible Testing
   The Benefits of a Comprehensive Approach
   Future Research and Development in Validity Theory and Practice
   Justification of Test Use
   Validity and Classroom Assessment
   Group vs. Individual Validity
   Validation, Justification, and Theories of Action
   Toward a Truly Comprehensive Framework

References
Index
PREFACE
My validity journey has been a long one. I first encountered the notion of validity—and the high esteem it was afforded—nearly 40 years ago. I was an undergraduate student taking my last elective course in my final semester of preparation to become an elementary school teacher. The instructor was Professor Robert Ebel, whose stature in the field of measurement I did not apprehend at the time. After spending several years as a fourth grade teacher, my interests focused more clearly in the area of assessment, and I pursued a graduate degree in that area. Eventually, I came to see psychometrics, at its core, as a field concerned with data quality control. That is, the fundamental aim of psychometricians was to help ensure that the data collected via any measurement procedure could be trusted as dependable, and that the scores produced would have the meaning they were intended (or assumed) to convey to users of information yielded by tests. I have found that aim to be consistent across the diverse contexts I experienced throughout my career: elementary school classroom assessments, local school board academic program decision making, high-stakes licensure and certification examinations, guidance tools used by school counselors, and statewide student achievement testing programs. Having now spent nearly 30 years in a university setting, I often have the privilege of directing students’ dissertation research projects. In those contexts and in the courses I teach, I try to pass along the high value and importance of validity, encouraging my colleagues to always interrogate “whether the data they obtain so cleverly and analyze so complexly are any good in the first place” (Cone & Foster, 1991, p. 653). Over the course of those decades, however, what seemed to be so clear in importance seemed to be so muddied in understanding—even among specialists in the field of measurement. In my own scholarly work in the area, I first became concerned by what I perceived to be a flaw in validity theory regarding the place of consequences of testing. I was not the first to notice that error; indeed, I found that many others (e.g., Mehrens, 1997; Popham, 1997) had pointed out the error much earlier. The first steps in my validity journey focused somewhat narrowly on the error of including consequences of testing as a source of evidence supporting the validity of test scores. That initial focus on consequences, however, proved to be fortuitous because it began to illuminate broader concerns. I came to realize that the problem of how to deal with the (appropriate) concern about consequences of testing was a comparatively minor aspect of
two much more serious problems. The first problem was that incompatible aspects of testing (test score meaning and test score use) had been incorporated into a single concept—validity—where they had a predictably unsettled coexistence. The second concern was the lack of a comprehensive framework for defensible testing that clearly differentiated the two most important questions in educational and psychological measurement: (1) “What is the evidence that this test score has the meaning it is intended to have?”; and (2) “What is the evidence that this test score should be used as it is intended to be used?” As regards this second concern, it is worth noting that whereas professionally accepted evidentiary sources, procedures, and best practices for answering the first question (i.e., evaluating the intended meaning of test scores) have existed for more than a half-century, no similarly mature traditions or proffered guidelines existed for answering the second question. This state of affairs (i.e., the fundamental incompatibility of combining evidence regarding test score meaning and test score use into a single concept, and the lack of a comprehensive framework for dealing with those concerns) has, I believe, frequently—though understandably—contributed to anemic validation efforts, and to often weakly supported or indefensible testing policy initiatives. Regarding the lack of alacrity in validation efforts, it was perhaps predictable from the outset that the fundamental question of how—or even whether—to incorporate consequences into validity investigations and the more general failure to recognize the distinct evidentiary sources and methods for investigating score meaning and score use would prove to be a nettlesome and lingering concern. It was Popham who first observed that “cram[ming] social consequences where they don’t go—namely, in determining whether a test-based inference about an examinee’s status is valid” (1997, p. 9)—was ill-conceived. The certain effect on validation practice was aptly summarized by Borsboom, Mellenbergh, and van Heerden, who noted that: Validity theory has gradually come to treat every important test-related issue as relevant to the validity concept … In doing so, however, the theory fails to serve either the theoretically oriented psychologist or the practically inclined tester … A theory of validity that leaves one with the feeling that every single concern about psychological testing is relevant, important, and should be addressed in psychological testing cannot offer a sense of direction. (2004, p. 1061) As my journey continued, a feature of extant constructions of validity theory itself became clear. A concept as essential as validity should have been articulated in the most widely accessible manner. Instead, however, it was often explicated in an unnecessarily abstruse
fashion, tortured in never-resolved philosophical ruminations regarding the nature of phenomena and knowing in psychological assessment, with the result that validation devolved into an arcane academic endeavor contemplated by connoisseurs rather than an essential exercise engaged in with enthusiasm by practitioners. In his own chapter on validation, Kane has critiqued validity theory as being “quite abstract” (2006a, p. 17). Perhaps the clearest example of both problems—conflation of meaning and use, and inaccessible formulations—can be found in the otherwise excellent work of Samuel Messick. His tour de force on validity is a 91-page chapter on the topic found in the most respected professional reference work in modern testing: Educational Measurement, third edition (Linn, 1989). In his chapter, Messick makes the impossible demand that all sources of evidence—that is, evidence regarding score meaning and evidence regarding test score use—must be synthesized and evaluated as a whole, arriving at a singular conclusion about validity. This is seen clearly in Messick’s very definition of validity, which states: Validity is an integrated evaluative judgment of the degree to which empirical evidence and theoretical rationales support the adequacy and appropriateness of inferences and actions based on test scores or other modes of assessment. (1989, p. 13) The first problem that flows from this construction is that coming to a single, evaluative judgment based on integration of evidence regarding meaning (i.e., inferences) and use (i.e., actions) is impossible. Regarding the explication of validity itself, in the same chapter, Messick devotes 13 pages of his introduction to validity to an excursion into “philosophical conceits.” Had any of the conscientious testing specialists envisioned by Borsboom et al. dipped a ladle into Messick’s chapter hoping for practical guidance on how to engage in good-faith efforts to support their endeavors, they almost certainly would have left thirsty. The predictable consequence of the noticeable contrast between the academic scholarship on validity proffered by theorists and the actual work being done by testing specialists to provide support for the meaning of test scores is unmistakable. As Brennan has observed, “validity theory is rich, but the practice of validation is often impoverished” (2006, p. 8). Borsboom, Mellenbergh, and van Heerden (2004) have also noted that “the concept that validity theorists are concerned with seems strangely divorced from the concept that working researchers have in mind when posing the question of validity” (p. 1061). Validity theory and practice have never recovered. Not only has validity theory diverged from and dampened validation practice, but the persistent conflation of the very different concerns about test score meaning and use that had its roots in Messick’s writings has also had regrettable consequences for the field of measurement itself. It is an embarrassing state
of affairs that there is substantial disagreement among measurement specialists about the concept deemed to be the most fundamental in that field (Salamanca, 2017). These problems brought me to a place in my validity journey where I realized that my concern about the place of consequences in discussions of validity was merely a tangential aspect of a larger issue, and that a wholesale reconceptualization of validity was in order (see Cizek, 2012; 2016). What was needed was a fundamentally expanded, coherent framework that provided parallel and straightforward sources of guidance toward the goal of defensible testing. That guidance would comprise: (1) potential sources and procedures for gathering evidence to support a claim that a score resulting from a measurement procedure can be trusted to signify what it is intended to (i.e., validity); and (2) potential sources and procedures for gathering evidence to support a claim that a score resulting from a measurement procedure can justifiably be used in some proposed manner. Thus, my journey has led me to this point and this purpose: to develop a unified and comprehensive framework for defensible testing. As to the first—necessary, but insufficient—endeavor (i.e., gathering support for an intended score meaning), there are well-established, longstanding, and respected traditions—primarily psychometric—for identifying and synthesizing that evidence. Those traditions are perhaps most clearly set forth in the Standards for Educational and Psychological Testing (AERA, APA, & NCME, 2014), which are the most recent edition of technical guidance that can be traced back to the Technical Recommendations for Psychological Tests and Diagnostic Techniques, first published by the American Psychological Association in 1954. Although some minor modifications are needed to the standard guidance regarding potential sources of evidence supporting intended score meaning, there is broad consensus, rooted in the Standards, about those sources. That consensus and some suggestions for modifications to the consensus, and the first part of a framework for defensible testing, are put forth in Chapters 2 and 4 of this book, respectively. There are, however, no corresponding traditions and standards for potential sources of evidence supporting intended score use(s). Thus, this volume describes a menu of options for where such evidence might be found, rooted largely in the traditions of program evaluation. Chapter 5 presents the second part of a framework for defensible testing and the complete model. The other chapters provide context for that model and implications concerning its application. Chapter 1 comprises a review of some foundational measurement concepts underlying validity, including defining what a test is and the role of constructs and inferences in testing. Chapter 1 also defines validity and provides a rationale for why concern about validity is so important. Chapter 3 addresses a lingering controversy in validity theory—the idea that “consequences of testing” is a source of validity evidence. Beyond demonstrating the error of that notion, the chapter provides support for how consequences of testing must be considered in a truly comprehensive approach to defensible testing.
In terms of both the evidence needed to support the validity of an intended test score interpretation and the evidence needed to support an intended test use, Chapter 6 addresses the vexing and omnipresent question of “How much evidence is enough?” Finally, Chapter 7 considers some overall conclusions and suggests next steps for the future of theory, research, and practice regarding test score validation and test use justification. Each of the chapters in this volume follows a stylistic convention that is unconventional. That is, much of the work on validity theory and practice has delved routinely into metaphysical contemplations of the nature of knowledge, reality, and truth. Such work is valuable. However, the focus of the present treatment of validity is squarely pragmatic—an attempt to foster improved practice in gathering evidence to support the intended meaning of test scores and to heighten attention to the need for gathering evidence to support the intended uses of test scores. In addition to laying out the basic aims of this volume, it seems appropriate first to also set out some of what this book does not intend to accomplish. Namely, although the philosophical underpinnings of something as important as validity are surely worthy of examination, those underpinnings are not the focus of this volume. Although intending to provide a solid grounding in validity theory, this book anticipates an audience of rigorous scholars and practitioners who seek a meaningful introduction to modern validity theory and the practice of validation, but who have no wish to be philosophers of science. That is not to say that the concepts treated in these pages are ascientific, atheoretical, or loosed of any moorings in the nature of knowledge, just that those moorings are intentionally not highlighted. The primary intent of this book is to provide readers with a comprehensive, actionable framework for sound assessment development and evaluation. The central purposes are two-fold: (1) to provide researchers, psychometricians, psychologists, survey methodologists, graduate students, and those who oversee credentialing, achievement, or other testing programs with a complete, accessible and up-to-date introduction to modern validity theory; and (2) to equip practitioners concerned about defensible testing with the tools necessary to go about the good work of validating score meaning and justifying score use. In closing this preface, I am compelled to acknowledge three groups of people. First, I am indebted to those who have engaged in the great work on validity that has preceded this book. As readers will notice on skimming the References section, the work here builds on the work of numerous scholars, many of whom have spent their entire career advancing validity theory and practice. That work has not been conducted as an end in itself, but as an effort focused on providing more accurate, meaningful, unbiased, and actionable information to a variety of stakeholders, including students, teachers, program directors, candidates for licensure or certification, legislators, policy makers, and the general public. Among the most influential validity theorists is Samuel Messick. At various junctures in
this book, I express strong viewpoints regarding what I regard as flaws in modern validity theory, often citing Messick’s work; doing so often made me feel as if I were tugging on Superman’s cape. To be clear, Messick did not invent the term consequential validity, although his work supported the mixing of issues related to test score meaning and test score use that I believe has had lingering negative effects. However, despite what I view as problems with some of Messick’s formulations, his contributions to validity theory are without parallel. This book surely will not add as much to validity theory and practice as Messick has, and he still stands as one of the preeminent and respected scholars in our field. Second, I have been fortunate to collaborate with numerous professional colleagues who have provided opportunities to pilot the ideas in this book, offered the input of a critical friend, freely given of their expertise, or encouraged my continuing journey. Among the many such important influences have been Dr. Michael Bunch of Measurement Inc., Dr. Wayne Camara of ACT, Inc., Professor Gregory Camilli of the Law School Admission Council, Professor Edward Haertel of Stanford University, Professor Ronald Hambleton of the University of Massachusetts-Amherst, Professor Peter Halpin of the University of North Carolina at Chapel Hill, Professor H. D. Hoover of the University of Iowa, Dr. Michael Kane of ETS, Professor Robert Lissitz of the University of Maryland, Professor William Mehrens of Michigan State University, and Dr. Jon Twing of Pearson. All of the sage input of these scholars notwithstanding, any errors in conceptualization and expression are solely my own. I also appreciate the support for this work provided by the School of Education at the University of North Carolina at Chapel Hill, and the encouragement of Dean Fouad Abd-El-Khalick to pursue this work with support from the School of Education research leave program. I am indebted to the publisher of this book, Routledge, which has a long and successful history of publishing important works in the social sciences. I must particularly recognize Daniel Schwartz for his advice and enthusiasm for this project. Finally, I am grateful for the continuing support of my wife, Julie, who I join in thanking God for showing undeserved goodness to us each day.
1 INTRODUCTION
Validity has long been one of the major deities in the pantheon of the psychometrician. (Ebel, 1961, p. 640)
Any treatment of the topic of validity must first acknowledge the central importance of the concept to any measurement process developed or used in the social sciences. As many authoritative sources have asserted—and as is affirmed here—validity is the most important and essential characteristic of test scores. According to the Standards for Educational and Psychological Testing, “validity is … the most fundamental consideration in developing tests and evaluating tests” (2014, p. 11). And, the importance of validity is ongoing. Validity is not only pursued as a single time point endeavor: concern for validity should pervade the entire testing process—from when a test is first conceptualized to when test scores are reported. As a beginning point for appreciating the importance of validity, this chapter first reviews some key prerequisites that are necessary for its understanding. These ideas are broadly applicable and underlie the compelling need for attention to validity across highly diverse testing applications. To aid readers who may not already be familiar with some background concepts that are essential for understanding validity and fully engaging with the content of this book, key terms such as test, inference, construct, and assessment are first defined and illustrated. Of course, assertions about the preeminence of validity beg two questions that this chapter will also address: (1) “What is validity?” and (2) “Why is validity important?” Along those lines, the second aim of this chapter is to present a definition of validity that will serve as a reference point. The concept of validity will be presented in both technical and practical terms. Finally, with a definition of validity in place, the ways in which validity is related to other aspects of defensible testing practice will be examined. Accordingly, the third aim of this chapter is to briefly introduce a substantial reconceptualization of validity, nested in an overall framework for defensible testing. This reconceptualization will be more completely elaborated and illustrated in subsequent chapters, but is foreshadowed in this chapter to give the reader a sense of what lies ahead.
Foundational Measurement Concepts Underlying Validity In this section, foundational concepts necessary for fully understanding validity are presented, including test, inference, construct, and assessment in the social sciences. Examples of each of these concepts from diverse areas of the social sciences are provided.
What Is a Test? In the social sciences, a straightforward and broadly generalizable definition is that a test is a sample of information about some intended characteristic of persons that is gathered under specified, systematic conditions. However, the simplicity of this definition belies the fact that “test” is a frequently misunderstood concept on at least two counts. Tests as Samples First, it is important to realize that a test is a sample, and only a sample, of a test taker’s knowledge, skill, ability, interest, or other attribute which cannot be directly observed and about which information is desired. It is often incorrectly concluded that a test score represents a highly definitive, concrete, or conclusive piece of information about a test taker. The fact that a test is only a sample of a test taker’s responses suggests otherwise. Although we might want to get as much information as possible about some characteristic, it is typically impossible or impractical to observe everything about a test taker. Indeed, the sample of information collected by a test may be very small. Tests typically capture only a small portion of what could be observed, so it is essential that the sample is one that is carefully structured. To illustrate this first principle of testing, it is useful to consider some extremes. It is perhaps obvious that it would not usually be of greatest interest—or a basis for awarding a medical license—whether a medical student could correctly respond to the following, specific, multiple-choice question: “Which medication should not be given to a child at risk of Reye’s Syndrome?” Instead, a medical board responsible for such a test would typically wish to extrapolate from the examinee’s correct response to this question (“aspirin”) and several others like it, to a larger domain of knowledge about contraindications for various drugs as one component for making a licensure decision. Likewise, it would not usually be of interest that a third grade student was name-calling on the playground during the first recess period last Tuesday. Instead, an educator trying to understand the extent of an elementary school’s bullying climate would typically wish to make several systematic observations across various students, grade levels, days, and contexts. A basketball scout would not want to offer a contract to a potential player having seen the player attempt a single free throw. And, a voter’s response to “Do you favor or oppose more metered, on-street parking?” would not usually be very helpful to a pollster in determining the person’s political philosophy.
It should be clear that in each of these situations, making a judgment about a medical student’s competence from a single question on aspirin, conclusions about a school’s bullying climate from a single recess observation, judgments about a player’s potential from a single free throw attempt, or impressions of a voter’s political orientation based on his or her position on a single issue are likely to be both highly undependable and inaccurate. That is why the medical licensure examination samples more broadly, including several questions about various drugs that are most likely to be encountered in practice; it is why the educator performs a number of observations across various contexts where bullying is likely to be experienced in elementary schools; it is why sports teams view many examples of a potential player’s performances; and it is why the pollster’s focus groups would touch on a range of political and policy topics that commonly reflect differential political attitudes. In summary, as regards the first essential characteristic of a test, it should be recognized that even one question on drugs, a single recess observation, a single free-throw attempt, or a single interview about on-street parking is a test. They all qualify as tests because each one represents the collection of a sample—albeit a very small one—of the test taker’s knowledge, skill, attitude, and so on. However, it is probably also obvious that a test, though only a sample, provides the most accurate and dependable information about a test taker’s knowledge, skill, or attitude when it is more carefully and comprehensively constructed. Tests: Agnostic as to Format The second aspect of what constitutes a test is what a test is not. Referring to the definition of test provided earlier, it can be seen that its meaning is untethered to any specific format. Sometimes—but again incorrectly—it is believed that “test” connotes a collection of multiple-choice questions, bubble sheets, number 2 pencils, and a stopwatch. Although multiple-choice questions administered in a standardized way, under timed conditions, and scored by optical scanners might qualify as a test, it is only one of many possibilities. The type of tests just described may be used more often in large-scale education contexts for gauging student achievement; however, this configuration of a test may be rarely or never used in most other social science contexts. What defines a test has little or nothing to do with the format or type of questions or tasks presented to test takers. And, as will be shown, although a degree of standardization is useful for some purposes in testing, it is essential to recognize that the characteristics of how test takers provide responses (e.g., orally, on bubble sheets, as performances), how test taker responses are scored (by humans, by scanners, by automated scoring algorithms, etc.), and other features of the setting, timing, and aids used during testing may be largely irrelevant to something qualifying as a test. So what makes a test, a test? Recalling the definition provided earlier, because a test, broadly conceived, is any systematic sample of a person’s knowledge, skill, attitude, ability, or other characteristic
collected under specified conditions, there is a vast number of configurations that would qualify as a test. The issue of care and comprehensiveness in sampling has been addressed; we now turn to the conditions that must be in place so that the sampling yields dependable and accurate information. Standardized Tests It was mentioned earlier that some degree of standardization is useful, where standardization simply refers to the prescriptive administration conditions that a test developer has indicated should be followed. A test developer will typically conduct research to determine, and then carefully specify, the conditions that must be in place for the results of testing to have the meaning intended. The collection of prescribed administration conditions for a given test—ranging from few, informal guidelines to many, highly prescriptive and detailed procedures and prohibitions—is what makes a test “standardized.” The medical licensure examination described previously would be called standardized to the extent that certain content coverage was mandated and/or specific time limits were in place. The observations of bullying would be called standardized to the extent that they were collected at prescribed periods during the school day, in specific contexts. The scouting of professional athletes would be called a standardized test to the extent that players were required to attempt the same number and types of basketball shots using regulation equipment and perform the same physical demonstrations. The pollster’s interviews would be called standardized tests to the extent that the same topics were addressed, the same questions asked, and a common checklist was used for noting responses. It should be noted that, although the second characteristic of a test is the presence of specified administration conditions, there may also be deviations from the standard administration conditions that the test developer deems to be allowable because they do not alter what the test attempts to measure or how scores on the test can be interpreted. These allowable deviations are often referred to as accommodations—changes in various aspects of testing that can support the validity of scores obtained from a test. The Standards provide some general guidance on accommodations and distinguish them from changes that do alter the construct being assessed (called test modifications) and undermine the intended interpretations of scores on a test (see AERA, APA, & NCME, 2014, pp. 59–62). A list of categories of some common aspects of testing that a test developer might prescribe is provided below; a more elaborated list with several examples of each category is shown in Table 1.1.
Table 1.1 Aspects of testing a test developer might prescribe/allow to promote validity (general category: specific examples)

Mode of presentation: Paper or computer-based presentation; text, audio, video presentation; font type, size, or other print or screen characteristics (e.g., display size, resolution); test instructions or questions read aloud to test taker; Braille, ASL, or alternative language presentation of test directions or materials

Mode of response: Written, key-entered, bubbled, oral, performance; real-time response vs. recorded response; use of a “scribe” to record responses

Test scheduling: Fixed date vs. on-demand test scheduling; defined testing “windows” during which test may be taken at any point in a range of dates; time of day (e.g., morning, afternoon); fixed vs. flexible order in which sections of a test must be taken

Test setting: Seating configurations (e.g., “sit a seat apart”), specifications for computer screen orientation, spacing, dividers, etc.; group or individual administration; allowable setting variations (e.g., distraction-free setting, quiet room); prescribed lighting, temperature, ventilation, seating, work surfaces, etc.; prohibited test setting materials (e.g., charts, posters, maps, or other materials on walls, doors, desks, etc.)

Test timing: Specified time limits vs. allowable extended time; allowable breaks, frequency and duration of breaks between test sections or on-demand

Assistive tools: Use of single-language dictionaries, language-to-language dictionaries, glossaries, reference materials; use of highlighting tools, calculators (paper or computer based), alternative keyboards, touch-screen, switch access, or eye-gaze communication devices, amplification
• Mode of Presentation. The test developer might specify the way in which the test directions, questions, tasks, or prompts are presented to test takers.
• Mode of Response. The test developer might specify the way in which test takers provide their responses to the questions, tasks, etc.
• Test Scheduling. A test developer might mandate the month, day, or range of time when a test must be taken.
• Test Setting. The actual physical layout of the space in which examinees take the test may be specified.
• Test Timing. The allowable amount of time for test takers to complete the test may be specified.
• Assistive Tools. The test developer might specify which, if any, aids test takers are permitted to use during the test.
Although there may be wide variation in the administration conditions that a test developer specifies, it is the reason that some testing aspects are standardized—that is, why certain test administration conditions are specified—that is most important. Test developers
specify the test presentation mode, mode of responding, scheduling, timing, and other aspects in order to facilitate trustworthiness of the information that the test can provide, and to permit comparisons of information gathered across different persons, settings, or occasions. To the extent that some conditions are left unspecified, the information yielded by the diverse conditions may be variable in terms of its dependability or accuracy. Similarly, variations can introduce confounding factors that make it difficult or impossible to determine if differences in test takers’ performances were due to actual differences in the test takers themselves, or merely artifacts of the differing test administration conditions they experienced. Tests and Assessments As a brief aside, sometimes the term assessment is used incorrectly as a synonym for test. For example, it is not uncommon to hear an educator state that they rarely give tests anymore, but they do engage in frequent assessment of students. When pressed on what the student assessments actually look like in practice, they would be—according to the definition presented previously—quite accurately called tests. Prior to its use in education contexts, the term assessment was correctly applied in many other fields, including finance, medicine, and psychology. For example, when a retirement advisor conducts a financial assessment for a client, he or she examines a diverse set of variables (e.g., personal savings, stock holdings, pensions/individual retirement accounts, and other investments, along with liabilities, retirement goals, and timeline) to provide an overall summary and plan for the client. In the field of medicine, health care workers in an emergency department must rapidly perform an assessment of incoming patients. The assessments may take the form of measurements of blood pressure, temperature, respirations, reflexes, blood tests, x-rays, and other diagnostic procedures. The aim of this constellation of tests is to synthesize all of the diverse pieces of information in order to arrive at a diagnosis and treatment plan. Thus, an assessment is best defined as the collection of many samples of information—that is, many tests—toward a specific purpose. In the context of education, Cizek (1997, p. 10) defined assessment as “the planned process of gathering and synthesizing information relevant to the purposes of: discovering and documenting students’ strengths and weaknesses; planning and enhancing instruction; or evaluating and making decisions about students.” In every case, assessment involves collecting and summarizing information in order to develop a course of action uniquely tailored to an individual’s needs. In education, the context where the term assessment is perhaps used most accurately is that of Individualized Educational Planning (IEP) team meetings. Such meetings might involve the teacher of a student providing observations of the student’s in-class achievement and behavior, the parents of the student providing insights on the student’s at-home experiences, the school psychologist explaining the results of cognitive testing that has been
performed, and so on. Each of these contributions can be thought of as providing “test” information, because each is a sample of information about a student. The assessment involves aggregating all of the diverse sources of information, arriving at some tentative conclusions about what is happening for the student, and developing some tentative plans regarding appropriate placements, interventions, and supports. Overall, the terms test and assessment have quite different meanings and cannot appropriately be used interchangeably. Nonetheless, it is also likely that the terms are now so pervasively used as synonyms that fussing about the distinction is fruitless. Conclusions about Tests In summary, there are two primary characteristics that define a test: (1) a test is a sample of how a test taker responds to items, tasks, performance demands or other structured requests for information related to some attribute of the test taker that cannot be directly observed, such as knowledge, ability, attitude, and so on; and (2) the sampling of information is conducted in a systematic, specified way so that the test administration yields consistent, accurate, and comparable results.
Inference: Always Required, Sometimes Risky A second essential concept related to validity is that of inference. In describing tests in the preceding section, it was noted that it is typically feasible for a test to capture only a small sample of what could be observed regarding some characteristic of a test taker. However, it is nearly always the case that we wish to—or need to—reach a conclusion or make a decision about the test taker based on that sample of information. For example, suppose it was expected that elementary school students would memorize a set of 169 multiplication facts ranging from 0 × 0 = 0, 0 × 1 = 0, 0 × 2 = 0 … to 12 × 12 = 144. Further suppose that the teacher created a ten-question quiz over the multiplication facts. Now, it would not ordinarily be of great interest to an elementary school teacher that a student can correctly respond to the question: “4 × 8 = ?” For that matter, a student’s answer to any specific question on the quiz would not likely be of great interest to the teacher. Rather, the teacher would intend to arrive at a conclusion regarding the student’s mastery of the whole set of multiplication facts based on the student’s overall performance on the quiz. That is, a teacher would likely conclude that a student who answered nine out of the ten questions correctly had “superior” command of the multiplication facts, whereas a student who answered only one of the ten correctly had very poor mastery. In each case, the teacher would like to extrapolate from the observed quiz performance to reach a conclusion about a student’s overall mastery level of the entire domain of multiplication facts. This act of going from observed test results to a conclusion about a test taker’s standing
on some attribute is called inference. The measurement specialist best known for his work in item response theory, Benjamin Wright (1998), described inference this way: I don’t want to know which questions you answered correctly. I want to know how much … you know. I need to leap from what I know and don’t want, to what I want but can’t know. That’s called inference. Inference is central to all measurement in the social sciences. Because, as noted previously, our samples of information about examinees are necessarily brief and incomplete, and because we typically seek to make conclusions about test takers regarding their standing on attributes that cannot be directly observed, we are limited to making inferences. This is, of course, not a bad thing: gathering and correctly interpreting all possible information about a test taker would be prohibitive in terms of cost, time, and burden for all involved. And, the act of making inferences is not limited, but ubiquitous. We make inferences about student understanding from their contributions to a class discussion; we make inferences about a teenager’s driving skill from spending only a few minutes as a passenger; we make inferences about a physician’s rapport with patients from a short clinical interaction; we make inferences about the need for an umbrella from a quick glance outside; we make inferences about potential for a friendship from a brief conversation. As should be obvious, we make inferences—leaps from what we observe to conclusions about what they mean and how to proceed—all the time. Inference is a necessary, useful, and omnipresent activity. Inferential “Leaps” As regards testing, it is important to recognize two qualities of all inferences. The first quality of any inference is that the “leap” required when making an inference can be shorter or longer. Shorter inferential leaps are sometimes referred to as low inference; longer leaps are sometimes referred to as high inference. For example, consider that the only thing a mathematics instructor wants information on is whether a student can recall the formula for the Pythagorean theorem and its meaning. The instructor asks the student to write the Pythagorean theorem in standard notation followed by a brief explanation of what the formula means. The student writes “A² + B² = C². The sum of the squared lengths of each leg of a right triangle equals the squared length of the triangle’s hypotenuse.” This situation requires a very tiny “leap”—that is, a very low degree of inference—on the instructor’s part in concluding that the test taker can indeed recall the formula and what it means: the student just did it! In testing situations like this one that are configured to require low-inference interpretations of test taker performance, we can typically be quite confident in our conclusions about the test taker’s knowledge. Now, however, consider that a flight instructor wants information on whether a student
can safely land a single-engine airplane. The instructor could ask the student to list the steps in landing an airplane, to select a multiple-choice response that lists the steps in correct order, or to orally describe how he or she would land an airplane. Each of these would involve a huge leap—that is, a very high degree of inference—on the instructor’s part to conclude that the student can land an airplane safely. In testing situations like this one that require a high degree of inference, we would typically have very low confidence in conclusions about the test taker’s skill. A test requiring a shorter inferential leap might be desirable, but it also would likely involve actually flying with the student and asking the student to land the plane safely. As can be seen in the foregoing examples, there are typically trade-offs that must be made when choosing higher- or lower-inference testing procedures. Higher-inference testing configurations (often, multiple-choice questions) typically require longer inferential leaps; they are more economical and less time-consuming, but they are also less authentic and afford less confident conclusions about test takers’ knowledge, skill, or ability. Lower-inference testing configurations such as demonstrations and performances typically require shorter inferential leaps; they are often more expensive and more burdensome on both the test taker and test administrator, but they are also more authentic and offer greater confidence in our conclusions about test takers’ knowledge, skill, or ability. Inferences: Necessarily Tentative The second important point about all inferences is that they should always be made cautiously and conclusions should be considered to be tentative. Put simply: inferences can often be wrong. Consider the hypothetical situation in which a person hears a commotion at 4:00 a.m. outside his or her third-floor apartment. Looking out the window, illuminated only by a dim street light, the person sees what appears to be a male, dressed in a sweat suit, sneakers, wearing a baseball cap, and dashing down the sidewalk below, clutching a purse. A mugger? Perhaps. That would be a plausible inference. But perhaps equally likely is that the person is racing to catch the owner of a purse that fell out of the car she was getting into just moments earlier. Or perhaps it is a man responding to his wife’s phone call to hurriedly bring the purse she forgot at home to the bus stop where she is waiting. Or perhaps it’s not even a man (that was an inference also), but a female racing with the purse and just doing her morning run. Overall, whereas some inferences may be more plausible than others, all inferences are based on incomplete information. Thus, inferences—especially test-based inferences that are tied to important educational, career, social, or other decisions—should be made cautiously, based on as much high-quality information as practical, and subject to reconsideration as more information becomes available.
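The sampling-and-inference logic of the multiplication quiz described earlier can be sketched numerically. The short Python simulation below is a hypothetical illustration, not drawn from the book: a student who has actually mastered some fixed but unobservable share of the 169 facts is quizzed on random ten-item samples, and each quiz score serves as an inference about overall mastery. The variation across repeated quizzes shows why test-based inferences should be treated as tentative.

```python
import random

# Hypothetical illustration (not from the book): inferring mastery of the
# 169 multiplication facts (0 x 0 through 12 x 12) from ten-question quizzes.

random.seed(1)

DOMAIN = [(a, b) for a in range(13) for b in range(13)]      # all 169 facts
TRUE_MASTERY = 0.80                                          # unobservable in practice
known = set(random.sample(DOMAIN, int(TRUE_MASTERY * len(DOMAIN))))

def quiz_score(n_items: int = 10) -> float:
    """Sample n_items facts at random and return the proportion answered correctly."""
    sampled = random.sample(DOMAIN, n_items)
    return sum(fact in known for fact in sampled) / n_items

# The same student, quizzed on different ten-item samples, produces different
# observed scores -- each one an inference about the same underlying mastery.
observed = [quiz_score() for _ in range(5)]
print("Observed quiz scores:", observed)
print("True (unobservable) proportion mastered:", TRUE_MASTERY)
```

Enlarging the sample of facts on each quiz would, of course, narrow the spread of observed scores, which parallels the point that inferences are better warranted when based on larger, more carefully structured samples of information.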
Constructs: The Objects of Social Science Research and Development The final key term foundational to the concept of validity is construct. For the most part, the things of greatest interest in any social science context are latent variables—characteristics or attributes that cannot be directly observed but which leave indications of their presence or magnitude in situations designed to elicit them. Constructs don’t exist; they are concepts that are “constructed” by social scientists because of their instrumental value in describing individual differences and observed regularities (or irregularities) in human behavior. For example, it would seem useful to have language that describes a person—let’s call her Prudence—who finds a wallet, identifies its owner, and returns the wallet (with its contents!) to its rightful owner; who tells the truth even when personally inconvenient, embarrassing, or at risk of punishment; who doesn’t exaggerate deductions when computing her income taxes. Of course, not everyone acts in these ways. Lilith might keep the wallet, cheat on her taxes, consistently lie, and so on. Oscar might return the wallet, but cheat on his taxes occasionally, and tell the truth most of the time. In short, the characteristic we observe is not a binary; we notice these individual differences in people and see the characteristic expressed along a continuum. The term honesty was “constructed” to describe this varying characteristic, with some people having “more” of the characteristic as we have operationalized it and some people having “less.” The most important point here is also perhaps the most difficult: honesty does not truly exist. It is simply a term of convenience, invented to provide language for communicating about the differences in behavior observed among people. As so artfully phrased by Crocker and Algina, constructs are “the product of informed scientific imagination” (1986, p. 230). Constructs are legion in social science research, and comprise nearly all of the things in which social scientists are interested. In addition to honesty, just a few examples of constructs would include:
• math problem solving ability
• anxiety
• bullying
• reading comprehension
• creativity
• teamwork
• clinical competence
• intelligence
• woodworking skill
• depression
• Spanish-language fluency
• patient rapport
• leadership potential
• introversion
• college readiness
• and many others.
In short, nearly all of the valuable outcomes in the social sciences, nearly all variables considered worthy of study, and nearly every intervention developed by social scientists targets one or more constructs.
Tests, Inferences, and Constructs Having developed some foundational concepts underlying validity, we will turn to actually defining validity shortly. For now, however, it seems useful to explicitly link some of the concepts just defined. How are tests, constructs, and inferences related? Tests are designed to elicit responses that provide information about a person’s standing on a construct. For example, a medical board might develop a test to assess potential physicians’ clinical judgment. Note that clinical judgment as used here is the construct of interest. In developing the test, the board might decide on a variety of formats that would provide opportunities to gain information about candidates’ levels of this characteristic. As an entirely separate matter, the board might wrestle with the question of “How much clinical judgment ability should we require in order to judge a medical student as adequate for safe and effective practice?” In effect, the board would be implicitly recognizing that medical students differ in their clinical judgment skills; at some point along a continuum of very poor or absent clinical judgment to very keen or exemplary clinical judgment, the board might establish a level of clinical judgment skill that was “enough.” As to format, the board might consider a multiple-choice situational judgment test (see McDaniel & Nguyen, 2001), in which realistic clinical situations were presented to candidates in text, video, or other medium, and questions would accompany each scenario asking candidates to make various judgments. In responding to each question, candidates might be directed to choose the single correct or best response from among four answer choices. Or, candidates might be instructed to rank each of several possible actions from best/most appropriate clinical judgment to worst/least appropriate clinical judgment. Or, they might be instructed to select the best three of seven actions, and so on. These alternatives are merely a few of the board’s format options. Beyond a multiplechoice format, the board might consider one-on-one interviews with candidates, simulated patient interactions observed and scored by trained raters (Ainsworth et al., 1991), standardized checklists completed by program directors during candidates’ clinical experiences, and others. Each of these formats would appropriately be called a test that
would (assuming the test is of adequate psychometric quality) support inferences about the target construct—namely, candidates’ levels of clinical judgment. Recalling that tests are only samples of information, the board would almost certainly opt for more than one of these items to comprise the test. Ideally, each of the items would elicit information about the construct of interest. Along these lines, it may be useful to think of the individual components of a test (e.g., the items, tasks, questions, etc.) as a collection of tuning forks. One way in which a tuning fork most commonly comes to mind is in the action of striking the tuning fork, causing it to vibrate at a certain frequency, and observing the tone emitted. Any individual tuning fork emits a specific frequency when struck; tuning forks emit different tones depending on their size, shape, length of prongs, and other factors. For example, a tuning fork, when struck, emits a tone at 440 hertz (Hz)—the note (A) played on the first string of a correctly tuned viola. A less commonly known phenomenon involving a tuning fork is called sympathetic resonance. In some situations, a tuning fork will vibrate when it is not struck, but when it is in the presence of the same frequency to which it is tuned. Following the above example, the 440 Hz tuning fork will not only emit its tone when struck, but it will also emit that same tone when the first string of the correctly tuned viola is played in its presence. Notably, the tuning fork will not resonate in the presence of another pitch, such as that emitted by an out-of-tune viola, the pitch emitted by the second or third string of a viola, and so on. In creating tests involving, for example, a collection of multiple-choice items or tasks, the individual items and tasks can be thought of as individual tuning forks, each intended to resonate in the presence of the construct the test is designed to measure. Ideally, every item or task comprising a test would be equally good at detecting (“resonating with”) the presence of the intended construct. By analogy, a collection of test items that exhibited sympathetic resonance at 500 Hz would fail to accurately detect the presence of a 440 Hz tone. However, a collection of test items expertly designed and crafted to exhibit sympathetic resonance at 440 Hz would clearly show when that tone was present and, by their failure to exhibit sympathetic resonance, would clearly show when a tone of 440 Hz was not present. In summary, tests comprise opportunities to obtain samples of information about characteristics of interest. Tests are designed to detect variation in those characteristics that cannot be directly observed. Nearly all of these characteristics can be called constructs: they cannot be directly observed, and they do not truly “exist” in any physical sense. Constructs function merely as convenient labels to describe observed variations in areas of human behavior that we find to be interesting, valuable, useful, or related to other characteristics or variables. Furthermore, the results of tests do not represent definitive findings; rather, the results of a test are intended to support inferences about test takers’ standing on the underlying construct. Those inferences are typically more warranted when based on larger, more complete samples
of information, derived from well-designed and carefully administered tests. However, regardless of how well-designed and carefully administered a test may be, the potential for making incorrect inferences based on test results is ever present. Any inferences about standing on a construct derived from use of a test should always be considered tentative, and evaluated based on the level of support for the inference provided by the test and/or other sources of information.
Validity Defined In this section, foundational concepts that provide necessary background for understanding validity have been presented. But what is validity? Incorporating some of the foregoing concepts into a definition yields the following: Validity is the degree to which scores on an appropriately administered test support inferences about variation in the construct that the instrument was developed to measure. There are several aspects of this definition that warrant brief elaboration here, with greater attention to come.
Validity Concerns Intended Score Meaning First, it is perhaps obvious that this definition focuses squarely on the intended meaning of test scores and the degree to which there is support for the inferences that a test developer intends to be made from those scores. A clear implication of this focus is the requirement that a test developer actually, formally, clearly, and publicly state the intended score inference(s). It is somewhat surprising—given that erroneous test score interpretations are unintentionally invited when such formal statements are lacking—how frequently test documentation (ranging from candidate preparation materials to score reports to technical manuals) lacks this essential foundation for understanding the meaning of a test performance. An equally clear recommendation is that all test developers begin the test development process by producing such a formal statement of intended score meaning. Doing so not only provides essential information to users of the test scores, but also serves to keep the test development process focused on the intended score meaning and—most importantly as regards the topic of this book—provides a roadmap for appropriate validation efforts.
Validity Is a Property Second, it should be noted that validity is a property or quality of the evidence in support of intended test score meaning. In subsequent chapters, some possible ways of gathering and
evaluating that evidence will be presented, and we will find that there is broad agreement that, as Messick has stated: "What is singular in the unified theory is the kind of validity: All validity is of one kind, namely, construct validity" (1998, p. 37). However, the very notion of "kinds" of validity is misguided. If validity is a property or quality of evidence in support of the intended meaning of a test score, it makes no sense to talk about one kind, or many. Consider any other context in which the quality of something is expressed. Suppose one were describing the quality of apples in terms of sweetness. There may be many different kinds of apples (McIntosh, Gala, Granny Smith, Red Delicious, and so on, just as there are many different kinds of tests), but there are not different "kinds" of sweetness, just different degrees of that property and different ways of attempting to characterize it.
Validity Does Not Define a Construct Third, according to the definition of validity provided, a test does not define the construct it seeks to measure but is, ideally, highly responsive to it. The situation is analogous to that of the tuning fork described previously. Unperturbed, the tuning fork remains silent. And, in the presence of a stimulus frequency other than its own, the fork does not respond. However, when a source producing the same natural frequency is placed near the tuning fork, the fork resonates. The tuning fork is not the frequency; tuning forks do not define pitch; they merely respond to the presence of the frequency they are designed to produce. By extension, a test is not the construct, and an instrument does not define a characteristic. Rather, a good test—that is, one that yields accurate inferences about variation in a characteristic—will resonate in the presence of the construct it was designed to measure. It should also be emphasized that this attribute of responsiveness of a test to the characteristic it was designed to measure does not mean that mere covariation is sufficient evidence of validity. High fidelity between variation in scores on an instrument and underlying construct variation affords some confidence that the intended inferences about the construct are appropriate. Low fidelity suggests that the characteristic being measured has not been adequately conceptualized or that factors other than—or in addition to—the intended construct play a large role in score variation; in either case, caution is warranted when making inferences from scores about persons' standing on the characteristic.
Validity of Intended Inferences and Attention to Test Score Use Fourth, it is perhaps also obvious that the preceding definition of validity does not incorporate the use of test scores. It does not allude to concepts such as fairness, social justice, inappropriate uses of test scores, or uses for which the test was not intended. That omission is purposeful and, as we will see in a subsequent chapter, both logically and practically imperative.
Decades of conflating the notions of test score meaning and test score use have had far-reaching and deleterious effects. Well-intentioned attempts to force concern over test use into a "unified" definition of validity that also incorporates attention to score meaning have been far from unifying. The attempts have confused test users, flummoxed those involved in test validation, and left the field of measurement with a theory and practice in its own specialty that lack what the field itself asserts to be "the most fundamental consideration in developing tests and evaluating tests" (AERA, APA, & NCME, 2014, p. 11). Although many examples of how confused the field of measurement is about the concept of validity can be seen in the dozens of theoretical articles published on the question, there may be no better practical example than that provided by perhaps the foremost contemporary validity theorist, Michael Kane, who has stated that "I am more concerned about how to evaluate the validity of proposed interpretations and uses of test scores, than I am in getting a very precise definition of the term validity" (2012, p. 66). It is difficult to imagine how one could assert great concern about something for which one lacks a clear definition. It is easy to imagine how a field could be conceptually adrift in such a situation. All of that is not to say that concern about the use of test scores is less important than validity, that it is merely a secondary concern, or that it has no place in a comprehensive framework for defensible testing. Far from it. Indeed, as we will see, just as there are long-standing standards and traditions (primarily psychometric; see Chapter 2) for how one can go about gathering and evaluating evidence in support of an intended score meaning (notably, the Standards for Educational and Psychological Testing, AERA, APA, & NCME, 2014), there should be equally well-developed and instructive standards and traditions for how one can gather and evaluate evidence in support of an intended score use (see Chapter 5).
Validity and Validation Finally, and following up on the preceding definition of validity, it is useful to also define the term validation. Although the terms are sometimes used interchangeably—and, again, to harmful effect—they are clearly different. Whereas validity is a property, validation is a process. Validation is the ongoing process of gathering, summarizing, and evaluating relevant evidence concerning the degree to which that evidence supports the intended meaning of scores yielded by an instrument and inferences about standing on the characteristic it was designed to measure. That is, validation efforts amass and synthesize evidence for the purpose of understanding and articulating the degree of confidence that is warranted concerning intended inferences. This definition is consistent with the views of Messick (1989), Kane (1992), and others who
have suggested that validation efforts are integrative and subjective, and can be based on different sources of evidence such as theory, logical argument, and empirical findings. This definition of validation also bears on Messick's previously referenced assertion that all validity is construct validity. While it may be tolerable shorthand to speak of all validity as construct validity, that construction is too simplistic. What is more accurate is to say that all validation is conducted for the purpose of investigating and arriving at judgments about the extent to which scores yielded by an instrument support inferences with respect to the intended construct of interest.
Why Is Validity So Important? It may be tempting to conclude that the preceding attention to the concept of validity has focused on what are mainly technical or scientific concerns about a concept primarily of interest to those in the field of psychometrics. Such a conclusion would be seriously inaccurate. The validity of test scores—that is, the degree to which there is evidence that the scores can be taken to mean what they are intended to mean—is of great consequence for all stakeholders who take, interpret, or use tests. This is because tests are commonly used to inform important decisions that must be made. To be clear, test scores are rarely the only piece of information used in such decisions, but they are often the piece of information that is of the highest quality—or at least the piece of information for which the quality has been researched and made available. The decisions based on test results are both varied and consequential. For individuals, they include admission to colleges, awarding/denial of licenses to practice in one's chosen field, placement in specialized educational programs, diagnosis of psychological issues and eligibility for treatment, and selection for jobs and other opportunities with social, educational, economic, or professional implications. In what is surely a fundamental, consequential, and ubiquitous context, the very freedom to move about typically depends on the score one obtains on a driver's license test. Test scores affect not only individuals, but also organizations and social institutions. For better or worse, school systems are often evaluated based on test scores, with funding variously channeled to or withheld from those systems that are deemed to be underperforming. Policy makers often consider test scores when evaluating the success of policy initiatives, when contemplating reforms, and when making funding decisions. The public is often presented with aggregate test score information via media outlets, ostensibly to be interpreted as indicators of the quality of social institutions such as schools and hospitals, and with information from surveys (tests!) regarding the body politic's views on social, political, and economic issues.
In all of these cases, decisions are directly related to or are influenced by test scores—decisions to place an incoming freshman in Spanish II; decisions to refer a patient for anxiety treatment; decisions by parents on the neighborhood in which to purchase a home; decisions by professional associations to award an advanced credential; decisions to refer a student for special educational services; decisions about which candidate or political position to support; decisions to eliminate funding for an early childhood program. The foregoing is not an endorsement of any specific decision, but a recognition that important decisions are made based on test scores. More than that, however, it is an admonition and encouragement that whenever such consequential decisions are to be made based in whole or in part on test scores, the information provided by those tests should be of the highest quality possible. There should be the greatest confidence that the test data have the meaning they are claimed to have; that is, the data should have validity.
Summary This introductory chapter has laid out some of the main theses of this book that will be developed in the chapters that follow. First, some underlying concepts necessary for fully understanding validity (e.g., test, inference, construct) were described. Then, definitions of validity and validation were proposed and explicated. Validity was described not merely as an esoteric concept of interest only to those within a narrow disciplinary specialty, but as a characteristic of test scores that should concern parents, patients, students, therapists, policy makers, educators, citizens—that is, everyone who takes, uses, or consumes information based on test scores for making important decisions. Along the way, allusions were made to some flaws in current validity theory and controversies related to validity among those within the field of measurement. However, there are substantially more areas of agreement about validity among those within the field than there are areas of disagreement, and the areas of agreement are both profound and important to grasp. The next chapter provides an overview of these broad-based areas of consensus.
2
VALIDITY
The Consensus
One would expect … fierce discussions on the most central question one can ask about psychological measurement, which is the question of validity. It is therefore an extraordinary experience to find that, after proceeding up through the turmoil at every fundamental level of the measurement problem, one reaches this conceptually highest and presumably most difficult level only to find a tranquil surface of relatively widespread consensus. (Borsboom, 2005, p. 149)
Although some aspects of modern validity theory and practice warrant reconsideration or have had some history of controversy, there is much thinking about validity that has broad support. In fact, there are substantially more areas of agreement than disagreement. This chapter summarizes those areas of consensus and provides additional foundation for the fuller explication of validity that is presented in Chapter 4. This chapter begins with an introduction to one source of that consensus: the Standards for Educational and Psychological Testing.
The Long Tradition of Professional Standards Related to Validity As is the case in perhaps every scientific discipline, the thinking regarding key concepts in the field of measurement has evolved. Geisinger (1992) noted that the concept of validity had evolved appreciably over the previous 60 years. Excellent historical accounts of that evolution have been provided by several scholars (see, e.g., Kane, 2006a, Messick, 1989). Somewhat briefer accounts are provided in Lissitz and Samuelson (2007) and Jonson and Plake (1998). The most recent history, by Kane and Bridgeman (2018), provides a highly accessible summary of major developments in validity theory since 1950. A first commonality over the past 70 years is the existence of professional standards applicable to the development and evaluation of the instruments used in education, psychology, credentialing, and numerous related fields. What is broadly considered to be the defining compilation of best practices in assessment began with the development and publication of the Technical Recommendations for Psychological Tests and Diagnostic Techniques by the American Psychological Association (APA, 1954). This set of best
practices is now commonly referred to as the “Joint Standards” (because other large professional associations—namely, the American Educational Research Association and the National Council on Measurement in Education—joined the APA in co-sponsoring the document), or simply as “the Standards.” Indeed, beyond sponsorship, numerous professional associations, testing companies, and other entities whose members routinely develop and use the kinds of instrument covered by the Standards have formally endorsed them. The Standards are now in their sixth edition (AERA, APA, & NCME, 2014). With a new edition published roughly every ten to 12 years, the Standards guide test development practice and evaluation as the only comprehensive, broadly accepted, authoritative compilation of best practices available. The content of the Standards has also evolved. The 1954 edition was first published as a 38-page journal article in the Psychological Bulletin; excluding some introductory material, the first Standards comprised 32 pages of individual standards, organized into six sections: Dissemination of Information, Interpretation, Validity, Reliability, Administration and Scoring, and Scales and Norms. Validity played a leading role in the first edition of the Standards: of the six sections, the section addressing validity was the longest, with its 16 pages of standards comprising half of the volume and listing 66 individual standards. Each of the individual standards listed in the Technical Recommendations for Psychological Tests and Diagnostic Techniques, and the individual standards listed in the second edition, the Standards for Educational and Psychological Tests and Manuals (American Psychological Association, 1966), bore a qualitative label describing the importance of each individual recommendation. The labels included Essential, Very Desirable, and Desirable. Again, the value placed on validity can be seen as quite high, with 46 of the 66 validity standards labeled as Essential, 17 as Very Desirable and only three standards labeled as Desirable. The second edition of the Standards (1966) was only slightly revised from the initial version, expanding from 38 to 40 pages due mostly to the addition of an index. More substantial changes have occurred over subsequent iterations of the Standards; the authors of the original version would likely recognize little in common with the latest (2014) edition. The current edition comprises 230 pages covering 13 diverse topics shown in Table 2.1. The number of individual standards has increased from 66 in the first edition to 240 in the sixth.
Table 2.1 Chapter topics, 2014 Standards for Educational and Psychological Testing

Chapter   Title/topic
1         Validity
2         Reliability/Precision and Errors of Measurement
3         Fairness in Testing
4         Test Design and Development
5         Scores, Scales, Norms, Score Linking, and Cut Scores
6         Test Administration, Scoring, Reporting, and Interpretation
7         Supporting Documentation for Tests
8         The Rights and Responsibilities of Test Takers
9         The Rights and Responsibilities of Test Users
10        Psychological Testing and Assessment
11        Workplace Testing and Credentialing
12        Educational Testing and Assessment
13        Uses of Tests for Program Evaluation, Policy Studies, and Accountability
Despite fairly extensive changes between 1954 and 2014, common themes are evident, particularly regarding the primacy of validity. By various surface-level metrics, validity is recognized as preeminent among the topics covered by the Standards: throughout the six editions of the Standards, validity has not only consistently been the first chapter in each edition, it has also consistently been the longest chapter. In the current edition of the Standards, the chapter on Test Design and Development comprises the same number of individual standards (25) as the validity chapter. Contemporary scholarship on validity has also uniformly recognized the importance of validity in general. Despite some continuing areas of disagreement in the field, there are many specific aspects of validity that are broadly endorsed. In the next section, some of the main points of consensus are listed and described.
Areas of Consensus in Modern Validity Theory If there is any orthodoxy in modern validity theory, it exists around a six-part canon related to key characteristics of validity. In addition to agreement on its importance, there is broad professional agreement concerning many of the chief features of modern validity theory and six salient characteristics are essentially uncontested. These characteristics derive mainly from the works on validity by Cronbach (1971; 1988) and Messick (1988; 1989; 1995). The six tenets are shown in Table 2.2 and described in the following sections.
Table 2.2 Six foundational tenets of contemporary validity theory

1. Validity pertains to test score inferences.
2. Validity is not a characteristic of an instrument.
3. Validity is a unitary concept.
4. Validity is a matter of degree.
5. Validation involves gathering and evaluating evidence bearing on intended test score inferences.
6. Validation is an ongoing endeavor.
Principle 1: Validity Pertains to Test Score Inferences First among the accepted tenets is that validity pertains to the meaning, interpretations, or inferences that are made from test data; i.e., test scores. Because latent traits, abilities, knowledge, and so on cannot be directly observed, the vast majority of the variables studied in social and behavioral sciences must be studied indirectly via the instruments developed to measure them. The indirect measurement is necessarily a proxy for the characteristic of interest, and inference is required whenever it is desired to use the observed measurement as an indication of standing on the unobservable characteristic or construct. As quoted previously in this volume, the pervasive and inevitable act of inference in social science measurement was expressed by Benjamin Wright, who described the nature of inference in the context of achievement testing:

I don't want to know which questions you answered correctly. I want to know how much … you know. I need to leap from what I know and don't want to what I want but can't know. That's called inference. (Wright, 1998)

Wright's description is valuable in itself for many reasons. For one, it highlights the importance, in test development, of careful sampling from the domain of interest and controlled testing conditions in order to support more confident inferences from examinees' observed performances to the construct of interest. Beyond that, however, there is a clear implication of Wright's description for the practice of validation: because validity applies to the intended inferences or interpretations to be made from test scores, it follows that a clear statement of the inference(s) a test is intended to support is necessary in order to engage in validation. The current Standards suggest that "validation logically begins with an explicit statement of the proposed interpretation of test scores" (2014, p. 11), but that suggestion seems too narrow. Rather, it would seem most appropriate to mandate that a clear statement of the intended score inference(s) be formalized at the very beginning of the testing enterprise, as it can (and should!) drive the entirety of the test development, evaluation, score reporting and interpretation, and validation processes.
Principle 2: Validity Is Not a Characteristic of an Instrument A second point of broad consensus in modern validity thought is a corollary to the first. Whereas the first point describes what validity is—i.e., it pertains to test score inferences—the second area of agreement centers on what validity is not. Namely, validity is not a characteristic of an instrument. A test, qua test, has no inherent degree of validity. As stated early on by Cronbach: "One validates, not a test, but an interpretation of data arising from a specified procedure" (1971, p. 447); Messick reaffirmed this point in his influential chapter on validity, writing that "what is validated is not the test or observation device as such but the inferences derived from test scores" (1989, p. 13). The most recent editions of the Standards indicate that "It is incorrect to use the unqualified phrase, 'the validity of the test'" (2014, p. 11), and "it is the interpretations of test scores that are evaluated, not the test itself" (1999, p. 9). The notion was put most succinctly by Shepard: "Validity does not adhere in a test" (1993, p. 406). In order to highlight this important characteristic of validity in both academic writing and in practice, measurement specialists broadly agree that it is inappropriate to say that "Test ABC is valid." Rather, it is correct to say that "Test ABC yields valid data about X" or, more precisely, to say that "There is strong support that scores on Test ABC can be interpreted to mean X."
Principle 3: Validity Is a Unitary Concept Although a widely accepted principle of modern validity is that validity is a unitary concept, that was not always the case. An early view was that there were different “kinds” of validity. The first edition of the Standards explicitly listed content validity, predictive validity, concurrent validity, and construct validity under the heading “Four Types of Validity” (APA, 1954, p. 13). Later, Guion (1980, p. 385) articulated what came to be known as the trinitarian view of validity, so named for three main kinds of validity: content, criterion, and construct validities, with the earlier predictive and concurrent validities combined under the umbrella of criterion validity. Since that time, the number of “kinds” of validities has grown exponentially; a study by Newton and Shaw (2012) identified 119 kinds of validity in their review of 22 journals in education and psychology over the period 2005–2010. As a few examples, among the commonly encountered “kinds” of validity were instructional validity, ecological validity, curricular validity, consequential validity, decision validity, intrinsic validity, and structural validity. The authors note that the list of what they called “validity modifier labels” was purposefully abridged and “could have been substantially longer” (p. 1). They also note that their total was a purposeful undercounting due to the omission of at least 30 kinds of proposed validities that were essentially synonymous with existing list entries. The total has
surely grown since 2010. Ironically, the proliferation of kinds of validity runs counter to the consensus in contemporary thinking on validity, which has moved firmly from the four kinds of validity listed in the first edition of the Standards and the three kinds of validity identified in the trinitarian view to the current notion that there is only one kind of validity: construct validity. The current view of validity was foreshadowed over 30 years ago in the then-current edition of the Standards, which indicated that "these [three] aspects of validity can be discussed independently, but only for convenience" (APA, AERA, & NCME, 1974, p. 26). It follows that, if the three aspects of validity only existed for convenience, then validity must be a singular concept. This conclusion was articulated by Loevinger, who concluded that "since predictive, concurrent, and content validities are all essentially ad hoc, construct validity is the whole of validity from a scientific point of view" (1957, p. 636). Subsequently, the conclusion representing modern validity theory was formalized by Messick, who concisely asserted that "validity is a unitary concept" (1989, p. 13). That phrasing stuck. The modern conceptualization of validity is now commonly referred to (and widely endorsed) as the "unitary view" of validity. In describing the unitary view, Messick indicated that "what is singular in the unified theory is the kind of validity: All validity is of one kind, namely, construct validity" (1998, p. 37). Of note is that, in another work, Messick articulated a slightly but noticeably different position, stating that "construct validity embraces almost all forms of validity evidence" (1989, p. 17, emphasis added). As we will see later in this volume, it has not been possible for construct validity, as currently conceptualized, to embrace all forms of validity evidence, because some forms of evidence do not actually bear on the validity of intended score inferences, but on something else. For now, one thing is certain and represents a clear consensus in the field of measurement: the differing "kinds" of validity have been replaced in contemporary validity theory by differing sources of validity evidence. These sources of evidence will be described shortly. For now, it may be useful to modestly refine Messick's conclusion to best represent the modern consensus. Although it may be convenient shorthand to say that "all validity is construct validity," it is perhaps more accurate to say that "all validity evidence brought to bear when investigating support for an intended inference is evidence bearing on the construct that is the focus of the measurement procedure." This somewhat more nuanced phrasing reflects the contemporary consensus that validity is a unitary concept with diverse sources that can be mined to produce support for the intended inference(s).
Principle 4: Validity Is a Matter of Degree A fourth accepted principle of contemporary validity theory is that judgments about validity are not absolute statements about the presence or absence of a characteristic, but are best
described along a continuum of strength of support for the intended inference. As Messick has indicated: "Validity is a matter of degree, not all or none" (1989, p. 13). Similarly reflecting this consensus, Zumbo has stated that "validity statements are not dichotomous (valid/invalid) but rather are described on a continuum" (2007, p. 50). There are many reasons why conclusions about validity must be stated as a matter of degree. For one, when evidence bearing on the intended score meaning is amassed as part of a validation effort, the evidence is routinely mixed. For another, the evidence varies in how directly it bears on the intended inference and in the weight of support it provides. Further, the body of evidence can often point in different directions with respect to the intended inference; that is, some evidence may lend support to the intended interpretation while other evidence suggests that the intended inference is contraindicated. Even in cases where a test developer may want to make the most definitive statement possible regarding support for an intended score inference, it is typically the case that the most dispositive validity evidence that might be desired cannot be gathered because of time, logistics, or other practical or ethical considerations, and less definitive sources of validation evidence must be mined. Finally, claims about the strength of validity evidence must be tentative and expressed as a matter of degree because of the common (albeit unintended) penchant to seek out only evidence that supports the intended inference—the phenomenon that Cronbach called a "confirmationist" bias (1989, p. 152). In addition to converging on the unitary view of validity, most assessment specialists now also agree that potentially disconfirming evidentiary sources should be as intentionally pursued as sources expected to yield supporting evidence.
Principle 5: Validation Involves Gathering and Evaluating Evidence Bearing on Intended Test Score Inferences The fifth broadly endorsed tenet of modern validity theory is that the practice of validating intended test score inferences involves first collecting a body of evidence from various sources, then integrating, synthesizing, and evaluating that evidence to arrive at a conclusion regarding the extent to which the evidence supports the intended inferences. Kane has observed that "validity is an integrated, or unified, evaluation of the [score] interpretation" (2001, p. 329). As noted previously, these evaluations of score interpretations or inferences do not result in conclusions such as, "Test ABC is valid." Rather, as Messick has noted, validation efforts yield statements representing "integrated evaluative judgment[s] of the degree to which empirical evidence and theoretical rationales support the adequacy and appropriateness of inferences … based on test scores" (1989, p. 13). The nature of the validation process is highly similar to that of a trial in legal contexts. Defendants in court cases are not proven to be innocent or guilty in any absolute sense. Although the decisions
reached as a result of jury deliberations may be a dichotomy (guilty/not guilty), the body of evidence considered by a jury rarely, if ever, points solely toward an unambiguous conclusion. Bodies of evidence gathered by the prosecution (supporting evidence that a crime was committed) and by defense attorneys (disconfirming evidence) are weighed by a judge or the members of a jury for the extent to which the entire body of evidence supports a certain verdict. Case deliberations may take weeks before a verdict is reached, and the dichotomous nature of the eventual outcome belies the continuous nature of the evidence. For example, in legal contexts, the standards used to arrive at a conclusion are along the lines of "the preponderance of evidence" (the standard used in most civil cases), "clear and convincing evidence" (the standard used in some civil cases), "beyond a reasonable doubt" (the standard applicable to most criminal cases), or "substantial evidence" (the standard used in many administrative law proceedings that requires a plaintiff "to provide enough evidence that a reasonable mind could accept as adequate to support a particular conclusion") (Justia, 2019, p. 1). Similar to legal proceedings, the validation process involves gathering and summarizing favorable evidence that the inferences intended to be drawn from scores on a test are supported, as well as disconfirming evidence that may suggest that the intended inferences are not warranted. Thus, as in the legal context, it is rare that a body of evidence gathered in the course of validation uniformly and unequivocally supports one side or another. Once all of the available evidence has been collected and summarized, just as is done by juries, the evidence gathered in the validation process must be evaluated and an overall "verdict" reached as to the likelihood that scores on Test ABC can be interpreted as intended by the test developer. Finally—and again not unlike trials—it must be noted that the process of validation necessarily involves the exercise of judgment. Evidence bearing on intended score inferences must not only be gathered and summarized, it must be weighed and evaluated. Each of these activities involves human judgment. Just as, in jury deliberations, some jurors may place more weight on some pieces of evidence than on others and two jurors presented with the same body of evidence may come to different conclusions, so it is with validation: human values and judgment come into play when ascribing more or less weight to some of the available evidence, and equally qualified specialists could come to different conclusions with respect to whether the body of evidence is sufficient to conclude that scores on a test can be confidently interpreted as intended. As Messick (1975, 1998) has described, meaning and values are inescapably brought to bear when validity judgments are made, rendering simplistic, putatively objective statements about "the validity of a test" impossible.
Principle 6: Validation Is an Ongoing Endeavor The sixth and final point of consensus regarding contemporary validity theory is that
validation is an ongoing enterprise for any test score interpretation. Just as it is incorrect to say that a test is valid, so it is incorrect to say that the validity case for an intended inference is ever closed. There are many reasons why this would be so; developments that could provide additional supporting (or disconfirming) evidence and thus strengthen or question initial conclusions about validity include:

• replications of original validation efforts;
• new applications of the instrument;
• changes in the nature of the examinee population;
• previously unavailable sources of validity evidence;
• evolution in theory and practice related to a construct; and
• other scientific findings from within and beyond the discipline subsuming the construct of interest.
All of these factors may conspire to alter the mix of information that constitutes the empirical evidence and theoretical rationales undergirding the inferences that can be made from test scores. As a consequence, it is certainly conceivable that an altered mix of validity evidence could be cause for rejection of existing conclusions about the validity of test scores in favor of a differing "verdict." Thus, it is incorrect to state that the validity case for a test is ever closed, and on this point there is also broad professional endorsement. For example, Messick has stated that "validity is an evolving property and validation is a continuing process" (1989, p. 13), and Shepard, summarizing "every … treatise on the topic", has suggested that "construct validation is a never-ending process" (1993, p. 407). In conclusion, there is widespread agreement that the most defensible validation process is not a one-time activity, but an ongoing endeavor that can yield continuing support for a test's intended inferences, qualification of those inferences, or discovery that the intended inferences are no longer adequately supported.
Sources of Evidence for the Meaning of Test Scores The preceding sections of this chapter have referred to the long history of professional best practices for assessment, and to areas of broad agreement regarding some fundamental principles in modern validity theory and practice. As could be seen, in rejecting different kinds of validity in favor of a unitary view, contemporary validity theory has incorporated the notion of “sources of evidence” for intended test score inferences. The distinction between “kinds” and “sources” is far from being merely semantic. As was suggested previously, modern validity theory has embraced a unitary view of validity; if all validity is of one “kind” then it does not make sense to talk about different “kinds” of evidence, but of
differing "sources" of evidence that can be mined for support relative to the intended test score inferences. In terms of potential sources of evidence for intended score inferences, there is also a long history and—mostly—broad agreement on the sources. In addition to the six principles described above, the Standards for Educational and Psychological Testing (AERA, APA, & NCME, 2014) describe some consensus sources of evidence that might be gathered. Four such sources—Evidence Based on Test Content, Evidence Based on Response Processes, Evidence Based on Internal Structure, and Evidence Based on Relations to Other Variables—will be described and illustrated in the following subsections of this chapter. At least two questions are relevant to the gathering of validity evidence. For one, as we will see, the answer to the question of which sources of validity evidence should be collected from among the four possible sources depends on the intended inferences. In the following subsections, examples of these pairings of intended inferences and sources of evidence are provided. A second logical and important question is, "How much evidence is enough?" This question will be addressed in Chapter 6.
Evidence Based on Test Content First among the sources of validity evidence identified in the Standards is Evidence Based on Test Content. According to the Standards, “validity evidence can be obtained from an analysis of the relationship between the content of a test and the construct it is intended to measure” (2014, p. 14). Although it would be professionally responsible to continue gathering evidence based on test content over the life of a test, evidence from this source is typically collected at the stage of test development as it provides a foundation for supporting the claim that, when the test is subsequently used, scores on the test can be interpreted to reflect examinees’ standing with respect to the domain or characteristic of interest. It is important to note at this point that the term “content” may have a somewhat narrow connotation. Fitzpatrick (1983) has described a variety of ways in which the term “content validity” can be understood. For purposes of applying the concept of Evidence Based on Test Content to the widest variety of applications, in this chapter, the term is interpreted in the broadest sense. Two examples can illustrate this breadth. As one example, evidence based on test content for a 50-item educational achievement test of fourth grade mathematics problem solving skill might take the form of an analysis where items in the test are reviewed by qualified content specialists for the degree to which each of the items matches an approved mathematics curriculum outline, a statewide adopted set of fourth grade mathematics content standards, or other document that lists the specific knowledge and skills that have been formalized as comprising the domain of interest. In this simple case, the more items in the test judged to match the approved content standards, the stronger the validity evidence based on test content.
Figure 2.1 provides a graphical illustration of this process. The figure shows a hypothetical ten-item test that is intended to cover the ten content standards (sometimes called "objectives" or "targets") shown in the far right column of the figure. Each of the arrows in the figure reflects the judgments of qualified reviewers as to whether each of the test items matches one of the intended content standards. As can be seen in the figure, Items 1 and 3 were judged to match Standard 4.1.3. Item 2 was judged to match Standard 4.1.1, and so on. Such matches (again, likely also depending on the judged strength and completeness of the item-to-standard matches) provide evidence of validity based on test content. However, the figure also illustrates some aspects of the alignment study that weaken confidence in the intended score inferences; these areas of weakness are identified by the shaded cells of the left-hand and right-hand columns. For example, the fact that Item 4 was judged not to match any of the relevant content standards (also indicated by the absence of an arrow from Item 4 to any of the content standards) weakens the evidence for validity based on test content. Further, the fact that no item in the test measures a content standard deemed to be an important part of the domain (the shaded cell corresponding to Standard 4.3.1) weakens the intended claim that scores on the test can be taken as indicators of the extent to which examinees have mastered the body of knowledge and skills represented by the set of content standards.
Figure 2.1 Graphic representation of alignment evaluation

There are certainly designs for gathering evidence based on test content that are superior to others and that provide more rigorous evidence of validity than the design just illustrated. Providing qualified reviewers with the content standard each item is intended to match and asking them to make a dichotomous judgment for each item as to whether it matches the identified content standard would provide fairly weak support. There are many ways in which this design could be improved: for one, it would be a stronger design if the reviewers were not provided with the intended content standards for the items and they were first asked to make judgments about which—if any—of the set of content standards each item matches; for another, instead of a dichotomous judgment about the item-to-standard matches, the
reviewers could be asked to rate (on, say, a 0–5 scale) the strength of the match, with 0 representing no match to the intended content standard and 5 representing a complete or strong linkage. Studies such as those just described are typically referred to as alignment studies (see, e.g., Cizek, Kosh, & Toutkoushian, 2018; Webb, 1997); a brief sketch of how ratings from such a study might be summarized appears after Table 2.3. It is easy to see how the aforementioned evidence would be labeled as "evidence based on test content." However, the term "test content" captures much broader types of evidence and is appropriate to a broader array of testing situations. A second simple example to illustrate the breadth of situations to which Evidence Based on Test Content applies can be seen in the case of a test under development to measure the construct depression. In such a situation, it makes no sense to talk about alignment of proposed items in that test to an approved set of "content standards" for depression. It is, however, appropriate to conceptualize the "content" of the proposed depression measure as that set of behaviors, dispositions, or perceptions that comprise what current theory, clinical practice, or other expert consensus deem to be important attributes of that construct. In this case, those elements—comprising the theoretical or clinical consensus regarding the construct of depression—are analogous to the "approved" content of the test. By extension, evidence based on test content could be gathered by developing items for the proposed depression instrument to measure those elements and by again incorporating a formal review by experts in the field as to the extent to which the items were aligned to the theoretical and clinical consensus regarding the construct. Referring back to the six points of consensus regarding validity, the principle regarding the unitary nature of validity (and "all validity is construct validity") can be seen in the preceding examples. Whether the measurement target is fourth grade mathematics problem solving skill or depression, the evidence gathered related to the content of those instruments is gathered for the purpose of supporting the intended inferences about examinees' standing on those constructs. Finally, there are many more potential sources of evidence to support the meaning of test scores that would fall under the umbrella of Evidence Based on Test Content; those engaged in validation efforts should identify and gather the fullest set of evidence relevant to the intended score interpretations. Table 2.3 lists several illustrative examples: some of the examples listed would be most appropriate for validation efforts related to instruments intended to measure psychological constructs; some would be more appropriate for validation activities related to educational achievement tests (including tests used for competence assessment in licensure and certification contexts); some are relevant to both contexts.
Table 2.3 Examples of sources of evidence based on test content for educational and psychological tests

Potential sources of evidence based on test content
• Domain delineation/identification of knowledge and skills to be tested by subject matter experts (SMEs), curriculum review, or other content analysis
• Construct operationalization by review of relevant theory and/or clinical observation
• Identification of and grounding test development in relevant theoretical dimensions or relationships
• Use of evidence-centered design or assessment engineering principles in test development
• Content/curricular domain studies
• Job analysis, role delineation studies
• Item generation by subject matter experts (SMEs) or automated procedures based on task models developed by SMEs
• Development and review of item keys, scoring rubrics, observation protocols, etc., by qualified SMEs
• Review of developed items/tasks for age, grade, developmental level, reading level, etc., appropriateness by qualified specialists
• Review of items for grammatical and related linguistic concerns by qualified specialists
• Review of developed items/tasks for sensitivity, fairness by qualified specialists
• Alignment studies
• Item/task evaluation based on pilot testing in sample from population of interest
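To make the rating-based alignment design described above a bit more concrete, the following is a minimal sketch, in Python, of how reviewer ratings from such a study might be summarized. It is not drawn from the text; the items, standards, ratings, and the 3.0 flagging threshold are purely hypothetical and serve only to show the kinds of summaries (mean item-to-standard ratings, uncovered standards) that inform evidence based on test content.

# A minimal, hypothetical sketch of summarizing alignment ratings; each reviewer
# rates the strength of each item-to-standard match on the 0-5 scale described above.
from statistics import mean

# ratings[item] = (intended content standard, list of reviewer ratings on the 0-5 scale)
ratings = {
    "Item 1": ("4.1.3", [5, 4, 5]),
    "Item 2": ("4.1.1", [4, 4, 3]),
    "Item 3": ("4.1.3", [5, 5, 4]),
    "Item 4": ("4.2.2", [1, 0, 1]),   # weak match: evidence of possible misalignment
}
blueprint_standards = {"4.1.1", "4.1.3", "4.2.2", "4.3.1"}

for item, (standard, scores) in ratings.items():
    avg = mean(scores)
    flag = "WEAK" if avg < 3.0 else "ok"
    print(f"{item} -> {standard}: mean rating = {avg:.1f} ({flag})")

# Standards in the blueprint that no item was written to measure weaken the claim
# that the test samples the full domain (cf. Standard 4.3.1 in Figure 2.1).
covered = {standard for standard, _ in ratings.values()}
print("Uncovered standards:", sorted(blueprint_standards - covered))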
Rapidly emerging as relevant to the development of evidence based on test content are the related notions of what is called evidence-centered design (ECD) and assessment engineering. At the risk of oversimplification, both of these approaches take the perspective that the gathering of validity evidence is not primarily a post hoc collection of evidence, but a set of design principles and activities pursued throughout the test development and administration process. Both approaches generally begin with explicit statements of a claim that is intended to be made about examinees based on their test performance. This notion of a claim is similar to the central validity concept of the intended inference to be made from scores on a test. In ECD, a claim refers to a statement or conclusion about examinee knowledge, skill, ability, or other characteristics that is reached via the ECD argument, where the argument represents a series of assertions about aspects of the testing process and intended measurement target, as well as about how interactions between examinees and testing materials and procedures should be crafted. The term argument derives from the notion of argument-based validation proposed by Kane (1992). A straightforward example of a validity argument is provided by Kane and illustrated in Table 2.4, which lists the four assertions of an argument related to validating the meaning of scores obtained on an algebra placement test that will be used for assigning first-year college students to either a calculus course or a remedial algebra course, along with possible sources of evidence bearing on each assertion. The intended interpretation of scores on the test is the students' degree of competence in algebra.
Table 2.4 Sample argument-based validation assertions

Assertion 1: Certain algebraic skills are prerequisites for the calculus course; students who lack these skills are likely to have great difficulty in the calculus course.
Possible evidence: Analysis of the content and methods of instruction in the calculus course would identify the specific algebraic skills used in the calculus course and their relative importance for the calculus course.

Assertion 2: The content of the algebra placement test is well matched to the algebraic skills used in the calculus course.
Possible evidence: Analysis of the alignment between the placement test specifications and the domain of algebraic skills used in the calculus course, and analysis of the alignment between the placement test items and the test specifications.

Assertion 3: Scores on the test are generalizable across samples of items, scorers, and occasions.
Possible evidence: Analysis of rater/scorer reliability; equivalent forms reliability analysis; test/retest analysis; generalizability analysis.

Assertion 4: There are no sources of systematic error that would bias the interpretation of the test scores as measures of skill in algebra.
Possible evidence: Analysis of possible sources of bias and the analysis of their impact on scores on the algebra test (e.g., differential student motivation, language load of the algebra test, appropriate testing accommodations not provided, etc.).

Note: Adapted from Kane (1992).
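As a purely illustrative aside, and not part of Kane's framework or any published tooling, an argument of this kind can be thought of as a simple data structure that pairs each assertion with the evidence gathered for it, which makes gaps in a validation effort easy to see. The sketch below, in Python, uses invented class and field names and example content loosely based on Table 2.4.

# A hypothetical sketch of tracking an argument-based validation effort; the class,
# field names, and entries are invented for illustration only.
from dataclasses import dataclass, field

@dataclass
class Assertion:
    claim: str
    evidence: list = field(default_factory=list)  # descriptions of evidence gathered so far
    status: str = "not yet evaluated"              # e.g., "supported", "mixed", "contraindicated"

argument = [
    Assertion("Prerequisite algebraic skills are required for success in the calculus course."),
    Assertion("Placement test content matches the algebraic skills used in the calculus course.",
              evidence=["alignment of items to test specifications",
                        "alignment of specifications to calculus course content"],
              status="supported"),
    Assertion("Scores are generalizable across samples of items, scorers, and occasions."),
    Assertion("No sources of systematic error bias interpretation of scores as algebra skill."),
]

# Flag assertions in the argument that still lack evidence.
for assertion in argument:
    note = "NEEDS EVIDENCE" if not assertion.evidence else f"{len(assertion.evidence)} source(s); {assertion.status}"
    print(f"- {assertion.claim} [{note}]")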
ECD is a logical next step in argument-based validation, but adds distinctive components and substantial elaborations to the process. More specifically, the ECD paradigm comprises three aspects, or what are called models in ECD terminology, that are central to validation:

• The Student or Competency Model. This aspect of ECD specifies the variables (the collection of knowledge, skills, or other attributes) that comprise what the test developer intends to measure (i.e., the intended assessment targets).
• The Evidence Model. This aspect formally specifies the examinee behaviors that would logically (or theoretically) be expected to evidence different levels of the measurement target(s).
• The Task Model. This aspect formalizes specifications for the tasks, items, simulations, or activities that can be developed to elicit the behaviors specified in the Evidence Model.

In addition to the three components described above, the full ECD approach includes an Assembly Model that describes how the competency, evidence, and task models work in conjunction to accurately and comprehensively address the intended targets, and a Presentation Model that describes how the tasks will be presented to examinees (e.g., style, organization, mode of presentation). Both ECD and assessment engineering approaches are increasingly being used in the test development process to provide rich evidence based on test content.
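To make the relationship among these models a bit more tangible, here is a rough, hypothetical sketch of how the three central ECD models might be recorded as structured design documentation during test development. The class names follow the ECD terminology above, but the fields and example content are invented for illustration; they are not part of any ECD specification or software.

# A hypothetical illustration of the three central ECD models as structured design
# documentation; field names and example content are invented, not prescribed by ECD.
from dataclasses import dataclass

@dataclass
class CompetencyModel:
    targets: list                 # knowledge, skills, or attributes the test intends to measure

@dataclass
class EvidenceModel:
    observable_behaviors: list    # behaviors expected to evidence levels of the targets

@dataclass
class TaskModel:
    task_specifications: list     # tasks/items designed to elicit those behaviors

design = (
    CompetencyModel(targets=["clinical judgment in routine patient scenarios"]),
    EvidenceModel(observable_behaviors=[
        "selects the most appropriate action for a presented scenario",
        "ranks possible actions from most to least appropriate",
    ]),
    TaskModel(task_specifications=[
        "text- or video-based situational judgment items with ranked-response options",
    ]),
)

for component in design:
    print(type(component).__name__, vars(component))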
A wealth of background and procedural information on these orientations to test development and delivery can be found elsewhere. Readers interested in greater detail on these approaches should consult Luecht (2013) for information on the principles of assessment engineering; for information on ECD, see Mislevy, Almond, and Lukas (2004); Mislevy and Haertel (2006); Mislevy and Riconscente (2005); and Mislevy, Steinberg, and Almond (2003).

When and Why Is Validity Evidence Based on Test Content Important?

In this section, the Standards category, Evidence Based on Test Content, was broadly defined. Overall, two conclusions seem appropriate. First, of the four potential sources of validity evidence covered in this chapter (i.e., Evidence Based on Test Content, Evidence Based on Internal Structure, Evidence Based on Response Processes, and Evidence Based on Relations to Other Variables), it is hard to imagine a test for which evidence based on test content is not a primary source of evidence and thus essential in any validation effort. Whether based on a curriculum review for an educational achievement test, a job analysis for a credentialing examination, or a review of literature and theory related to a psychological construct being assessed, the content of a test must be grounded in something—and evidence based on test content provides that grounding. Second, the gathering and evaluation of evidence based on test content must infuse the test development process from beginning to end. It is essential that evidence be provided that the test is drawn from and tightly aligned to the construct of interest as operationalized by a set of content standards, theoretical dimensions, or other relevant aspects of the intended measurement target. The clear intention of traditional methods for gathering the kinds of information listed in Table 2.3, as well as the emerging approaches of ECD and assessment engineering, is to provide evidence for the validity of score interpretations that begins with a clear statement of intended inferences, continues through the development of test specifications and items, and extends to score reporting.
Evidence Based on Internal Structure In broad terms, a total score on a test can be thought of as a composite score over a set of test items or tasks, where the items or tasks can be broken into one or more identifiable groupings that are substantively and statistically meaningful. The internal structure of a test refers to how the scores on the individual items are theoretically and empirically related to one another, and the extent to which those relationships are consistent with the proposed interpretation of the test scores. The notion of dimensionality is central to investigation of internal structure. A full explication of dimensionality and methods for investigating it are beyond the scope of this book, but a concise and highly accessible summary is provided by Gessaroli and De Champlain (2005). The dimensionality of a test refers to the number of distinguishable knowledge areas, skills, abilities, or traits that contribute to examinees' performances on a test. A test may be unidimensional if a single factor is hypothesized to underlie examinees'
performance on the test, or it is said to be multidimensional if more than one distinguishable component underlies performance; that is, if the construct is viewed as being composed of unique components. Gessaroli and De Champlain describe dimensionality quantitatively in terms of the local independence of the set of test items; more formally, they define dimensionality conceptually as the degree to which the structure of the matrix of examinees' responses is consistent with the domain(s) hypothesized to underlie their performance (2005, p. 2014). The authors also point out that dimensionality is not only a test development concern related to the items comprising a test; they caution that dimensionality can vary if a test is intended for use across differing examinee populations. Inherent in both the conceptual and quantitative definitions is that the dimensionality of a measure is not merely investigated when responses to test items become available; rather, it should be grounded in theory, logic, and experience with the focal construct(s) to be measured, and explicitly articulated and purposefully addressed throughout the test development process. For example, a test being developed to measure the construct of bullying might be based on current theory that bullying in schools can be manifested either as physical harm toward another student, or as psychological harm (for example, through teasing, cyberbullying, and so on). As another example, a medical licensure test might be developed to assess knowledge of both pharmacology and medical ethics. In each case, the test developers would explicitly seek to develop a measure of the overarching construct by tapping the two dimensions (i.e., physical and psychological bullying, or pharmacology and ethics). Such efforts would be evident in item development in that statements or questions would be developed for the scale that targeted each of the dimensions. Importantly, test developers would not only seek to derive inferences about the overall level of bullying students experience in school (based on total scores on the bullying measure) or about overall health care competence (based on total scores on the licensure test), but also to make inferences about the level of each kind of bullying and about competence within the separate areas of pharmacology and ethics, based on subscores derived from responses to the items intended to measure the separate dimensions. Another example of an explicitly multidimensional measure can be seen in the commonly used frameworks for developing educational tests whereby, for example, a fourth grade mathematics test is developed according to a blueprint that explicitly dictates that the test include items covering the five subareas of number sense, algebra, geometry, measurement, and data analysis and probability. Again, test developers would be seeking to support inferences about overall mathematics proficiency based on total scores, but also to support inferences regarding student mastery of the content in each of the five subareas. Of course, it is not necessary for a test to explicitly address more than one dimension of a construct. A test may be intentionally unidimensional, with all items/tasks developed to focus on a single construct of interest and intending to yield confident inferences based on overall performance across all items in a scale. For example, a test developer may consider the
construct of reading comprehension to be unidimensional when developing a test designed to measure that construct in adults, with no distinctions intended related to comprehension when reading different genres or types of text (e.g., reading for information, reading narrative text, and so on). This characteristic of a test—that is, whether it is developed to measure (only) a single construct or a constellation of related but distinct aspects of an overarching construct—is referred to as the test’s internal structure. Importantly, regardless of whether a test is developed to be uni- or multi-dimensional, evidence must be gathered bearing on the extent to which the intended internal structure was successfully accomplished. Evidence of Internal Structure can be gleaned in many ways. One source of evidence—and one related to Evidence Based on Test Content—is the test development procedures; that is, there should be evidence that items in the measure were purposefully created to measure a single dimension (or multiple dimensions) by those with the expertise to do so. Dimensionality is typically investigated via one of the many statistical methods that have been developed specifically to assess it. Because many tests are developed with the measurement of a single construct in mind, confirmation of unidimensionality is often desired. A detailed review of statistical methods for investigating the unidimensionality of a test is provided by Hattie (1985). Increasingly, many tests have been developed to be (or found to be) multidimensional. A concise and highly accessible introduction to the nature of multidimensionality, with information on specific software available for assessing it, is provided by Gessaroli and De Champlain (2005).
Internal Consistency Analysis
Somewhat rudimentary statistical information that can reflect the internal structure of a test is provided by indices commonly associated with reliability—coefficient alpha, Kuder-Richardson Formula 20 (KR-20), and others. These indices can provide information on the extent to which a test is producing data that are unidimensional. In fact, these indices are sometimes referred to as measures of internal consistency. In essence, internal consistency indices such as coefficient alpha and KR-20 express the degree to which items co-vary as a potential indication that the items in a scale measure some common attribute. They express in a metric ranging from 0.0 (less unidimensional) to 1.0 (more unidimensional) the extent to which total variance in scores on a test can be attributed more to the covariation among items than to unique variation within items. Much like the tuning fork analogy described in Chapter 1, internal consistency indices will be high when all of the “tuning forks” (i.e., items) resonate together in a scale created to be sensitive to levels of a specified construct. Confirming evidence of unidimensionality, if intended, would be values of internal consistency estimates closer to 1.0. The use of internal consistency indices for investigating internal structure is limited, however. For example, a high alpha estimate gives a lower
bound estimate of reliability and some evidence of unidimensionality when applied to unidimensional data. However, if the scale is not unidimensional, it is unclear how to interpret a high alpha estimate as validity evidence.
Subscore Analysis
Another basic approach to assessment of the intended internal structure of a test is, for tests designed to measure multiple dimensions, to evaluate correlations among subscores (see Sinharay, Puhan, & Haberman, 2011). Confirming evidence that the internal structure of a test was multidimensional as intended would begin with examination of total scores on each subscale, comprising items developed to measure each dimension. These subscores, based on the purposefully more homogeneous grouping of items forming a subscale, should correlate strongly and as hypothesized with the total test score. Items within a subscale should, on average, correlate positively and more strongly with total scores for that subscale than with total scores on the other subscales or the overall total test score. Likewise, internal consistency indices should be stronger for the homogeneous groupings of items forming a subscale than would be the internal consistency estimate for a total test based on the comparatively more heterogeneous collection of items comprising the full scale. It should be noted that, because subscales consist of fewer items than a total scale, it would be important to take the smaller number of items in the subarea into account via statistical adjustment (e.g., Spearman-Brown formula) when comparing the internal consistency indices based on (shorter) subscales and (longer) total test scores.
Factor Analytic Methods
More sophisticated procedures also exist for gathering validity evidence that a test is producing results consistent with the intended internal structure of the test. Similar to the kinds of correlational analyses just described, exploratory factor analysis (EFA) can be used to investigate the number of unique, identifiable groupings of items (“factors”) comprising a test. Although EFA procedures would be appropriate for investigating the internal structure of a test, confirmatory factor analysis (CFA) procedures would ordinarily be more appropriate, as CFA requires the data analyst to specify the number of dimensions or factors a priori and, presumably, the number of factors specified would be consistent with the intended internal structure of the test and would be confirmed in the CFA results. There are diverse ways in which EFA and CFA results are presented. One of the most common outputs from factor analysis is called a scree plot, named after the loose rocks that slide down the side of a mountain (the “scree”) forming a sloping mass at the base of the mountain. Scree plots are used with principal components analysis (PCA), which is a type of EFA. Figure 2.2 shows a hypothetical scree plot resulting from analysis of a test intended to be unidimensional. On the y axis are the eigenvalues resulting from the analysis; the
large first eigenvalue associated with the first factor or component of a construct measured by a test, compared to the smaller values for all the other components, provides evidence that the test is measuring primarily a single, dominant construct and that the intended homogeneous internal structure of the test is confirmed. Whereas the results shown in Figure 2.2 would be encouraging and provide confirming evidence based on internal structure for a test intended to be unidimensional (such as the reading comprehension test described earlier), the same pattern of eigenvalues and associated scree plot would provide weak or disconfirming internal structure validity evidence for a test designed to be multidimensional (such as the health professions, bullying, or fourth grade mathematics tests described previously). A scree plot more consistent with a test developed to measure two distinct aspects of a construct is shown in Figure 2.3.
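To make the eigenvalue logic behind a scree plot concrete, the brief sketch below (written in Python with NumPy; the sample size, loadings, and noise levels are entirely hypothetical) simulates responses driven by a single latent trait and extracts the eigenvalues of the inter-item correlation matrix—the values plotted on the y axis of a scree plot.

```python
# Minimal sketch: the eigenvalues behind a scree plot (hypothetical data).
import numpy as np

rng = np.random.default_rng(0)

# Simulate 1,000 examinees responding to 12 items driven by ONE latent trait,
# plus item-specific noise -- i.e., data built to be essentially unidimensional.
n_examinees, n_items = 1000, 12
theta = rng.normal(size=(n_examinees, 1))                 # latent trait
loadings = rng.uniform(0.5, 0.8, size=(1, n_items))       # hypothetical item loadings
scores = theta @ loadings + rng.normal(scale=0.7, size=(n_examinees, n_items))

# The eigenvalues of the inter-item correlation matrix are what a scree plot displays.
corr = np.corrcoef(scores, rowvar=False)
eigenvalues = np.sort(np.linalg.eigvalsh(corr))[::-1]

print(np.round(eigenvalues, 2))
# A single dominant first eigenvalue followed by a sharp drop (the "elbow")
# is the pattern Figure 2.2 depicts for an intentionally unidimensional test.
```

Plotting these eigenvalues against their rank yields the scree plot; a test built to measure two distinct aspects of a construct would instead show two eigenvalues standing apart before the drop, as in Figure 2.3.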
Figure 2.2 Scree plot for hypothetical test intended to measure a single construct
Figure 2.3 Scree plot for hypothetical test intended to measure two distinct aspects of a construct It would also be of interest to view the factor loadings that resulted from the analysis. Factor loadings are quantitative values that express the degree to which the items in a scale are associated with a given factor or component of the test. Supportive evidence based on internal structure would be obtained to the extent that items developed to measure one of the specified components or dimensions of a test were strongly associated with that factor and weakly associated with factors they were not intended to measure. Table 2.5 illustrates output (the varimax rotated factor matrix) from a hypothetical principal axis factoring of a 12-item scale developed to measure three aspects of instructional quality. Shown for each variable (i.e., item) in the scale are the rotated factor loadings, which represent how the variables are weighted on each factor. (These values also represent the correlations between the variables and the factor.) Let us assume that the scale was designed to address three aspects of instructional quality: Behavior Management Skill, Time on Task Orientation, and Higher Order Pedagogy Use, corresponding to Factors 1, 2, and 3, respectively. Overall, the results suggest that Items 1, 2, 3, 4, and 5 measure Factor 1 (Behavior Management); Items 6, 7, 8, 9, and 10 tap Factor 2 (Time on Task Orientation), and Items 11 and 12 measure Higher Order Pedagogy Use. These results would provide evidence in support of the meaning of scores on the scale to the extent that the identified items were, in fact, developed to address those aspects of the construct. Also of note in the results shown in Table 2.5 is some evidence that does not as strongly support the intended score interpretations. For one, if not already planned to be reverse scored, Item 7 would appear to also load on the Behavior Management factor. For
another, Item 9 appears to load strongly on each hypothesized factor, perhaps tapping a general instructional quality dimension and failing to distinctly measure one of the three hypothesized factors.

Table 2.5 Rotated factor matrix for Instructional Quality scale

Variable (item)    Factor 1    Factor 2    Factor 3
1                    .762        .023        .231
2                    .627        .300        .267
3                    .711       −.115        .314
4                    .695        .179        .213
5                    .578       −.231        .198
6                    .306        .739        .006
7                   −.419        .727        .264
8                   −.032        .455        .257
9                    .384        .555        .421
10                   .402        .621       −.136
11                   .098        .210        .688
12                   .204        .115        .752
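The reasoning applied to Table 2.5 can also be made mechanical. The sketch below (hypothetical: it simply hard-codes the loadings from Table 2.5 and uses an illustrative .35 cutoff, since conventions for what counts as a salient loading vary) assigns each item to the factor on which it loads most strongly in absolute value and flags notable secondary loadings—the route by which cross-loading items such as Items 7 and 9 surface.

```python
# Minimal sketch: identifying primary and secondary loadings in a rotated factor matrix.
# Loadings are the hypothetical values from Table 2.5; the .35 cutoff is illustrative only.
import numpy as np

loadings = np.array([
    [ .762,  .023,  .231],   # Item 1
    [ .627,  .300,  .267],   # Item 2
    [ .711, -.115,  .314],   # Item 3
    [ .695,  .179,  .213],   # Item 4
    [ .578, -.231,  .198],   # Item 5
    [ .306,  .739,  .006],   # Item 6
    [-.419,  .727,  .264],   # Item 7
    [-.032,  .455,  .257],   # Item 8
    [ .384,  .555,  .421],   # Item 9
    [ .402,  .621, -.136],   # Item 10
    [ .098,  .210,  .688],   # Item 11
    [ .204,  .115,  .752],   # Item 12
])
factor_names = ["Behavior Management", "Time on Task", "Higher Order Pedagogy"]
THRESHOLD = 0.35   # illustrative cutoff for a "notable" secondary loading

for i, row in enumerate(loadings, start=1):
    primary = int(np.argmax(np.abs(row)))                      # factor with the largest |loading|
    cross = [factor_names[j] for j in range(3)
             if j != primary and abs(row[j]) >= THRESHOLD]     # sizable secondary loadings
    note = f"; also loads on {', '.join(cross)}" if cross else ""
    print(f"Item {i:2d}: primary factor = {factor_names[primary]}{note}")
```

With this illustrative cutoff, Items 7, 9, and 10 are flagged as having secondary loadings worth inspecting; whether such cross-loadings are problematic depends on the intended internal structure and on how any subscores will be reported.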
Whereas procedures to conduct EFA and CFA analyses are common in most comprehensive statistical software packages (e.g., SAS, Systat, SPSS, etc.) and factor analysis modules are available for R, specialized software is also available for conducting dimensionality analyses. A listing of 20 dimensionality assessment software programs was compiled by Deng and Hambleton (2007). A review of some of the frequently used software programs for use with multidimensional data is found in Svetina and Levy (2012); this resource includes annotations regarding the estimation methods, output, limitations, and descriptions of emerging methods with promise for dimensionality analysis. Some of the most commonly used approaches include: • DIMTEST—a procedure developed by Stout and colleagues (1992) that conducts a basic hypothesis test of unidimensionality in a set of dichotomously-scored items; • Poly-DIMTEST (Nandakumar, Yu, Li, & Stout, 1998), which conducts a similar analysis for items that are polytomously scored; and • TESTFACT—can be used to assess uni- or multi-dimensionality with dichotomous or polytomous data (Bock et al., 2008). Readers who are interested in more detailed information on principal components analysis, factor analysis, structural equation modeling and other multivariate procedures that can be used in validation should consult a reference on the specific procedure. Alternatively,
the work of Tabachnick and Fidell (2019) provides an accessible overview of a variety of multivariate methods.
Differential Item Functioning
Differential item functioning (DIF) analyses have been developed to identify test items that may advantage or disadvantage identified subgroups of test takers. DIF analyses typically involve performance comparisons between two subgroups, a subgroup often considered not to be disadvantaged by the items in a test (called the reference group) and a group that may be at risk for potential disadvantage (called the focal group). The most commonly used DIF procedure—a chi-square approach called the Mantel-Haenszel (1959) procedure—involves first creating strata from the reference and focal groups that are matched on overall standing on whatever construct is intended to be measured by a test, where the matching is typically based on total scores obtained on the measure. Then, for each item in a test, analyses of the performances of the reference and focal groups across all strata are performed. The null hypothesis is that the matched groups do not differ in their performances. DIF is shown when groups matched on overall ability perform differently on an item. A thorough summary of this approach to DIF is provided in Zwick (2012); an overview of chi-square and item response theory approaches to DIF is found in Clauser and Mazor (1998). DIF procedures are perhaps most commonly thought of as screening tools for reducing or eliminating biased test items. Of course, items that function differentially in subgroups may not be “biased” in the way that term is often used in non-technical contexts. For example, it might be predicted that items in an achievement test would be flagged if subgroups of examinees were formed that had been instructed in the content covered by the test using differing instructional methods. Clearly, DIF procedures have proved useful during test development for ensuring that the items in a test perform in a similar manner across subpopulations of interest. However, DIF analyses also bear on internal structure. For example, if a test is intended to measure a unidimensional construct that is considered to be similar across subpopulations, DIF analyses can be performed to confirm that assumption. To the extent that DIF is found, it may be an indication that the internal structure of the test is not unidimensional; that is, it may be an indication that, in addition to the intended construct of interest, a dimension of what is being measured by the assessment is group membership, ethnicity, treatment status, gender, or some other characteristic that is not the intended focus of the measurement.
Structural Equation Modeling
Finally, in addition to the basic factor analytic and other methods for investigating the dimensionality of test data mentioned above, structural equation modeling (SEM) provides
another alternative for investigating the components of a test and for providing validity evidence based on internal structure. SEM is a general statistical framework that incorporates as special cases many of the procedures described above. In particular, SEM combines elements of both CFA and Path Analysis, the latter of which is a regression-like technique for examining complex interrelationships among variables. In the context of internal structure, SEM builds on CFA by allowing researchers to specify and test relationships among the subdomains on an assessment. An example is shown in Figure 2.4. Panel A of the figure presents a path diagram corresponding to the three-domain factor model of Instructional Quality represented by the results shown in Table 2.5. The circles represent the hypothesized domains (factors), the squares represent the items (variables), and arrows pointing from the circles to the squares correspond to the dominant factor loadings shown for each item in the table. The double-ended arrows indicate possible correlations among the domains. In Panel A, the direction of the relationships among the domains, or of their relationship to the overall construct, is not specified.
Figure 2.4 Hypothetical examples of structural equation modeling (SEM) analyses Panel B of Figure 2.4 presents an alternative model of internal structure in which the three domains are related hierarchically to the overall construct of Instructional Quality. This model posits that the relationships among the domains can be fully explained by their association with the overall construct. A third possible relationship is shown in Panel C of Figure 2.4. In this case, the domains are related to each other such that domain 2 (Time on Task Orientation) is a predictor of domain 3 (Higher Order Pedagogy Use), and domain 1 (Behavior Management Skill) is a predictor of domain 2 (Time on Task Orientation). One plausible interpretation of this would be that the relationship between behavior management skill and higher order pedagogy use is fully explained by time on task orientation; that is,
educators who better manage student behavior can spend more time on task, and consequently have more time to implement higher order teaching practices. The role of SEM is to allow researchers to specify and test such relationships among the factors that comprise a construct. The different models in Figure 2.4 can be tested against data, or tested against one another, in order to identify the model with the strongest empirical support—that is, SEM facilitates the collection of validity evidence regarding internal structure by examining the hypothesized relationships between identified item groupings and total test scores. SEM also allows for examination of relationships with external variables like group status, or criteria and predictors, making it a general-purpose technique for addressing DIF as well as the types of validity coefficients addressed in subsequent sections of this chapter. Finally, a distinct advantage of using SEM over CFA and other factor analytic or basic correlational approaches is that SEM explicitly accounts for measurement error. A straightforward description of the use of SEM for investigating the dimensionality of a test, as well as an example of dimensionality analysis using the software program LISREL, is provided in Zumbo (2005). In that resource, the author describes the use of SEM to investigate the dimensionality of a depression scale, the Center for Epidemiologic Studies Depression Scale (CES-D; Radloff, 1977). In addition to illustrating the use of SEM to confirm the unidimensionality of that scale, Zumbo describes how additional validity evidence can be obtained using SEM to demonstrate the invariance of the internal structure of a scale across subpopulations of interest (e.g., sex, age).
When and Why Is Validity Evidence Based on Internal Structure Important?
In summary, Evidence Based on Internal Structure addresses key questions related to validity. These questions include: (1) “How many domains or dimensions are there and how are the items related to the dimensions?” These questions are the hallmark of EFA. To estimate the number of dimensions, we can use scree plots. To understand how items are related to the domains, we want to know how highly correlated each item is with each domain score. In a factor analysis framework this is accomplished via factor loadings, although item–total correlations can also be used for this purpose and are more widely understood. (2) “If there is more than one domain, how are the domains related to one another?” (3) “What are the empirical and theoretical relations among subdomains: Are they correlated or uncorrelated? Is there a hierarchy in the sense that the overall construct is measured by the subdomains?” These types of questions can also be addressed via EFA, but they are more convincingly addressed via CFA, which tests how well a theoretical model fits the data, and allows for comparison among different theoretical models. (4) “Is it desired to report scores for the subdomains or only for the overall construct? How
are the former related to the latter? Is the overall construct score the total of all the item scores, or the average of the subdomain scores, or something else?” (5) “How reliable are the reported scores?” This can be addressed via traditional internal consistency reliability measures like alpha, or via more modern approaches based on item response theory or generalizability theory. Overall, when and why is Evidence Based on Internal Structure important? The answer to the first part of the question is: nearly always. Whether a test is explicitly developed to measure a single, unitary construct or to measure a composite construct hypothesized to be comprised of distinct components, it is important that the hypothesized structure be confirmed. Why is this so? As with all questions of validity, the issue involves the intended inferences and interpretations of test scores. It would be misleading and promote inaccurate score interpretations if only a single total score for a test were reported, but performance on the test required distinct areas of knowledge, skill, or other attributes that were not strictly compensatory. That is, for example, medical licenses awarded based on moderate performance on the health professions examination described previously could well result in threats to public safety if candidates with high levels of ethical competence but low levels of pharmacological knowledge were judged to be equally deserving of licensure as candidates with moderate levels of both. The reporting of scores (and setting of performance standards) in this situation should provide information to users on both components. Conversely, if correctly answering items related to informational materials gives the same information about examinees’ reading comprehension as answering items related to narrative texts, then a single score would be warranted—supported by evidence of that unidimensionality. It would be misleading and promote inaccurate interpretations if subscores based on text categories were reported. For example, examinees with low scores in the informational text subarea but higher scores in the narrative text subarea might (erroneously) conclude that they should gain additional practice with informational texts when, in fact, additional practice with any type of text would be equally beneficial. Finally, the internal structure of a test is related to the choice of the psychometric model that will be used to analyze examinees’ responses to a test, such as classical test theory and item response theory (IRT). For example, some IRT models (e.g., the one-parameter or Rasch model; see Andrich & Marais, 2019) assume a unidimensional construct is being measured; others explicitly account for distinct factors underlying examinees’ responses (e.g., multidimensional IRT or MIRT models; see Reckase, 2009). The internal structure of a test is also relevant to the calibration of test items, to the choice of test equating procedures, and to the investigation of the extent to which items perform differentially (DIF) in different subgroups of examinees.
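As a concrete illustration of the Mantel-Haenszel procedure described in the DIF discussion above, the sketch below (Python with NumPy; all data are simulated and the variable names are illustrative) matches reference- and focal-group examinees on total score and accumulates the Mantel-Haenszel common odds ratio for a single studied item. It is a bare-bones teaching sketch under stated assumptions, not an operational DIF program.

```python
# Minimal sketch: Mantel-Haenszel common odds ratio for one studied item.
# Hypothetical data; in practice the strata come from total test scores and
# the analysis is repeated for every item on the test.
import numpy as np

rng = np.random.default_rng(1)

n_ref, n_foc, n_items = 2000, 2000, 40
# Simulate ability and simple correct/incorrect responses (no item-level DIF built in).
ability = np.concatenate([rng.normal(0.0, 1.0, n_ref), rng.normal(-0.3, 1.0, n_foc)])
group = np.array(["ref"] * n_ref + ["foc"] * n_foc)
difficulty = rng.normal(0.0, 1.0, n_items)
prob = 1 / (1 + np.exp(-(ability[:, None] - difficulty[None, :])))
responses = (rng.random(prob.shape) < prob).astype(int)

item = 0                                   # the studied item
total = responses.sum(axis=1)              # matching variable: total score

num = den = 0.0
for score in np.unique(total):             # one stratum per total-score level
    in_stratum = total == score
    ref = in_stratum & (group == "ref")
    foc = in_stratum & (group == "foc")
    a = responses[ref, item].sum()         # reference group, correct
    b = ref.sum() - a                      # reference group, incorrect
    c = responses[foc, item].sum()         # focal group, correct
    d = foc.sum() - c                      # focal group, incorrect
    t = ref.sum() + foc.sum()
    if t > 0:
        num += a * d / t
        den += b * c / t

alpha_mh = num / den
print(f"MH common odds ratio: {alpha_mh:.2f}")   # values near 1.0 suggest no DIF
```

Because the simulation builds in a group difference in overall ability but no item-level DIF, the resulting odds ratio should hover near 1.0; an item that truly functioned differently for matched reference and focal examinees would push the ratio away from 1.0.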
Evidence Based on Response Processes
As with all other potential evidence that might be gathered to support the intended inferences to be made from test scores, the starting point for considering evidence based on response processes is the explicit statement incumbent upon test developers regarding the intended score meaning. In contexts such as licensure and certification testing for professional competence, the primary intended score interpretation is typically the proportion of some domain that an examinee has mastered. In other contexts such as educational testing, the intended score interpretation may be the degree of skill a student has in comprehending written texts. In psychological testing contexts, the intended score interpretation may be the level of a client’s anxiety. As was asserted previously in this chapter, each of these contexts and intended score interpretations relates to a claim that examinees’ responses to test items reveal something about their standing on a construct, and grounding of those claims in content-based evidence is essential. However, in some testing contexts, it is not merely examinees’ responses for what they represent vis-à-vis some domain or construct that is of interest; rather, it is how examinees came to those responses that may be implicitly of interest or explicitly part of the claim that is made (i.e., part of the inference that is intended). As stated in the Standards, “theoretical and empirical analyses of the response processes of test takers can provide evidence concerning the fit between the construct and the detailed nature of the performance or response actually engaged in by the test taker” (AERA, APA, & NCME, 2014, p. 15). Finally, in addition to gaining information to support score inferences based on examinee responses, response process information can be gathered from those engaged in the scoring of examinee responses. Additional supporting validity evidence is gained to the extent that the processes that raters or observers use when evaluating examinees’ responses or performances are found to be consistent with the intended interpretation of the scores. The area of response process analysis as a source of validity evidence is both recent and multifaceted. Comprehensive and current treatments of response process strategies are provided by Ercikan and Pellegrino (2017) and Zumbo and Hubley (2017). Brief descriptions of some common approaches to gathering response process validity evidence are provided below.
What Is a Response Process?
Before considering strategies for gathering validity evidence based on response processes, it seems appropriate to first define that term. Response processes are the behaviors and cognitive processes that examinees who take any kind of test (e.g., achievement tests, surveys, situational judgment tests) engage in when they interact with the directions for taking a test, the test items or tasks, and the manner of indicating their responses. Response
processes refer to the ways in which examinees understand a question or task; the ways they conceptualize the problem or think about the task; and the ways they formulate solutions, develop strategies, and justify their responses. Often a student may record a correct answer to an item, but the inference that he or she engaged in appropriate cognitive strategies to arrive at the response can be conjecture at best. Evidence based on response processes can help determine whether those processes are indeed taking place. “Response processes” can also refer to the cognitive processes engaged in by those who rate or score examinees’ performances or answers to test questions. Validity evidence from response process analyses is gained when those interactions suggest that the processes engaged in by examinees (or raters) are consistent with the expectations of test developers vis-à-vis the construct of interest. Importantly, evidence based on analyses of response processes may or may not be a primary source of evidence in any given testing context: the importance of gathering evidence based on response processes depends on the specific intended test score interpretations. Consider for example a simple test of geography involving the five Great Lakes located in the northern Midwest United States (Lake Huron, Lake Ontario, Lake Michigan, Lake Erie, Lake Superior). Elementary school students are presented with a test that includes the following question: “On the five lines below, write the names of the five Great Lakes.” Now, to answer this question, some students might close their eyes and visualize the Great Lakes shown on the map in their geography textbook. Some students might attempt to recall the names of the lakes based on a trip through the Great Lakes states taken on a family vacation. Some students might invoke a memorized mnemonic device, where the acronym “HOMES” prompts recall by providing the first letter in the name of each of the Great Lakes. Although it is perhaps not totally accurate to state it this way, it is possible that the teacher may not care how the students arrived at the correct list of names, only that they did so. In such a case, the teacher would not be making a claim about any kind of cognitive processes the students might have engaged in; the only intended inference based on scores from the classroom test was that students had mastered the names of the Great Lakes. Now, let us contrast that context with the geometry problem illustrated in Figure 2.5. The sample problem is being considered for inclusion in a geometry test. In this case, a geometry teacher is explicitly interested in how students solve the problem; the intended inference pertains to examinees’ levels of geometric problem solving skill.
Figure 2.5 Sample geometry problem solving test item
In the problem shown in Figure 2.5, examinees are asked to solve for the hypotenuse of the given right triangle. If the intended inference based on scores from a test comprising items like the one shown is that examinees have higher or lower levels of problem solving skill, then it is imperative to ascertain if the items comprising the test actually elicit problem solving. On its face, the item shown in Figure 2.5 would appear to tap problem solving. In addition, the directions indicate that the test takers are to solve the problem, and they direct test takers to use the Pythagorean Theorem. Many test takers are likely to have solved the problem as the test makers had intended via use of the Pythagorean Theorem; that is, by substituting the values for the sides of the right triangle into the formula 3² + 4² = x², and working out the length of the hypotenuse by adding the squares of 3 and 4 and then taking the square root of 25. However, it is possible—indeed, likely—that other test takers may have engaged in the problem differently. For example, some test takers may have taken a visual approach to the problem and reasoned that, if the right triangle is shown in appropriate scale, only options A or B represented plausible values for the missing side. Other test takers may not have even had to apply that minimal reasoning, answering instead from rote memory involving a “3, 4, 5” triangle encountered several times in class. In such cases, a geometry test item intended to measure problem solving did not actually do so. Importantly for purposes of validity, a test comprised
of that and similar problems would yield inaccurate inferences about examinees’ problem solving skills.
How Is Evidence Based on Response Processes Obtained?
In an early measurement textbook, Cronbach provided a straightforward answer to this question:
One of the most valuable ways to understand any test is to administer it individually, requiring the subject to work the problem aloud … The tester learns just what mental processes are used in solving the exercises, and what mental and personality factors cause errors. (1949, p. 54)
Methods for investigating response processes have expanded considerably since Cronbach’s observation. Importantly, although learning “what mental processes are used” remains a central focus of many response process studies, the targets of response process investigations have also expanded. For example, additional validity evidence based on response processes can be obtained by examining the extent to which respondents to achievement test items, survey items, and other stimuli understand the questions themselves in the manner intended by test developers. Response process analyses can also inform the extent to which test takers comprehend the directions that accompany test items or tasks, as well as the way in which they are to record responses to those items or tasks. Clearly, if test directions or the manner of indicating responses is not accurately comprehended by test takers, the inferences intended to be derived from their responses will be threatened. There are many ways of gathering response process information. It is beyond the scope of this book to describe in detail all of the possible procedures. However, an overview of some common procedures is available in Padilla and Benítez (2014). A listing of some of the available options and related sources is provided in Table 2.6, and brief descriptions are provided in the following paragraphs.
Table 2.6 Some procedures and resources for gathering Evidence Based on Response Processes

Procedure                            Example Resources
Focus groups                         Krueger (1994, 1998); Smithson (2007)
Eye tracking                         Foster, Ardoin, & Binder (2018); Rayner (1998)
Think-aloud protocols                Ericsson (2006); Lyons-Thomas (2014); Padilla & Leighton (2017); Leighton (2017)
Cognitive interviews                 Castillo-Diaz & Padilla (2013); Collins (2003); Ericsson & Simon (1993); Sudman, Bradburn, & Schwartz (1996)
Response time                        Zenisky & Baldwin (2006)
Cognitive mapping, concept maps      Kitchin (2001); Novak & Cañas (2008); Plotnick (1997)
Semantic network analysis            Doerfel (1998); Helbig (2006)
“Show your work”                     Every K–12 mathematics teacher (2020)
INTERVIEWS AND FOCUS GROUPS
One extension of Cronbach’s recommendation is the focus group. Although there is variability in definitions of a focus group (Smithson, 2007), in general a focus group can be defined as a small sample of test takers (usually six to twelve participants) drawn from the target population that is assembled to participate in a facilitated or guided discussion on a specific topic. As a potential source of response process validity evidence, the moderated discussion typically includes asking members of the focus group about various aspects of a test under development. For example, focus group participants can provide information regarding their understanding of the (oral or written) directions to an instrument. They can help test developers understand how test takers interpret the questions in a test or scale and help to identify any inappropriate wording, ambiguity, confusion, or lack of clarity in the questions or tasks. They can help researchers understand the meaning ascribed to the response options in a survey (e.g., the frequencies they associate with options, such as Frequently, Often, or Rarely, the intensities associated with Strongly Agree, Agree, Disagree and so on); they can also help confirm that the provided response options completely and accurately capture the range of ways in which respondents experience a phenomenon, engage in an activity, etc. THINK-ALOUD PROTOCOLS AND COGNITIVE INTERVIEWING
Somewhat more intensive and individual-based than focus groups are think-aloud protocols and cognitive interviews. When a think-aloud protocol is used for obtaining response process information, a test taker is asked to simply verbalize his or her cognitions (i.e., to “think out loud”) as he or she encounters, contemplates, and responds to a test item or task (Ericsson, 2006). Importantly, rather than describing his or her thoughts, participants in think-aloud procedures are asked to verbalize exactly what they are thinking as they progress through an item/test (Ericsson, 2006; Leighton, 2017; Lyons-Thomas, 2014). According to Mazor et al.
(2008), advantages of intentionally not asking participants to justify or explain their thoughts include minimizing reactivity and maximizing truthfulness/accuracy of responses. In addition to their use in gaining response process information, think-aloud procedures have been used to investigate potential sources of and reasons for differential item functioning (DIF; see Ercikan et al., 2010). Similar to think-aloud procedures, cognitive interviewing represents a method for gaining information about how test takers encounter and process information in a testing situation. According to the Standards, “questioning test takers from various groups making up the intended test-taking population about their performance strategies or responses to particular items can yield evidence that enriches the definition of the construct” (AERA, APA, & NCME, 2014, p. 15). Questions to be answered in cognitive interviewing might include: (1) What do you think the directions are telling you to do? (2) What do you think this item is measuring? (3) Which of these item formats do you find most directly addresses your knowledge or skill in this content area? (4) What knowledge or experience did you rely on to respond to this item? (5) What things did you consider in choosing (creating) your response to this item? Detailed procedures for conducting cognitive interviews can be found in Ericsson and Simon (1993) and Sudman, Bradburn, and Schwartz (1996). In contrast to simply verbalizing cognitions using think-aloud techniques, participants in cognitive interviewing are asked to elaborate on their cognitions and respond to probes by a trained interviewer based on their verbalizations. According to the theory of test responses underlying cognitive interviewing, test takers engage in an iterative process comprising four, sometimes non-sequential phases. The four phases are listed in Table 2.7 and described in the context of responding to a multiple-choice question.

Table 2.7 Four-phase model of cognitive process underlying cognitive interviewing, multiple-choice item context

Phase    Description
I        Examinees encounter, interpret, and form understandings about the item and its demands.
II       Examinees retrieve information necessary to answer the item.
III      Examinees make judgments that allow them to integrate and evaluate the information retrieved.
IV       Examinees adjust their initial evaluations to the multiple-choice options provided and provide a response.
In addition to this strictly cognitive model, recent research on cognitive interviewing has added motivational, social and cultural dimensions (see Castillo-Diaz & Padilla, 2013). RESPONSE TIME
Analysis of the time that test takers expend responding to test items is both an intuitively
appealing and practical method for gathering evidence based on response processes to support validity claims for a group of test scores. Theoretically, items or tasks that are hypothesized to require greater cognitive demand should require greater time for examinees to process and respond to. In some situations, it might also be hypothesized that item difficulty indices for test items should be negatively correlated with the amount of time examinees spend on those items (Wang & Sireci, 2013). Response time information can also be used both as a strategy to support validity during test development (Zenisky & Baldwin, 2006) and as a source of evidence bearing on the validity of individual test scores when there are questions about examinee motivation (Wise, 2014), guessing (Wise, 2017), or cheating on tests (Boughton, Smith, & Ren, 2016; Liu, Primoli, & Plackner, 2013). In practice, the increasing use of computer-administered or computer-adaptive test delivery models has enabled response time data to be routinely collected and made it a widely available, convenient, and standardized data source for potential analyses related to response process evidence for validity. At the individual examinee level, procedures for using response time data to examine motivation, guessing, and cheating are fairly well established. However, as a method for gathering overall evidence of score validity, there remains work to be done and some researchers have questioned whether response time analysis methods will be able to produce the kind of validity information hoped for (see Li, Banerjee, & Zumbo, 2017). EYE TRACKING
Data derived from the movements of examinees’ eyes while viewing and responding to test items or tasks can be used as evidence of validity based on response processes. According to Rayner (1998), eye tracking (sometimes called “gaze interaction”) methods have a long history in social and behavioral research. Eye tracking is currently widely used in market research to investigate how much attention potential buyers pay to advertising, which portions of web pages they attend to, and so on. In education, eye tracking methods have been used most widely to study the act of reading (see, e.g., Bax, 2013; Foster, Ardoin, & Binder, 2018; Suvorov, 2015). They can be used as a source of response process validity evidence to ascertain which parts of a stimulus, an item, or other materials examinees attend to, how long they attend to those elements, the problem attack/solution paths they follow, and so on. The use of eye tracking data to support the validity of test score interpretations appears to be increasing. Eye movement data as a source of evidence is perhaps not as straightforward as other approaches; compared to other sources of validity evidence based on response processes, it is a more indirect source requiring some degree of inference to interpret. The gathering of eye tracking data involves the use of special technologies designed to follow and record the eye movements of examinees for the purpose of making inferences about the cognitive processes
they employ. The tracking of eye movements is accomplished via a technique called “pupil center corneal reflection” whereby light is directed toward the center of a respondent’s eye and light reflected from the cornea is recorded and measured to gain detailed, precise, and extensive data on gaze direction and duration. Eye movements are typically captured via different tools that have evolved to become unobtrusive and accessible. Among the commonly used tools are stationary devices mounted on computer screens, mobile (wearable) devices such as glasses, and virtual reality headsets. Despite its long history in reading research, eye tracking studies do not appear to be widely used in educational achievement testing, psychological assessment, or credentialing examination contexts. COGNITIVE MAPPING, SEMANTIC NETWORKS
Cognitive mapping, concept mapping, and semantic network analysis share the common goal of understanding how individuals organize and relate objects, events, or concepts within a discipline (see Sowa, 2000, for an overview). Detailed treatment of cognitive mapping and semantic network analysis methods is beyond the scope of this book. However, both methods can provide insights into the cognitions of test takers in support of the claim that the cognitive processes intended to be tapped by a test are actually used by examinees. The variety of cognitive maps typically used in achievement testing contexts is described below; readers are referred to Doerfel (1998) and Helbig (2006) for background and techniques of semantic network analysis, which is a related and often more sophisticated class of procedures for understanding relationships among concepts. In brief, a cognitive map (or, less formally, a “concept map”) is a representation of an individual’s cognitive knowledge about a given concept or set of related concepts (see Novak & Cañas, 2008; Plotnick, 1997). The map is typically a graphical representation that illustrates relationships between concepts via spatial positioning as well as by linking terms and routes. These aspects of a cognitive map are sometimes referred to as “nodes” (circles, boxes, or similar ways of representing concepts) and “links” (words used to express relationships among the nodes). Cognitive maps can be an important source of information that reveals a test taker’s schematic organization underlying his or her approaches and responses to an item or task. Figure 2.6 shows a basic cognitive map that illustrates spatial and relational aspects of concepts related to seasonal variation in sunlight. Figure 2.7 shows a cognitive map that would result from classroom application of concept mapping by a teacher who wished to understand students’ cognitive organization related to changes in matter. The figure shows, on the left, a list of concepts the students were to use to form their maps; on the right is the resulting map illustrating the concept nodes (e.g., “physical changes”, “shape”) and linking terms (e.g., “involves”, “such as”). In both Figures 2.6 and 2.7, the maps provide representative expressions of the examinee’s knowledge of the nodes and the links between them, but they also provide evidence of response processes because they suggest inferences regarding the cognitive processes that underlie examinees’ representational and associative decisions.
Figure 2.6 Sample concept map Source: Retrieved from: http://cmap.ihmc.us/docs/theory-of-concept-maps.php, “The Theory Underlying Concept Maps and How to Construct and Use Them” by Joseph D. Novak and Alberto J. Cañas. Used with permission.
Figure 2.7 Sample classroom concept map Source: Retrieved from: http://web.uvic.ca/~mroth/teaching/445/concS&C3.GIF © Wolff-Michael Roth. Used with permission. “SHOW YOUR WORK”
Finally, and consistent with Cronbach’s straightforward approach to investigation of response processes, is the ubiquitous method that is likely used in nearly every elementary and secondary school mathematics class. A review of teacher-made tests and quizzes in those classes would almost certainly reveal the routine use of three words accompanying the directions: “Show your work.” Those directions are an intentional strategy for obtaining evidence from students regarding the procedures they used to solve problems—that is, evidence based on response processes with respect to the process the teacher intended to be demonstrated. For example, did a student produce a response based on a memorized solution, or did the student engage in critical thinking, creative problem solving, higher-order thinking skills, or a novel solution approach? Thus, use of the method can aid classroom teachers—and test developers during the piloting phase of development—in identifying routine and novel approaches to a problem and appropriate or inappropriate solution strategies used by students, and can help teachers gauge students’ understanding of the task.
When and Why Is Validity Evidence Based on Response Processes Important?
Validity evidence based on response processes is gathered to provide confidence that test
takers’ performances, responses, or products are generated in a manner consistent with the intended inferences to be made from the test results. Response process information is particularly important to gather when an intended inference pertains to the level of cognitive processing test takers are assumed to engage in during a test, or when a cognitive process itself is the construct of interest. In addition, response process information can support broader validity claims to the extent that subgroup analyses reveal that the cognitive processes engaged in do not differ across test takers defined by region, age, sex, language, education level, SES, or other relevant variables. At the beginning of this section, two examples were given, one describing a hypothetical classroom geography test over the names of the Great Lakes, the other a classroom geometry test involving the Pythagorean Theorem. The two examples illustrate that validity evidence based on response processes may not be necessary to gather in some testing contexts; it may be essential in others. The key to understanding the importance of collecting evidence based on response processes is the statement of intended score inferences and the claims that are intended to be supported based on the test scores. For claims such as domain mastery, little or no response process evidence may be needed. If a test purports to provide information on physicians’ “clinical judgment,” students’ “critical thinking skills,” or survey respondents’ “attitudes toward gender equality,” then it is imperative that evidence be gathered to demonstrate that the processes engaged in by test takers support those interpretations. In short, evidence based on response processes is needed whenever an assertion is made or an intended inference is suggested regarding the processes an examinee will engage in to respond to a test item or task.
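One readily automated source of such evidence is the response-time data discussed above. The sketch below (Python with NumPy; the data, the 10%-of-median threshold, and the 0.90 effort benchmark are hypothetical illustrations rather than recommended values) flags unusually fast item responses as likely rapid guesses and summarizes, for each examinee, the proportion of items answered with apparent solution behavior—one simple way response times can inform judgments about motivation and guessing (cf. Wise, 2014, 2017).

```python
# Minimal sketch: flagging rapid-guessing behavior from item response times.
# All values are hypothetical; operational work uses carefully justified thresholds.
import numpy as np

rng = np.random.default_rng(2)

n_examinees, n_items = 500, 30
# Simulate response times (in seconds), roughly log-normal by item, with a small
# share of very fast "guess-like" responses mixed in.
rt = rng.lognormal(mean=np.log(45), sigma=0.5, size=(n_examinees, n_items))
guess_mask = rng.random(rt.shape) < 0.05
rt[guess_mask] = rng.uniform(0.5, 3.0, size=guess_mask.sum())

# Per-item threshold: responses faster than 10% of that item's median time are
# treated as rapid guesses rather than solution behavior.
thresholds = 0.10 * np.median(rt, axis=0)
solution_behavior = rt > thresholds[None, :]

# Proportion of items on which each examinee showed apparent solution behavior.
effort = solution_behavior.mean(axis=1)
print(f"Examinees with effort below 0.90: {(effort < 0.90).sum()}")
```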
Evidence Based on Relationships to Other Variables
Evidence Based on Relationships to Other Variables is perhaps the category of validity evidence that comprises nearly unlimited possibilities. In general, the evidence is obtained by examining relationships between scores yielded by a test of interest and some other variable(s) external to the test. There are well-established kinds of relationships that can be examined, each of which is related to the specific intended test score inferences and the hypotheses about the anticipated relationships; this source also provides the opportunity for creative or novel validation activities suggested by theory about a construct and/or practice in a field. In the following sections, some common strategies for gathering Evidence Based on Relationships to Other Variables will be described, with the recognition that the universe of additional, unique, and potentially potent sources of evidence is vast, and limited only by the scientific and disciplinary insights of the researcher. At the center of many validation activities that examine relationships among variables—though not exclusively—is the descriptive statistical technique of correlation. In the simplest terms, the examination of relationships involves two variables, a predictor (aka,
“independent” or “input”) variable, x, and a criterion (aka “dependent” or “outcome”) variable, y, where data are collected from the same sample of test takers on both variables. In most validation contexts, the total score on a test (or a subscore) functions as the independent or predictor variable and another variable, hypothesized by logic, experience, or theory to be related in a specified way to the predictor, functions as the criterion variable. The qualification in the preceding sentence—that the criterion is hypothesized by logic, experience, or theory to be related in a specified way to the predictor—is significant: whenever the choice is made to investigate relationships among variables for the purpose of validating test score meaning, identification of one or more relevant and powerful criterion variables is challenging and must be done on a scientifically sound basis (as opposed to choosing variables of convenience). Whereas criterion-related validity is no longer regarded as a “kind” of validity evidence, evidence based on the relationship between a predictor variable and a criterion/outcome variable remains a prominent source of evidence to be mined in many validation contexts. Also remaining are the difficulties of identifying appropriate criterion variables. This concern has persisted for nearly 100 years; in fact, so ubiquitous, long-standing, and substantial is the concern that it has acquired its own label, aptly, “the criterion problem” (see Austin & Villanova, 1992). The “criterion problem” refers to the difficulty of identifying a criterion variable that authentically represents the intended outcome and the fact that candidate criterion variables that are easily obtained are often not good approximations of the actual outcomes of interest. Perhaps the most recognizable example of the criterion problem is identifying an appropriate outcome variable when validating scores on an admissions test such as the ACT or SAT (the tests used for admission to U.S. colleges). The intended inference from ACT and SAT scores is the likelihood that a student will experience academic success in college, with higher scores intended to be associated with a greater likelihood of success and lower scores predictive of lower levels of success. In this case, Total ACT Score or Total SAT Score would serve as the predictor variable, x, and—typically—Freshman Year GPA (FYGPA) would be used as a criterion variable, y. The problems associated with this choice of criterion variable, FYGPA, are probably obvious. For one, operationalizing “success” in college as FYGPA seems exceptionally narrow; many other variables beyond FYGPA—variables that are surely more difficult to collect—are related to success. For another, even considering FYGPA to be an acceptable criterion variable, there are surely other variables that contribute to that outcome beyond SAT or ACT scores, such as financial resources, personal goals, study skills, attendance, family and social support, persistence, motivation, choice of course(s), and a myriad of other variables. The criterion problem may be even greater in clinical psychological contexts. For example, it is easy to imagine researchers who are developing a new instrument intended to yield inferences about depression. Whereas the instrument development may have been strongly grounded in current theory of depression (accruing evidence based on test content),
that does not address the concern about a sound criterion with which a relationship could be investigated for validation purposes. One possible criterion may be patient self-reports, but is it certain that patients correctly diagnose their own symptoms as depressive (as opposed to symptoms of anxiety, mood disorders, etc.)? Another possible criterion may be physician or psychologist diagnoses, but again the use of that criterion variable assumes the “correctness” of those diagnoses. Another possible criterion to which the newly developed depression measure could be compared is patients’ scores on an already-existing, professionally accepted measure of depression; of course, the use of the existing measure assumes that it “correctly” diagnoses depression—not to mention the fact that if it does that job well, there is likely no need for the newly developed measure! Setting aside the difficulty of identifying an appropriate criterion, the possibilities abound for obtaining validity evidence based on relationships among variables. The following sections describe some common relationships that can be explored to support a test’s intended inferences.
Concurrent and Predictive Evidence
Concurrent and predictive evidence were once considered to be “kinds” of criterion-related validity. Data from external measures administered at various time points and targeting the same construct of interest can provide strong evidence of validity for a predictor measure. Most often, such evidence is gathered during development of a new instrument or in support of an existing instrument; however, concurrent and predictive validity information can also be investigated over the lifespan of a test’s use. Concurrent validity evidence is obtained when data on two variables gathered at essentially the same time point are analyzed. To obtain concurrent validity evidence, it is common for data on the predictor variable—that is, typically, the total test score on a newly developed instrument, or the measure for which additional validity evidence is desired—to be correlated with data on the criterion variable—that is, another variable such as scores on an existing instrument, where both the predictor and criterion tests have as their target the same construct of interest. Data on both variables are gathered from a sample of examinees who take both tests at the same time. To clarify, it is of course not literally possible for examinees to take two tests at the same time. The aim of gathering concurrent validity evidence using the same sample of test takers who provide responses to both tests, however, is not only to ensure that the same construct is studied, but also to ensure that extraneous factors (such as potential differences in different samples of test takers and differences in performance on the criterion measure that might be attributed to the passage of time) are controlled. Thus, whereas it is not strictly possible for the same group of examinees to take two tests at once, concurrent validity evidence involves administration of the two tests in as close temporal proximity as feasible, or else counterbalancing of the test administrations can be used. That
is, Test A may be administered first and Test B second to one-half of the sample; in the other half of the sample, Test B would be administered first, followed by administration of Test A. An example of concurrent validity evidence involves the two portions of a typical driver’s license test—a written test covering driving laws, traffic signs, and so on, and a road test where a prospective driver actually drives a vehicle and is rated by a qualified examiner. A state’s bureau of motor vehicles is using both measures to target a single construct and support a single intended inference that might be termed safe driving potential, a construct that may be hypothesized to comprise two dimensions, a knowledge dimension and a skill dimension. In this situation, the score on the road test—a test of actual driving skill—would be considered to be the criterion variable. Of interest is whether there is validity evidence supporting a newly developed written test (the predictor variable). If a strong association were present, that would provide support for the claim that the written test tapped the intended construct of safe driving potential; it might also provide support for requiring only the passing of a written test for subsequent license renewals, saving the time, personnel, and fiscal resources—and avoiding the safety concerns—that would exist if all drivers were required to attempt a road test for each license renewal. To gather concurrent validity evidence in this situation, it is common for the two measures to be administered close in time and typically in a fixed sequence, with the written test administered first, followed by—and assuming passing performance—the road test. Total scores on the two measures—perhaps total raw score on a multiple-choice format written test, and driving examiner rating of the applicant on the road test—would be correlated and, as stated previously, the results would be interpreted in light of the hypothesis that underlies the data collection; a minimal computational sketch of this step follows the examples listed below. Namely, it would be hypothesized that, if the written test were in fact tapping the same construct as the road test (i.e., safe driving potential), then scores on the two measures should be strongly and positively correlated. Of numerous possibilities, other examples of concurrent validity evidence would include: • investigations of the relationship between kindergarten readiness test scores (the x variable) and kindergarten teachers’ observations and ratings of students’ readiness (the y, or criterion variable); • investigations of the relationship between final course grades in a medical school curriculum (the x variable) and program directors’ ratings of the medical knowledge of physicians-in-training (the y, or criterion variable); • investigations of the relationship between human resource directors’ evaluations of applicants’ letters of recommendation (the x variable) and interview committee evaluations of applicants (the y, or criterion variable); and • investigations of the relationship between self-reports of team functioning (the x variable) and supervisor evaluations of employee teamwork (the y, or criterion variable).
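To make the computation concrete, the sketch below (Python with NumPy; the scores are simulated stand-ins for the written-test and road-test example, and the score scales are arbitrary) estimates a concurrent validity coefficient as the Pearson correlation between predictor and criterion scores obtained from the same sample of examinees.

```python
# Minimal sketch: estimating a concurrent validity coefficient.
# Simulated stand-ins for the written-test (predictor) and road-test (criterion)
# example; real data from the same sample of examinees would replace them.
import numpy as np

rng = np.random.default_rng(3)

n = 400
true_skill = rng.normal(size=n)                                  # "safe driving potential"
written = 75 + 8 * true_skill + rng.normal(scale=5, size=n)      # written test total score
road = 3.5 + 0.6 * true_skill + rng.normal(scale=0.4, size=n)    # examiner rating of road test

r = np.corrcoef(written, road)[0, 1]
print(f"Concurrent validity coefficient (Pearson r): {r:.2f}")
# A strong positive r would support the claim that the written test taps the
# same construct the road test is assumed to measure.
```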
The gathering and interpretation of predictive validity evidence differ from concurrent validation methods only in that the criterion variable occurs later in time. The aforementioned example involving gathering validity evidence in support of the intended inferences yielded by college admissions tests is typical of a context where predictive validity evidence is sought. (It should be noted that, in this situation, it is highly unlikely that admissions decisions would be based only on the bivariate relationship between scores on an admissions measure and FYGPA. Rather, a host of predictor (i.e., x) variables might be gathered—for example, high school grades, number of extracurricular activities, high school class rank, letters of recommendation, rigor of high school curriculum, applicants' statement of purpose, and so on—and a multiple regression approach would be employed using a constellation of variables to predict college success.) Other examples of contexts in which predictive validity evidence would be gathered include:
• investigations of the relationship between scores on a mechanical aptitude measure (the x variable) and grades in mechanical engineering courses (the y, or criterion variable);
• investigations of the relationship between scores on a Marital Compatibility Scale (the x variable) administered to engaged couples and the subsequent number of years of marriage (the y, or criterion variable);
• investigations of the relationship between scores on a middle-school interest inventory (the x variable) and students' subsequent career choices (the y, or criterion variable);
• investigations of the relationship between scores on a Recidivism Potential Scale (the x variable) and number of subsequent incarcerations (the y, or criterion variable); and
• investigations of the relationship between scores on a Spanish language placement test (the x variable) and students' subsequent grades in their Spanish language courses (the y, or criterion variable).
Finally, there is a well-known problem in predictive validation: in many critical contexts, complete information for a criterion variable is not available—and that data unavailability affects the magnitude of the observed relationship between the predictor and criterion variables. For example, let us consider the relationship between a predictor variable, Medical School Admissions Test score, and a logical, temporally distant criterion variable, patient evaluations of their physicians. The top panel of Figure 2.8 provides an illustration of the relationship between these predictor and criterion variables in a hypothetical population of medical school admissions test takers; the scatterplot shown in the top panel shows a strong, positive relationship between physicians' scores on the admissions test and subsequent evaluations of performance by their patients. The estimated correlation—that is, the estimated predictive validity coefficient—in this case is .86. The direction and magnitude of that correlation would provide strong validity support for inferring likelihood of effective
medical practice from scores on the admissions test. There's just one problem: it would be irresponsible—indeed, dangerous to public health and safety—to admit all applicants to medical school and observe their subsequent patient evaluations! Indeed, the very raison d'être of such admissions tests is to admit for medical training only prospective physicians who have a high likelihood of eventually becoming safe and effective practitioners. Thus, medical schools implement selective admissions procedures. Commonly, the procedures include admitting only candidates whose admissions test scores exceed a certain threshold. The bottom panel of Figure 2.8 includes a dashed vertical line representing a cut score of 510 on the medical school admissions test used to select applicants for admission. Because criterion data are only available for candidates who were actually admitted to medical school, the only data upon which to calculate the correlation come from admitted students, and the corresponding scatterplot reveals both the noticeable restriction in range of values on the predictor variable (i.e., the admissions test scores) between 510 and 530 and the markedly weaker linear association between the predictor and criterion variables. When the correlation is computed again, this time based only on the available data, the estimated correlation (i.e., the estimated predictive validity coefficient) for the situation illustrated in the lower panel of Figure 2.8 drops to .39. This commonly observed phenomenon is sometimes referred to as the selection paradox. That is, a test that is strongly related to an important criterion in a population may appear to be only weakly or modestly related to the criterion when it is actually used for selection purposes, because of the restriction in range that selection produces. A real-data example of the selection paradox is provided in Sireci and Talento-Miller (2006).
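The phenomenon, and the kind of adjustment described next, can be illustrated with a short simulation. The following Python sketch is not drawn from the chapter; the cut score, sample size, and variable names are invented. It generates a population in which the predictor–criterion correlation is .86, retains only cases above a cut on the predictor, and then applies a correction of the same algebraic form as the Thorndike (1949) formula presented below, written here in terms of the predictor's unrestricted and restricted standard deviations because selection in the simulation is made directly on the predictor.

```python
# Minimal simulation (invented numbers): the "selection paradox" and a
# range-restriction correction of the same form as Thorndike (1949).
import numpy as np

rng = np.random.default_rng(0)
rho, n = 0.86, 100_000      # population correlation (as in the top panel) and sample size

# Simulate a population of (admissions score, patient evaluation) pairs.
cov = [[1.0, rho], [rho, 1.0]]
x, y = rng.multivariate_normal([0.0, 0.0], cov, size=n).T
print(f"population r = {np.corrcoef(x, y)[0, 1]:.2f}")   # approximately .86

# Keep only applicants above a cut score on the predictor, mimicking
# selective admissions; criterion data exist only for this group.
selected = x >= 1.0
r_restricted = np.corrcoef(x[selected], y[selected])[0, 1]
print(f"restricted r = {r_restricted:.2f}")              # noticeably smaller

# Correction using the ratio of unrestricted to restricted predictor SDs.
ratio = x.std() / x[selected].std()
r_corrected = (r_restricted * ratio) / np.sqrt(
    1 - r_restricted**2 + (r_restricted**2) * (ratio**2)
)
print(f"corrected r  = {r_corrected:.2f}")               # close to .86 again
```

The corrected value recovers, approximately, the population correlation, which is the purpose of the correction formula discussed next.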
Figure 2.8 Illustration of restriction in range on predictor variable

There are ways to address the problem of restriction in range. The most common approach is to estimate the variability in the population on the criterion variable and use a well-known correction formula proposed by Thorndike (1949):

$$\rho_{xy} = \frac{r_{xy}\,(\sigma_y / s_y)}{\sqrt{1 - r_{xy}^{2} + r_{xy}^{2}\,(\sigma_y / s_y)^{2}}}$$
where:
ρxy is the desired estimated relationship between the predictor variable, x, and the criterion variable, y, in the population;
rxy is the observed relationship between the predictor variable, x, and the criterion variable, y, in the sample (i.e., after selection, affected by the restriction in range);
sy is the observed standard deviation on the criterion variable; and
σy is an estimate of the population (i.e., unrestricted) standard deviation.
A number of other approaches to the restriction in range problem exist. Detailed summaries are provided in Sackett and Yang (2000) and Wiberg and Sundström (2009).

Convergent and Discriminant Evidence

Consistent with the principle stated previously that analyses of relationships among variables in validation work should be explicitly driven by hypothesized associations based on logic, experience, or theory, the analyses of convergent and discriminant relationships can also provide valuable evidence in support of intended score meaning. The ideas behind convergent and discriminant approaches for obtaining validity evidence were proposed by Campbell and Fiske (1959). The current Standards describe convergent validity evidence as the "relationships between test scores and other measures intended to assess the same or similar constructs" and discriminant validity evidence as "relationships between test scores and measures of purportedly different constructs" (AERA, APA, & NCME, 2014, p. 17).
To illustrate these sources of validity evidence, let us imagine that a researcher is developing a new measure of depression, which we might call Test X. The researcher wishes to establish support for the intended inference that higher scores on Test X indicate higher levels of depression, and lower scores on Test X indicate lower levels of depression. The researcher has taken care to define and operationalize for measurement the construct of depression according to current theory and clinical experience with the construct and its manifestations (i.e., Evidence Based on Test Content). To gather convergent validity evidence, the researcher could administer Test X along with one or more other measures intended also to measure depression as the researcher has defined it or in a similar manner. For example, in a representative sample from the population in which the researcher intends for Test X to be used, the researcher might administer both Test X and the Beck Depression Inventory-II (BDI-II; Beck, Steer, & Brown, 1996) or another instrument specifically developed to yield inferences about depression. A strong, positive correlation between test takers' scores on the two measures would provide support for the claim that Test X also measures the construct of depression as measured by the BDI-II (or, more precisely, there would be support for the claim that the two measures, Test X and
the BDI-II measure the same construct). This type of validity evidence is called “convergent” because two measures, developed to yield inferences about the same construct, should produce scores that are consistent with each other; i.e., they “converge” on the same construct and support the same intended inferences. Although typically presented as a type of reliability evidence, a special case of convergent validity evidence is obtained when two measures of the same construct are produced by independent raters, as in the context where the ratings of two (or more) scorers who evaluate responses to an essay prompt are compared to a previously scored benchmark response; when judges’ ratings of performances are compared to criterion scores for the performances; or when comparing scores generated by qualified human raters to those generated by a computerized, automated scoring algorithm. Strong relationships among these pairs of data points, sometimes referred to as scoring validity, can provide support that the scores obtained can be interpreted as intended. To gather discriminant validity evidence, the researcher could administer Test X along with one or more other measures intended to measure constructs different from depression. The strongest discriminant validity evidence would be obtained by examining the relationship between scores on Test X and scores from another instrument that measures a construct similar to, but distinct from depression (and, again, administered to a sample from the same population as close in time as practical). The construct measured by the other instrument would be one identified from research and experience with depression as one frequently misdiagnosed, comorbid with depression, or exhibiting similar symptoms. For example, the researcher might administer both Test X and the Profile of Mood States-II (POMS-II; Heuchert & McNair, 2012) or the Anxiety Disorders Interview Schedule (ADIS; Brown, DiNardo, & Barlow, 1994). Weaker correlations than were obtained from the convergent analysis would provide support for the claim that Test X does not measure mood or anxiety and is able to discriminate between those constructs and depression. Ideally, the convergent and discriminant validity coefficients would be of magnitudes suggested in prior research, theory, or clinical practice. Combining Convergent and Discriminant Evidence As was just foreshadowed, convergent and discriminant validity coefficients can be systematically compared to provide a fuller picture of the body of validity evidence based on relationships among variables. Such an approach was described by Campbell and Fiske (1959) who introduced the idea of creating a “multi-trait, multi-method” (MTMM) matrix of relationships among various measures. The MTMM matrix comprises not only convergent and discriminant evidence, but other associations as well. Figure 2.9 shows a hypothetical MTMM matrix for an investigation of validity evidence for the hypothetical new measure of depression, Test X, alluded to in the previous section.
Figure 2.9 Hypothetical multi-trait, multi-method matrix

Highlighted in Figure 2.9 are various relationships among variables following the notational conventions in Campbell and Fiske (1959). The hypothetical relationships illustrated portray correlations obtained (as above) among three traits: Trait A represents depression, the target construct of the newly developed Test X; Trait B represents anxiety; and Trait C is mood. The three traits were measured in samples from the target population of interest using three methods: Method 1 is the use of a written measure (e.g., multiple-choice, Likert format, etc.); Method 2 is clinical observation; and Method 3 represents client self-reports from standardized interviews. For illustration, let us imagine that Test X is a measure of depression (i.e., Trait A) that uses Likert-scaled item formats (i.e., Method 1). Thus, data obtained for Test X are represented by the combination of indicators, A1. The entries in the MTMM matrix shown in Figure 2.9 show the relationships among these variables (i.e., constructs) in different ways. First, the entries shown in parentheses on the main diagonal are the reliability estimates for the measures. That is, the relationship between A1 and A1—the reliability estimate for our test of interest, Test X—is shown as .91. This relationship may have been obtained via Cronbach's alpha, test–retest correlation, or another
method for estimating the reliability of scores on the newly developed Test X. The other reliability estimates on the main diagonal are similarly high; as one would expect, scores on a test cannot correlate more strongly with any other variable than they do with themselves. For example, the reliability estimate for the clinical observations of anxiety is .89; the reliability of the self-reports of mood is .95, and so on. Six groups of correlations are enclosed in dashed lines. Using the terminology of Campbell and Fiske (1959), these are referred to as hetero-trait, hetero-method correlations. That is, the values in the dashed triangles are the result of correlating scores on measures of different constructs using different methods of measurement. For example, the relationship between the written test scores (Method 1) on Trait A (depression) and clinical observation ratings (Method 2) of Trait B (anxiety) is .17. Using Campbell and Fiske’s notation, these relationships would be represented as A1 and B2, respectively. The relationship between the written test scores (Method 1) on Trait A (depression) and clinical observation ratings (Method 2) of Trait C (mood), that is, the relationship between A1 and C2, is .12, and so on. Three groups of correlations are enclosed in solid lines. Again using the terminology of Campbell and Fiske (1959), these are referred to as hetero-trait, mono-method correlations. That is, the values in the solid triangles are the result of correlating scores on measures of different constructs using the same method. For example, the relationship between the written test scores (Method 1) on Trait A (depression) and written test scores (Method 1) on Trait B (anxiety) is .52; the relationship between the written test scores on Trait A and written test scores on Trait C (mood) is .42, and so on. These correlations illustrate what is sometimes referred to as “method variance” (see Brannick et al., 2010). Method variance refers to the association between scores on two measures that results from commonality in measurement procedure and not the construct that is the focus of the measures. Finally, the two lower diagonals (i.e., not in parentheses) are what are referred to as monotrait, hetero-method correlations. That is, the values along these diagonals are the result of correlating scores on measures of the same constructs using different methods of measurement. For example, the relationship between the written test scores (Method 1) on Trait A (depression) and clinical observation ratings of depression (A2) is .65; the relationship between the written test scores on the anxiety measure (B1) and clinical observation ratings of anxiety (B2) is .59, and so on. Not only the researcher’s selection of instruments and methods, but also the interpretation of this constellation of relationships should be guided by logic, theoretical expectations, and clinical experience. Importantly, in order to provide strong validity evidence for the newly developed test of depression, Test X—or in any validation context using a MTMM approach —the relative magnitudes of the associations should follow a predictable pattern: • as indicated previously, the strongest correlations will be those obtained when scores on
an instrument are correlated with themselves; that is, the reliability coefficients;
• the next strongest relationships should be those obtained when scores yielded by a test intending to measure one construct are correlated with scores yielded by another test targeting the same construct (i.e., the mono-trait, hetero-method correlations), regardless of the measurement method(s) used by the two tests;
• weaker relationships should be observed among scores from tests measuring different constructs, but using the same measurement approaches (i.e., the hetero-trait, mono-method correlations); and
• the weakest relationships should be observed among scores obtained using different methods and from tests targeting different constructs (i.e., the hetero-trait, hetero-method correlations).
Finally, although it has the potential to provide strong validity evidence based on relationships among variables, the MTMM approach does not appear to be used frequently in validation efforts. Likely, that is due to the burden placed on a single set (or randomly equivalent samples) of examinees. In the relatively simple hypothetical examples shown in Figure 2.9, examinees would be required to take a total of nine assessments. As will be discussed in Chapter 6, the desire to compile the strongest possible case of support for the intended meaning of scores yielded by an instrument must often be balanced with other factors such as time, cost, burden, reasonable alternatives, and others.

When and Why Is Validity Evidence Based on Relations to Other Variables Important?

In nearly every measurement context, it is possible—indeed, quite easy when those engaged in the validation effort have familiarity with the theory and research, and experience with the construct under study—to identify variables external to the test that can be hypothesized to be related to scores on the test in predictable ways. Regrettably, what might be called "the usual suspects" are often chosen by default, without regard to theory-based expectations. These would include easily collected demographic variables such as gender, ethnicity, age, grade level, income, or other such variables. On the one hand, in many cases these variables may be theoretically relevant and may be a source of strong validity evidence. On the other hand, it may be that there are no strong logical or theoretical rationales for their inclusion in a validation effort. For example, when developing a measure of a psychological construct where theory and the consistent body of research evidence show females to have, on average, a higher standing on the construct, it would be a source of disconfirming validity evidence if a validation effort for a newly developed measure revealed that scores on the measure were strongly related to being male. Whereas socioeconomic status (SES) may be a relevant variable for inclusion when validating scores on a newly developed expressive language measure (under the hypothesis that expressive language is greater in persons with greater
access to language-based materials such as magazines, e-books, and so on), it may be only of minimal interest or relevance for inclusion when validating scores yielded by an instrument designed to measure creativity. This would be the case under the hypothesis that creativity is unrelated to SES, although if relationships among those variables were examined and no association was found, that finding of no association could serve as supporting validity evidence. As stated previously, the variables chosen for examination when gathering validity evidence based on relationships among variables should be hypothesis-driven, and chosen for their logical, theoretical, or practical value. Across the fields of psychology, education, and credentialing, identification of relevant external variables is limited only by the theoretical and logical savvy of those engaged in the validation effort. Grades in other subject areas, scores on tests of the same or related constructs, supervisors’ ratings, political party affiliation, years of experience, course grades, hours of professional development experience, program director evaluations, letters of recommendation, self-reports, number of at-fault accidents on the job, standardized observations of playground behavior, region of the U.S., parental education, religious preference: the list of potential variables external to scores on the test being validated is essentially unlimited. In short, validity evidence based on relationships to other variables—even if only variables internal to a test such as scores on subscales—should be collected as part of nearly every validation endeavor. Such evidence can provide strong, theory-grounded support for intended test score interpretations.
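As a final illustration for this section, the brief sketch below is not part of the original presentation; it simply takes the MTMM coefficient values quoted earlier in the discussion of Figure 2.9 and checks whether they follow the expected ordering of relationships. The grouping labels and the strict comparison rule are illustrative assumptions.

```python
# Minimal sketch (values quoted in the discussion of Figure 2.9; grouping
# and comparison rule are illustrative): checking the expected ordering
# of MTMM coefficients, from strongest to weakest class.
classes_in_expected_order = [
    ("reliability (main diagonal)",  [0.91, 0.89, 0.95]),
    ("mono-trait, hetero-method",    [0.65, 0.59]),
    ("hetero-trait, mono-method",    [0.52, 0.42]),
    ("hetero-trait, hetero-method",  [0.17, 0.12]),
]

# Strict check: every coefficient in a class should exceed every
# coefficient in the class expected to be weaker.
for (name_hi, vals_hi), (name_lo, vals_lo) in zip(
    classes_in_expected_order, classes_in_expected_order[1:]
):
    ok = min(vals_hi) > max(vals_lo)
    print(f"{name_hi} > {name_lo}: {'supported' if ok else 'NOT supported'}")
```

In practice, of course, judgments about whether an observed MTMM pattern supports the intended score meaning would rest on the theoretical expectations described above rather than on a mechanical rule.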
Conclusions

This chapter described six areas of consensus in modern validity theory. Overall, there are many areas of consensus—far more areas of agreement than areas where measurement specialists disagree. The main points of agreement include that:
• validity pertains to test score inferences;
• validity is not a characteristic of an instrument;
• validity is a unitary concept;
• validity is a matter of degree;
• validation involves gathering and evaluating evidence bearing on intended test score inferences; and
• validation is an ongoing endeavor.
In addition to consensus on these fundamental principles of modern validity theory, there is broad agreement regarding four major sources of evidence that can be searched out for potential support for the inferences that are intended to be made from scores on a test:
Evidence Based on Test Content, Evidence Based on Response Processes, Evidence Based on Internal Structure, and Evidence Based on Relationships among Variables. Although the specific potential sources of validity evidence within these categories are myriad, the basic principle of modern validity theory still applies: all sources of validity evidence ultimately bear on the level of confidence that is possible regarding the interpretations of scores on a test with respect to examinees’ standing on the construct of interest. Or, in common shorthand usage: all validity evidence is construct validity evidence. And, although not described in this chapter, there is broad agreement that evidence of the stability, consistency, reproducibility of scores yielded by a test—that is, evidence of reliability—is a necessary prerequisite to undertaking validity evidence gathering. It makes no sense to speak of the meaning of scores that are not firstly dependable indicators of something! Finally, one comparatively minor point of disagreement exists regarding a fifth potential source of validity evidence identified in the Standards for Educational and Psychological Testing; that source is “Evidence Based on Consequences of Testing” (AERA, APA, & NCME, 1999, p. 16) or “Evidence for Validity and Consequences of Testing” (AERA, APA, & NCME, 2014, p. 19). Evidence based on consequences of testing has long been controversial as a possible source of validity evidence. Although there are situations in which such evidence might bear on validity, in the vast majority of cases the term “consequential validity” is misused to refer not to evidence bearing on score meaning (i.e., validity) but to evidence supporting or militating against the use of a test—regardless of the wealth (or lack) of support for the intended meaning of its scores. In noticeable understatement, the 1999 edition of the Standards acknowledged this controversy, observing that disagreement about the incorporation of the intended and unintended consequences of test use into the concept of validity is “an issue receiving attention in recent years” and suggesting that “it is important to distinguish between evidence that is directly relevant to validity and evidence that may inform decisions about social policy but falls outside the realm of validity” (p. 16, emphasis added). In the next chapter, the roots and dimensions of the controversy over consequences will be briefly reviewed. As will be shown, although it is not possible for consequences of testing as typically imagined to bear on the intended interpretations of test scores—that is, on validity —consequences play an important role in a comprehensive approach to defensible test score meaning and use. That role of consequences will be more fully explored in Chapter 5. For now, we turn to tracing the controversy of consequences, the reasons why it cannot ordinarily be considered to be validity evidence, and the infrequent but possible situations in which it may have potential to shed light on score meaning.
3 VALIDITY AND THE CONTROVERSY OF CONSEQUENCES
The matrix was a mistake. (Shepard, 1997, p. 6)
Although broad agreement exists about the importance of validity and major tenets of modern validity theory, disagreement exists regarding the definition and boundaries of the concept, and regarding the sources of validity evidence that are desirable or necessary for sustaining defensible inferences. This chapter focuses on perhaps the most controversial aspect of modern validity theory: the role of consequences. It will be shown that, in nearly all situations where they are referenced, consequences of testing do not—indeed, cannot—bear on validity at all. Rather, consequences of testing are a distinct and important concern in their own right that must be accounted for in a comprehensive approach to defensible testing practice.
Roots of the Problem of Consequences in Validity Theory Almost immediately following the publication of Messick’s (1989) treatise on validity, a novel aspect of validity he introduced prompted debate about precisely how the concept of validity is circumscribed. In a simple 2 x 2 matrix, Messick presented what he referred to as four facets of validity. The matrix comprised four cells at the intersections of “test interpretation” and “test use” on one dimension and “evidential and consequential bases” on the other. Whereas some of the facets (i.e., cells) captured non-controversial aspects of validity (e.g., construct validity), the facet at the intersection labeled “consequential basis of test use” has proven to be controversial. As Messick defined it, the consequential use facet of validity requires “the appraisal of both potential and actual social consequences of applied testing” (1989, p. 20). The facet has come to be referred to by the shorthand, consequential validity. It is important to clarify that Messick himself did not actually use the term consequential validity in his 1989 chapter, but referred instead to “the consequential basis for test use.” Regardless, the concept is widely attributed to him as derivative from that influential work and the term consequential validity has permeated the literature in assessment, education, and testing policy.
Adding to confusion about the origins of consequential validity is that Messick, in some writings, appears to embrace the incorporation of consequences into a theory of validity, while in other places he appears to reject incorporation. As an example of the former, one of his later writings focuses singly on supporting the notion that “empirical consequences of test interpretation and use constitute validity evidence” (1998, p. 35). As an example of the latter, Messick argued elsewhere that consequential validity applies only to situations in which “any negative impact on individuals or groups [derives from a] source of test invalidity, such as construct underrepresentation or construct irrelevant variance” (1995, p. 746). In the end, it is not necessary for current purposes to definitively determine the heritage of what has come to be called “consequential validity”, nor is the purpose of this chapter to impugn Messick for incorporating consequences into validity theory. Regardless of its origins, “Evidence Based on Test Consequences” became firmly entrenched as a source of validity evidence with its inclusion in the 1999 edition of the Standards for Educational and Psychological Testing (AERA, APA, & NCME, 1999, p. 16). Although incorporation into the Standards may have heightened the controversy surrounding consequences of testing as a source of validity evidence, many testing specialists called attention to the error well before publication of the 2 x 2 matrix in Messick’s (1989) chapter and the 1999 Standards. As early as 1966, Tenopyr commented that “to speak of ‘consequential validity’ is a perversion of the scientific underpinnings of measurement” (p. 14). In the 30 years since the formal introduction of consequential validity, tumult surrounding the concept has persisted. Referring to the controversy about the concept of consequential validity, some validity theorists have described the situation in measured terms. For example, one has noted that “consensus has not been achieved on what the role of consequences in validation should be” (Kane, 2001, p. 328), and another observing that “the [consequential validity] movement has been met with some resistance” (Zumbo, 2007, p. 51). Others have phrased the situation more starkly: “The most contentious topic in validity is the role of consequences” (Brennan, 2006, p. 8). Its contentiousness may be due to many factors. At minimum, the scientific rationale underlying the concept of consequential validity remains murky and professional practice employing the concept remains rare. The following sections detail the theoretical and practical problems associated with the concept.
What Is Consequential Validity Anyway? When attempting to unpack the concept of consequential validity, one of the first problems encountered is its initial explication. Messick’s (1989) chapter on validity, as might be expected of any meaty manifesto, has been the subject of interpretation and reinterpretation, although in some measure the attention may be due to the complexity of Messick’s original formulation. One commentator attempting to critique the chapter observed that “questioning
Messick’s theory of validity is akin to carving a Thanksgiving armadillo. Each point in his exposition is articulated through quotations which deflect concerns toward cited authors” (Markus, 1998, p. 7). Other theorists have attempted to deconstruct Messick’s formulation (see Moss, 1998; Shepard, 1997) and Messick’s own subsequent work (1995) attempted to elaborate the theoretical place of consequences by translating the four facets of validity into six aspects of validity. To date, these efforts have failed to clarify the original theoretical formulation, to extend it in any appreciable way, or to diminish the controversy. Neither Messick nor any advocates for the concept of consequential validity have provided a concrete definition of that concept. Nonetheless, consequential validity has taken root among testing policy makers, educational researchers, some measurement specialists, and others. It was perhaps predictable that, if the consequences camel were permitted to poke its nose inside, it would only be a matter of time before it would claim ownership of the whole tent. Shepard’s (1997) description of the circumstances under which she believes test consequences must be considered in validation efforts is telling: According to Shepard: “It is possible to appraise the construct validity of a test without considering test use so long as no use is intended” (p. 6, emphasis added). The sentiment expressed in that statement implies a privileged place for consequential validity above other potential sources of validity evidence. In another source, a privileged status for consequential evidence is plainly asserted. Following shortly after Messick’s introduction of the concept, a document entitled Standards for the Assessment of Reading and Writing produced by the International Reading Association and the National Council of Teachers of English asserted that “The consequences of an assessment procedure are the first and most important consideration in establishing the validity of the assessment” (1994, p. 17, emphasis added). Shortly thereafter, evidence about consequential validity was inserted as the only named, required source of validity evidence for state student achievement testing programs under the No Child Left Behind Act (NCLB; 2002) legislation. According to the Peer Review Guidance issued for mandated statewide student achievement testing programs under NCLB: “In validating an assessment, the State must also consider the consequences of its interpretation and use” (U.S. Department of Education, 2007, p. 39). Interestingly, under recent reauthorization of the Elementary and Secondary Education Act called the Every Student Succeeds Act (ESSA; 2015), the specific requirement for consequences was not included in the peer review guidance (U.S. Department of Education, 2018). In summary, although never precisely defined, a general definition of the term consequential validity can be gleaned from the works cited previously in this section. A tentative definition might be that consequential validity is a source of validity evidence derived from the intended and unintended consequences of the use of a test. It may be that the current controversial status of what has been called “consequential validity” is attributable to
lack of a precise definition being developed over time, or to an initial explication that was sufficiently problematic so as to simultaneously foster confusion about how consequences of test use fit into a theoretical formulation of validity and permit claims of primacy as an evidentiary source. Finally—and again despite the lack of a precise definition—there has been equally passionate advocacy for consequential validity as there have been logical questions about its very existence and whether consequences of testing bear on validity at all. After more than 30 years, what remains is a controversial formulation of validity in which the notion of consequences of testing as a source of validity evidence is a present but contested feature. The remainder of this chapter weighs in on the controversy. In the following sections, the case is made that what is called consequential validity is not—cannot be— a source of validity evidence because of clear theoretical and logical flaws associated with incorporating concerns about consequences of test use with concerns about establishing score meaning. The specific conceptual and practical problems posed by the attempt to subsume test consequences into validity theory are examined in detail and it is concluded that consequences of testing, or consequences of test use, or so-called “consequential validity” must be rejected as a source of validity evidence if the controversy is to be resolved and progress is to be made toward a comprehensive framework of defensible testing.
Three Conceptual Problems with Consequential Validity

There are at least three conceptual problems with attempting to incorporate consequences of testing as a source of validity evidence under a broader theory of validity. Individually, each of the problems may present only a minor irritation or moderate concern; taken together, they suggest that the incorporation of consequences poses larger, potentially intractable problems for validity theory that require a comprehensive remedy.
1. The Definitional Flaw

Including consequential validity as an aspect of validity theory contradicts the concept of validity itself. Referring back to the first point of agreement regarding validity, it is noteworthy that validity is defined in terms of the accuracy of test-score-based inferences. In Messick's words: "To validate an interpretive inference is to ascertain the degree to which multiple lines of evidence are consonant with the inference, while establishing that alternative inferences are less well supported" (1989, p. 13). Thus, if the very definition of validity pertains to score inferences, any extensions of the concept must be consistent with that definition. Incorporating the consequences of test use is not. The crisp distinction between the intended inference to be made from a test score and the subsequent consequences of how a test score is used is easily illustrated with the examples that follow.
Let us consider first a hypothetical example involving the discovery of a highly accurate blood test. The test is intended to reveal the presence of a certain marker that portends the onset of an incurable disease that results in an impending, certain, and agonizing death within a very short term. As use of the test grows, however, researchers also begin to observe a rise in the incidence of suicide among persons who have been informed that the test has yielded a positive result. Three things are clear from this scenario. First, it would surely be a matter of ethical concern to physicians whether the test should be used, given the rise in suicides. It might be decided, for example, to discontinue the use of the test, or that the reporting of test results to patients should be accompanied by counseling and follow-up. Whatever actions are taken following testing, it is clear that the consequences of the use of the test are important to consider. Second, although the hypothetical situation presents as a given the accuracy of the blood test, the example implies that attention to inaccurate decisions is also important. That is, when categorical decisions result from tests (in this case, illness/no illness), the implications of false positive classifications (i.e., incorrectly informing a patient that he or she has the disease) and false negative decisions (i.e., incorrectly informing a patient that he or she does not have the disease) must be carefully weighed. Finally, and most germane to a reconceptualization of validity that will be more formally presented in Chapters 4 and 5, the example illustrates that the accuracy of a test—that is, the validity of test results—is unaffected by the consequences of its use. The rate of suicides associated with the use of the blood test could fall, skyrocket, plunge to zero, remain unchanged, be limited to a specific age group, sex, ethnicity, educational level, and so on … but none of those observations would have any bearing on the meaning of the test result. This much is clear: whether the person has the disease or not is not affected in any way by what happens subsequent to learning that information; that is, the meaning of the test result is unchanged by information on the consequences of using the test. With the exception of some very infrequent situations described later, information about the consequences of using a test does not cycle back to bear on the validity of score inferences. Analogous but not hypothetical life-or-death consequences of testing have been reported to exist in educational testing today. For example, in Korea, the College Scholastic Ability Test (CSAT) is an examination taken annually by nearly 600,000 students. The test is considered to be a consequential examination that permits or denies access to a wide variety of opportunities in Korean society. On the day the CSAT is administered, "measures [are] taken nationwide to help students take the exams without hitches," including national prohibitions on parking within 200 meters of test sites, restriction of air travel and changes in flight schedules to reduce noise at test sites, and changes in the work hours for public and private employees "to prevent traffic jams in the early morning so that test takers can get to the exam venues in time." It has also been reported that "temples and churches [are] packed
with parents praying for good performances” and that “the high pressure of preparing for the CSAT contributes to the high suicide rates of Korean teenagers” (Rahn, 2008; South Korea’s College Admission Test, 2008). In contrast to the consequences of the hypothetical blood test, the suicides associated with CSAT performance are real. Both situations are cause for concern. However, just as in the blood testing scenario, regrettable events that may be a consequence of administering a test do not change the validity of the scores obtained from those tests. A third example recalls many current debates about test consequences in K–12 education. Suppose that a test purported to assess the attainment of basic operations involving fractions in the population of third graders. The intended inference to be made from scores on this test is that high scoring students have mastered basic operations with fractions; low scoring students have not. Further suppose that the development and administration processes for the test followed all best practices of the testing profession: that is, the test was developed to tightly align to the content standards in the domain; editorial, bias, sensitivity, and appropriateness reviews were conducted; the test was refined via field testing, administered appropriately, scored accurately, and so on. Now suppose that the test was used in a way never anticipated by the test developer; say, by a state that wished to implement a scholarship program to provide disadvantaged students with opportunities to attend special college preparatory schools. Poor performance on the test would result in a lack of access to educational opportunities for some students—a negative consequence. It is precisely such a negative consequence that has fueled much of the attention to test use, and the corresponding desire to elevate that concern by incorporating it into the “most fundamental consideration in developing and evaluating tests,” namely, validity. As concluded previously, however, there is a clear obstacle to the incorporation effort: the use of data from this test to award scholarships has no relationship to the intended inference and therefore cannot bear on validity. Assuming the test development and administration as described, the test would yield highly accurate information about third graders’ mastery of operations with fractions. Such information might be deemed relevant by a variety of legitimate users of the test: for example, by a third grade teacher seeking to measure student achievement in that area, by a fourth grade teacher wishing to gauge important prerequisite skills, by a middle school teacher to assess readiness to learn algebraic concepts, and so on. Indeed, a range of uses of the test data is possible. Surely, not giving scholarships to some students, not promoting some students to fourth grade, mandating summer school for low performance, and other possible uses might represent unfortunate outcomes for some students. Less surely is whether any consequence is universally or objectively viewed as negative. Whether a consequence is positive or negative is not a straightforward matter, but involves the application of values and cost-benefit calculations. For example, whereas retention in grade might be seen as a negative consequence by a retained student or by those
who object to the retention policy, it may also be viewed as a positive consequence by those who reject social promotion or (perhaps eventually) by a student who received needed instruction in lacking knowledge and skills that enabled the student to successfully earn a high school diploma and who otherwise may not have persisted. The first important point here is that regardless of whether the aforementioned consequences or uses of the test scores are considered to be positive or negative, those uses would still be based on accurate (i.e., valid) information about the characteristic the test was intended to measure. The second relevant point is that the consequences of those uses are not relevant to judgments about whether the test does, in fact, yield accurate inferences about mastery of the intended mathematics content. Admittedly, the consequences of test use are important—just not to validity. The preceding scenarios illustrate the essential point that the consequences of test use are not relevant to or associated with the accuracy of the intended inferences. The accuracy of claims regarding mastery of fractions is the same for test takers who receive or do not receive scholarships; the score interpretations are the same for students who are promoted or retained; the intended inferences about mathematics learning are equally accurate for students from whom instruction in algebraic concepts is provided or withheld. In sum, if the theoretical essence of validity is grounded in the accuracy of inferences based on test scores, then validity must logically be agnostic as regards consequences. It is an interesting aside to note that Messick defended his theory against the logical problem in the very definition of consequential validity described in this section, although the defense is notable for its circularity. Referring to criticisms proffered by Mehrens (1997), Popham (1997) and others, Messick noted that: Opponents of consequences as validity evidence find it easy to argue that evidence supporting the accuracy of score inferences about a person’s current status on a construct is separable and orthogonal to the consequences of test misuse … The argument is easy because it is basically true but immaterial … Even on its own terms, this argument is deficient because validity refers not just to the accuracy of score inferences but also to evaluation of the appropriateness, meaningfulness, and usefulness of score inferences. (1998, p. 41) In analyzing Messick’s response, it is noteworthy that he concedes the objections to be “basically true.” And, although the criticisms are rejected as immaterial, the sole basis provided for the rejection is an appeal to the authority of a definition of validity that is Messick’s own creation (see 1989, p. 13). Thus, in addition to lacking an accepted definition, a persuasive defense for incorporating consequences into the definition of validity is also
lacking, and critiques of that incorporation have not been satisfactorily addressed.
2. The Temporal Flaw The second problem associated with consequential validity centers on the temporal aspect of validation. On its face, it would seem unethical to use a test for which little or no validity evidence had been gathered. This is why the building of the validity case necessarily begins at the onset of test development. However, consequences, by their nature, occur ex post facto. Important social consequences of testing are often not discernible until many years after a test has been in use—with consequences attached. Thus, if we are to take seriously assertions regarding the primacy of consequential validity (see International Reading Association and National Council of Teachers of English, 1994), the problem is clear. Adequate validity evidence should be gathered before a test is used, but if consequential validity evidence can only be gathered after a test has been used, then no validation effort can be deemed adequate until the data on consequences is available. If consequential validity is an essential aspect of validation, no test score inference can be judged to be adequately validated in the absence of consequences data, but no consequences data can be gathered without first administering and observing the consequences of an instrument that, by definition, lacks adequate validity evidence. The temporal conundrum here is clear: because consequences occur after a score interpretation has been validated, it is logically impossible for those consequences to be a part of the validation. The presentation of the temporal problem must admit two caveats. First, this problem is not unique to the gathering of information about consequences. The same issue can arise in the case of gathering predictive validity information (e.g., when tests are used to inform hiring/promotion decisions or in the context of college admissions testing). In these situations, information bearing on predictive validity is gained from examination of the results of using the test in a consequential manner. However, in contrast to information about consequences, it is possible—and indeed often the case—that during test development, information about the predictor/criterion relationship is available such that, when it is actually used in practice, some evidence can already have been amassed that permits confidence in the intended inference. No such a priori data on consequences can be gathered because the consequences would not have occurred. Second, and from a procedural perspective, validation is not a neat, linear endeavor; the building of a case for validity of score inferences begins in the nascent phases of test conceptualization and development. And, after a test is administered, additional evidence bearing on score meaning can (and should) be gathered that will further inform test development and judgments about validity. Although non-linear and iterative, instrument design and validation often do follow a familiar course: conceptualization, development, administration, evaluation. Although the temporal problem adhering to consequential validity
is clear (that is, information on consequences inherently comes after administration of the test), it is surely possible to at least consider potential consequences during earlier stages, such as during test conceptualization and development. However, whereas it might reflect enlightened practice, the consideration of potential consequences at the conceptualization and development stages still would not provide information that would bear on the intended score inferences, and thus precludes consequences of test use from inclusion as a source of validity evidence.
3. The Causal Flaw The third issue with incorporating consequences into validity is the obvious problem that, although some consequences of using a test may be plausibly anticipated, unintended consequences of test use cannot always be foreseen (hence the label “unintended”). Thus, unintended consequences cannot be investigated prior to test use. However, even allowing that some unintended consequences of test use can be anticipated, the difficulty of ascribing causation—a challenge endemic to all social science research—remains. It is perhaps useful to recall previous illustrations involving a blood test and a third grade test. In the case of the blood test, researchers might observe occurrences or even gather data that permit them to hypothesize a link between a positive test result and suicide, but that is different from establishing that one caused the other. Analogously, it would be equally difficult to establish a causal relationship between poor performance on a test covering fractions with (ultimately) reduced educational or job-related opportunities. The term consequential validity carries a strong causal implication; namely, that some event following the administration of a test was a consequence—that is, caused by—the use of the test. As Reckase (1998) has noted: The definition of a consequence is “the effect, result, or outcome of something” … This definition implies that there is a cause and effect relationship between something that occurred earlier than the result. It is very difficult to demonstrate a cause and effect relationship, even under carefully controlled experimental conditions. Such controls are typically not present in testing programs. (p. 14) Reckase concluded that, ultimately, consequences cannot be incorporated into validity theory if only because “the evaluation of unanticipated consequences in any formal way seems impossible” (p. 16). The experimental conditions alluded to by Reckase (1998) imply random assignment of persons to treatment and control groups; in this case, some persons would be assigned to a testing condition and others to a no-testing condition, and data for both groups on some
outcome variable of interest (e.g., graduation rates, patient satisfaction surveys) would be compared. However, randomized controlled trials are not the only way to establish causation. Strictly speaking, three conditions are necessary to demonstrate that some event, "A," caused some other observed event, "B." Those conditions are: (1) A preceded B; (2) there is a statistical association ("covariation") between A and B; and (3) there is no other factor, C, that may have occurred with A to cause B. As might be expected, the likely presence of many potential C factors hampers attempts to identify causation in the social sciences. Because of this, in nearly all social science research, claims of causality are extremely difficult to investigate and support. Consequently, causal assertions in the social sciences are rarely made and, when they are, they are often met with skepticism. The same social science research traditions for establishing causation apply to the causal claims implied by the terms "consequential validity" or "consequences of test use." Claims that the use of a test caused some result are typically only asserted; rarely if ever is the kind of evidence necessary to support a causal claim even minimally investigated. Instead, such claims are simply put forward, along the lines of "the use of Test X causes increased dropouts in high school" or "the presence of Test X discourages some qualified candidates from applying" or "the use of Test X does not cause any improvement in the educational (or medical, or social service, or personnel selection, or other) system." In conclusion, the notion of consequential validity is a causal assertion. Strong evidence in support of any causal assertion is necessary; such evidence is rarely if ever provided in support of claims regarding the consequences of test use.
Three Practical Problems with Consequential Validity

Considering only the conceptual issues detailed in the preceding section (i.e., lack of definition, temporal sequencing, and unsupported causal claims), the notion of consequential validity has substantial flaws; these alone seem so significant and difficult to overcome as to be sufficient for excising consequences from validity theory. However, even setting aside these conceptual difficulties, the idea of using consequences of test use as validity evidence brings about practical problems for those who engage in validation efforts and has vexing implications for testing practice. In the following sections, three such problems are illustrated.
1. The Problem of Delimitation
The first practical problem with incorporating consequences into the concept of validity is that the boundaries of appropriate consequential validity evidence cannot be circumscribed. The array of potential consequences that might be considered as part of a validation effort has not been delimited, leaving those charged with the work guessing as to what kinds of evidence related to consequences are appropriate. This problem is not new and not unrecognized. In fact, the issue arose at the very introduction of the concept of consequential validity. Referring specifically to the pragmatic difficulty of incorporating test consequences into validation practice, Messick demurred: “There are few prescriptions for how to proceed here because there is no guarantee that at any point in time we will identify all of the critical social consequences of the testing, especially those unintended side-effects that are remote from the expressed testing aims” (1988, p. 40). Reflecting on this observation, the most concrete advice Messick was able to muster was that the effects of the testing might be compared to “the potential social consequences … [of] not testing at all” (p. 40). Other researchers have recognized this problem and have sought to illustrate that at least some consequences could be considered off-limits. Contemplating the context of large-scale achievement testing, Shepard opined that: “I, for one, would not hold test publishers responsible for all possible test uses. Makers of standardized tests are not responsible for the effect of scores on the real-estate market, for example” (1997, p. 13, emphasis added). In other words, it would seem that virtually nothing is off the table. Elsewhere, Shepard (1993) confirmed that what remains on the table is nearly limitless, recommending that essentially unanswerable questions must be addressed to demonstrate consequential validity, such as “Does a credit-by-examination program improve social mobility?” (p. 426) and recommending that validation should “consider hidden assumptions … about what test use will accomplish” (p. 423). According to Kane, a test’s stakeholders should also be polled for consequential validity evidence: “Many different kinds of evidence may be relevant to the evaluation of the consequences of an assessment system … and many individuals, groups, and organizations may be involved (2001, p. 338). Moss has suggested the interpretations of test score inferences as a source of evidence, including the meaning that examinees make from test scores, the circumstances in which score interpretations are made, the sociohistorical characteristics of the contexts, and “the forms of interaction and mediated quasi-interaction about the message; and the discursive elaboration of the mediated messages” (1998, p. 10). How can these suggestions for collecting consequential validity evidence be summarized? Such recommendations require collecting evidence on the most ineffable variables, stakeholders’ mediated interpretations of intended interpretations, bearing on the widest range of applications including off-label uses and “hidden” purposes that have not even been suggested by the test developer. Could those who engage in test validation ever possibly meet
—or even understand—such expansive and ephemeral standards? Charged with investigating the effect of a credentialing, educational, or psychological test on social mobility across diverse populations and sociohistorical contexts, it is difficult not to empathize with those responsible for conducting rigorous validation efforts for simply throwing up their hands, and easy to see why validation practice might be languishing. The dilemma facing those who develop, administer or use test results is clear: Because no sound guidelines have been—or likely could be—developed to delineate which stakeholders, perspectives, groups, or sources of consequential evidence are relevant, the case for validation of any particular score inference is easily threatened. This issue was perhaps first articulated by Pace who, in the course of offering critique of a specific test noted that: The[se] … criticisms are not technical ones; they are educational and social. What is being reviewed here is not just a set of tests, but an educational program for guidance and placement. Hence, the relevant criteria for criticism are educational, social, and philosophical, not simply psychometric. If one grants the educational and political import of testing, then all sorts of criteria become relevant. (1972, p. 1027, emphasis added) More recently, Borsboom, Mellenbergh, and van Heerden noted the devolution of validity theory along the same lines, observing that: Validity theory has gradually come to treat every important test-related issue as relevant to the validity concept … In doing so, however, the theory fails to serve either the theoretically oriented psychologist or the practically inclined tester … A theory of validity that leaves one with the feeling that every single concern about psychological testing is relevant, important, and should be addressed in psychological testing cannot offer a sense of direction. (2004, p. 1061) Attempts to address this problem have not provided clear direction. Kane’s (2006a) chapter in the fourth edition of Educational Measurement was the first major attempt to craft a complete and coherent theory of validity since Messick’s (1989) chapter in the third edition. Kane’s work, however, fails to take on the fundamental conceptual problems with consequential validity and does not provide greater clarity as to how a validation process involving consequences should proceed. For example, although Kane observes that the potential stakeholders in any testing process are diverse and numerous, he notes that “any consequences that are considered relevant by stakeholders are potentially relevant to the evaluation of how well a decision procedure is working” and that “the measurement community does not control the agenda; the larger community decides on the questions to be
asked” (p. 56, emphasis added). However, Kane does not then address the clear problems of stakeholder identification; power differentials among stakeholders; and lack of agreement among stakeholders on the appropriate validity questions to be addressed or even on what the intended inference is. Surely, conflicts will arise in the proposed negotiations. As a means of addressing such conflicts when contending constituencies or competing claims are present, Kane has suggested that “agreement on interpretations and uses may require negotiations among stakeholders about the conclusions to be drawn and the decisions to be made” (2006a, p. 60). Thus, rather than clarifying the role of test consequences in the validation process, it would seem that current validity theory has done the opposite, broadening the population of potential stakeholders without clear guidance as to how the appropriate stakeholders for any situation should be identified or limited, and handing over the test maker’s intended score interpretation to an ascientific construct-definition process without practical suggestions for conducting and arbitrating what would surely be high-stakes, politicized, and contentious negotiations. To be sure, well-designed and executed studies to identify stakeholders’ interests regarding a proposed test use, focus groups of various audiences to learn the kinds of appropriate and inappropriate interpretations that are made of test scores, or a variety of other measures, would surely yield interesting and useful results. However, as regards the status of the results as validity evidence, it is not relevant to consider whether such results are informative: they are. The important validity question is this: “Would any of that information bear on the accuracy of the intended inference to be drawn from the test score?” And, regarding delimitation, if nearly any source of consequential evidence is considered to affect score validity, then it would seem impossible to exclude any perspectives, constituencies or interests when validity judgments are made. In the end, this state of affairs again points to consequences as not being a part of validity, but of something else. In sum, the attempt to incorporate consequences of test use into validity theory has not helped to clarify the concept itself or to delimit the possibilities for legitimate sources of evidence. Just the opposite. The menu of desirable or necessary sources of consequential validity evidence has expanded exponentially. As will be shown, the expansion may have precipitated unintended consequences of its own.
2. The Problem of Practice
Reminiscent of Ebel’s observation about validity that “the good works done in its name are remarkably few” (1961, p. 640), Brennan observed—45 years later—that “validity theory is rich, but the practice of validation is often impoverished” (2006, p. 8). As one example, a recent review in the medical professions of validity evidence related to simulation-based examinations covered 417 studies. Of those, only 217 studies had information related
to validity so as to be eligible for inclusion in the review. Of the 217, the authors of the review found that only six described validity evidence along the lines of the modern (i.e., unified) validity framework. Further, “one-third reported no validity evidence; one-third reported either content, reliability or relations evidence, [and] the most commonly reported validity evidence was the relation with learner characteristics” (Cook et al., 2013, p. 872). The authors concluded that regarding the reporting of validity evidence, “conditions are not improving” (p. 883). Ironically, the attempted incorporation of consequences into validity may have had unintended consequences. Among other characteristics, a useful theory of validity would aid in the identification and solution of practical problems and foster improved practice. There is now accumulating evidence that the incorporation of consequences into validity theory has the opposite effect. The evidence comes from measurement theorists who openly fret about the state of validity theory, and from research into validity reporting practices. An example of the former is seen in Kane (2006a) who has observed that “validity theory … seems to have been more successful in developing general frameworks for analysis than in providing clear guidance on how to validate specific interpretations and uses of measurements” (p. 18). Perhaps more troubling are suggestions that the lack of theoretical clarity may actually be contributing to weaker validation efforts. For example, Frisbie has expressed the concern that misunderstandings about validity “can lead test developers and users to unintentionally shortcircuit the validation process” (2005, p. 23). Recent research suggests that many—perhaps most—measurement specialists have implicitly rejected, or at least have been ignoring, consequences of testing as a source of validity evidence. For example, a study by Cizek, Rosenberg, and Koons (2008) used the Mental Measurements Yearbook (MMY, Spies & Plake, 2005) to examine three characteristics: (1) the extent to which information contained in validity reports conformed to the major aspects of modern validity theory; (2) the specific sources of validity evidence reported; and (3) the validity evidence reviewers considered to be most important. Only 27 of the 283 MMY entries cited either Messick’s (1989) chapter on validity or the current Standards in support of claims about validity. Similarly, findings reported by Cook et al. (2013) showed a less than enthusiastic embrace of modern validity theory. Regarding so-called consequential validity in particular, Cizek, Rosenberg, and Koons (2008) found that the three most frequently mentioned sources of validity evidence in MMY reviews were construct, concurrent, and content (identified in 58.0, 50.9, and 48.4% of the tests, respectively); validity evidence based on test consequences was noted for only two tests. Related research by Taylor and Sireci (2019) illustrated what the authors called “the disconnect between theory and practice” (p. 1) related to consequences of testing. The authors reviewed information about validity available on the websites of large-scale K–12 student assessment programs. A representative example from one of the sources revealed
that, although 42 research reports ostensibly providing information on consequential validity were cited, “none of the 42 specifically mentioned consequences of testing” and the reports focused mainly on “general testing perceptions and values” (p. 14). Additional information on the extent to which consequences of test use have been rejected by measurement specialists as a source of validity evidence was produced in another study (Cizek, Bowen, & Church, 2010) in which conference programs for the three sponsoring organizations of the Standards (the American Educational Research Association, the American Psychological Association, and the National Council on Measurement in Education) were searched for the terms validity, validation, consequences, and consequential. Table 3.1 provides a summary of the results, which indicate that, while validity and validation generally are topics frequently addressed by researchers in these organizations, essentially no attention was paid to the area of consequential validity. The intersection of the terms consequential and validity was never found. In the few cases in which consequences or consequential appeared (7 and 0 instances, respectively), the actual topic addressed was never the consequences of a test use, but the consequences of a policy. For example, of the six AERA papers with “consequences” in the title or description, three addressed the consequences of the accountability requirements of the No Child Left Behind Act (2002), two addressed the consequences of a specific state accountability system, and one addressed consequences of implementation of pre-service teacher portfolios.
Table 3.1 Incidence of consequential validity in professional association presentations
Professional Association (Year)    Validity    Validation    Consequences    Consequential
APA (2006)                               47            42               0                0
AERA (2007)                              30            24               6                0
NCME (2007)                               5             1               1                0
Totals                                   82            67               7                0
In summary, although it would seem reasonable to expect that considerable attention would be given to what is currently included in the Standards as a potential source of validity evidence, the near complete absence of attention to one particular source provides support for the notion that it has been tacitly rejected by many, perhaps most, measurement specialists. Of course, this finding should not be interpreted as an (illogical) argument that because people don’t do something, therefore they shouldn’t do it. Rather, the finding supports an assertion regarding why people have done something. The fact that practitioners have essentially ignored consequences as validity evidence is an additional piece of information that supports a plausible hypothesis; namely, that practitioners do not appear to gather or report evidence on validity based on consequences because consequences are not a logical part of validation. Whereas there may be a number of possible explanations for why a
putatively vital source of validity evidence is uniformly ignored, none is as parsimonious or plausible as the proposition asserted here: the incorporation of consequences into the theory of validity was simply an error.
3. The Problem of Location
If consequences are not part of validity, then how should the consequences of testing be captured as an essential aspect of the conduct of testing? Surely test consequences are important. For that reason, some theorists have argued that the failure to incorporate them into validity would relegate consequences to second-class status and attention to consequences would disappear. For example, Kane has insisted on retaining consequences in validity, arguing that, “to say that social consequences count against validity only when they are due to sources of invalidity is to give them a secondary role in validation” (2006a, p. 55). This argument seems convoluted: that some evidence should “count against validity” because it is a source of invalidity seems like a good thing; if some evidence is not a source of invalidity, it is not at all clear why it should count against validity. It is hard to imagine any viable theory of validity in which sources of invalidity are not given primary attention in evaluating the tenability of test score interpretations. The concern about the status of consequences was perhaps most succinctly articulated by Linn, who argued that “removing considerations of consequences from the domain of validity … would relegate them to a lower priority” (1997, p. 16). Indeed, this concern appears to be at the heart of objections to a theory of validity that fails to incorporate consequences, but there are several persuasive rebuttals. For one, as has been described previously, there has been a de facto rejection of the concept of consequential validity in any practical sense. That is, given the nearly complete absence of attention to consequences as a source of validity evidence in current literature, research, and validation efforts, it is hard to imagine that the concept of consequential validity could be relegated to any lower priority. For another, there is a logical problem with arguing that the importance of a thing demands its inclusion in the domain of validity. There are many important aspects of educational measurement, but their importance does not require that the definition of validity be expanded to accommodate them. Finally, the existing literature on validity does not provide guidance regarding alternative frameworks that would provide a conceptual home for both validity and consequences. Just as there is widespread indifference on the part of practitioners toward inclusion of consequences as a source of validity evidence, there have also been few formal attempts at reconciliation of the problem of consequential validity. Only rarely has a significant theoretical work been offered by individual scholars (e.g., Borsboom, Mellenbergh, & van Heerden, 2004) or a professional association (e.g., Society for Industrial and Organizational
Psychology [SIOP], 2018) that rejects linkage of test consequences as bearing on the validity of test score inferences. For example, the SIOP Principles for the Validation and Use of Personnel Selection Procedures assert that consequences of test use are not a validity concern but “constitute a policy issue for the user,” and that consequences of testing might inform policy or practice decisions, but “such consequences do not threaten the validity of inferences that can be drawn from the … test scores” (2018, p. 6).
The Most Fundamental Problem with Consequential Validity
Among the amalgam of problems described above bearing on considering consequences of test use as a source of validity evidence, the problem of definition stands out. Because a definition of validity is not presented in the current edition of the primary professional reference, Educational Measurement, perhaps the most familiar current definition of validity is Messick’s widely cited description offered in the previous edition. That definition contains two components. Messick describes validity in terms of developing “an interpretive inference … to ascertain the degree to which multiple lines of evidence are consonant with the inference, while establishing that alternative inferences are less well supported” (1989, p. 13); this description clearly focuses validity on the intended score interpretations. However, Messick then added to the description, asserting that “to validate an inference requires validation not only of score meaning but also of value implications and action outcomes … and of the social consequences of using scores for applied decision making” (p. 13). The result of that addition is a double-barreled definition of validity. Ultimately, Messick extended that double-barreled approach to his oft-cited definition of validity as “an integrated evaluative judgment of the degree to which empirical evidence and theoretical rationales support the adequacy and appropriateness of inferences and actions based on test scores or other modes of assessment” (1989, p. 13, emphasis in original). In short, the contemporary conceptualization of validity suffers from defining a putatively unitary concept—validity—as two things: (1) the extent to which evidence supports an intended test score inference, and (2) the extent to which the subsequent actions or consequences of using a test align with (implicit or explicit) values and intended actions or outcomes. Of course, the fundamental problem with a double-barreled definition is that a single concept cannot be defined as two things. The problem is compounded by the fact that the two characteristics are not highly similar but are qualitatively very different, with one aspect (i.e., the accuracy of score inferences) conceptually distinct from the other (i.e., the uses of test scores or the actions taken based on test results). In the case of the hypothetical blood test described at the beginning of this chapter, how could one possibly come to a single, integrated conclusion about validity, integrating evidence from the clinical accuracy
of the procedure and the observed social consequences related to suicide rates? On a 30-item achievement test comprising 15 French vocabulary items and 15 geometry items, what sense can possibly be made of a score of 21? The most fundamental problem with incorporating consequences as a source of validity evidence as suggested by Messick (1989) is that it requires integration of that which cannot be combined to yield a coherent result. The current edition of the Standards perpetuates the conflation of score meaning and test use, indicating that validity “refers to the degree to which evidence and theory support the interpretations of test scores for proposed uses of tests” (AERA, APA, NCME, 2014, p. 11). This current conceptualization of validity requires the gathering and synthesis of empirical evidence, theoretical rationales, and information about two important but incompatible dimensions: evidence bearing on an intended inference and evidence regarding consequences of test use. However, the qualitatively different nature of these sources of evidence precludes making the kind of integrated, evaluative judgments demanded. In practice, evidence gathered to bear on the intended meaning of scores yielded by a test is typically not relevant to the separable concern of how those test scores should be used or the consequences of using them in any particular manner. For example, strong content validity evidence for a mandated biology end-of-course test for high school graduation would support the inference that the test measures the biology knowledge and skill specified in a district’s curriculum. Such evidence would be necessary to support the use of the test for making any interpretations related to students’ mastery of the biology curriculum. Beyond that, the evidence gathered in support of intended score meaning would not provide support for using the test as a basis for awarding high school diplomas. Conversely, evidence related to the consequences of using the test—for example, suppose that the use of the biology test as a graduation requirement increased (or decreased) student persistence in high school— would provide no evidence relevant to the claim that the test was well-aligned to the biology curriculum. Indeed, a wealth of positive consequences of using the test (e.g., increased attendance in the biology course, increased student ratings of satisfaction with their biology courses and interest in biology, increases in subsequent advanced placement biology course taking, increased grades in related courses such as chemistry and anatomy, increased rates of students pursuing STEM careers in college, etc.) would not provide any evidence about what the test actually measured. In general, the meaning, interpretation, or intended inferences based on the test—that is, the validity of the test scores—is unaffected by actions based on the test scores, the uses of the test results or the consequences of those uses. Overall, evidence that the use of a test has certain valued benefits or detrimental consequences says nothing about what the test scores mean; evidence supporting score validity (i.e., abundant evidence based on test content, response process, internal structure, or relationships to other variables) says nothing about whether test scores should be used for
any specific purpose. Two implications of this should be clear: (1) even the strongest evidence in support of the intended meaning of test scores is a necessary but insufficient condition when considering a test for any specific use; and (2) it would be professionally reckless to use test scores for any given purpose when the very meaning of scores produced by the test is unsupported or unclear. In conclusion, these sources of information—evidence about the extent to which a test yields accurate inferences about an examinee’s standing on a construct and evidence about the broader consequences of using the test—are not compensatory, nor can they be combined into any coherent, integrated evaluation. Separate conclusions (i.e., conclusions about the meaning of scores and about the prudence of using scores obtained from a test in any given way) should be reached. However, any attempted integration of the distinct sources of evidence confounds conclusions about both score interpretation and the desirability of using the test. The current conceptualization of consequential validity presents those engaged in applied testing with the impossible task of producing an integrated evaluative judgment and a synthetic evaluation of that which cannot be integrated or synthesized. The synthesis of evidence bearing on the accuracy of test score inferences and evidence bearing on the appropriateness of test score use envisioned by Messick (1989) and others is neither logically nor practically possible. It is not surprising that, in over 30 years since it was proposed, no example of Messick’s proposed outcome (i.e., a synthesis of theoretical, empirical, and social consequence data yielding an overall judgment about validity) has ever been produced.
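The incoherence of any such composite can be made concrete with the 30-item French/geometry test mentioned above. The following minimal sketch (hypothetical subscores, for illustration only) shows how a single integrated number obscures two qualitatively different pieces of information:

```python
# Illustration only: two hypothetical examinees on the 30-item test
# (15 French vocabulary items + 15 geometry items) described above.
examinees = {
    "Examinee A": {"french": 15, "geometry": 6},
    "Examinee B": {"french": 6, "geometry": 15},
}

for name, subscores in examinees.items():
    total = subscores["french"] + subscores["geometry"]
    print(f'{name}: French={subscores["french"]}, Geometry={subscores["geometry"]}, Total={total}')
# Both totals equal 21, yet the two profiles support entirely different inferences;
# the "integrated" number cannot recover either one.
```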
Can Consequences of Test Use Ever Provide Validity Evidence?
As noted previously in this chapter, although the term consequential validity has come to be understood as treating test consequences as bearing on validity, it has been argued from both logical and practical viewpoints that this is not possible. There are, however, rare circumstances in which consequences of test use can actually cycle back to inform the intended meaning of a test score or refine the claims made about the construct a test purports to measure. This special case of consequences of testing informing intended score meaning occurs when post-testing evidence reveals mis- or under-specification of the construct. An illustration of this was provided by Guion (1980). Guion described an experiment in which male and female participants were judged on their speed in packaging golf balls into cartons from an assembly line that was placed at a specified distance from the participants. A very short distance between participants and the assembly line advantaged females (who, on average, had shorter arms than males) compared to males, who found the working conditions too cramped for rapid movement. The resulting consequence data—that is, the greater failure
rate for the males—were evidence that the construct, packing speed, had been mis-specified; an unintended and irrelevant factor (arm length) was affecting the test results. Importantly, the illustration also shows how consequences did not affect the accuracy of an inference— that is, that females, in fact, tended to be speedier packagers of golf balls than males under the specified conditions. It also clearly shows how evidence obtained after a test has been administered can be valuable in identifying aspects of the test design or testing conditions that are not consonant with the intended inference about the construct. It is precisely the kind of result just described that is captured by the SIOP validity guidance related to consequences. According to Principles for the Validation and Use of Personnel Selection Procedures, the golf ball packing results illustrate the principle that “such evidence is relevant to inferences about validity only if the negative consequences can be attributed to the measurement properties of the selection procedure itself” (2018, p. 6). Elaborating on this principle, the SIOP guidance provides an example in a context where the use of a test results in differential consequences for subgroups of test takers (e.g., males and females): Subgroup differences in test scores and subsequent differences in selection rates resulting from the use of selection procedures are often viewed as a negative consequence of personnel decisions. Group differences in predictor scores and selection rates are relevant to an organization and its personnel decisions; yet, such differences alone do not detract from the validity of the intended test interpretations. If the group difference can be traced to a source of bias in the test (i.e., measurement bias), then the negative consequences do threaten the validity of the interpretations. Alternatively, if the group difference on the selection procedure is consistent with differences between the groups in the work-relevant behavior or outcome predicted by the procedure (i.e., lack of predictive bias), then the finding of group differences could actually support the validity argument. (p. 6) Unfortunately, it is not typically this kind of construct-relevant evidence that is meant when consequential validity is referenced. Rather, policy implications or other social consequences of testing are invoked as bearing on the validity of a test when they have no relationship to the meaning of the test scores.
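In personnel selection contexts, the SIOP distinction between measurement bias and legitimate group differences is commonly examined with a predictive bias (Cleary-type) regression analysis. The sketch below is only a hedged illustration of that general approach, using simulated data and hypothetical variable names; it is not a procedure prescribed by the SIOP Principles.

```python
# Hedged sketch (simulated data, hypothetical variables): a Cleary-type check for
# predictive bias. Group differences in test means alone are not evidence of
# invalidity; the question is whether the test-criterion regression differs by group.
import numpy as np

rng = np.random.default_rng(0)
n = 1000
group = rng.integers(0, 2, n)                # subgroup indicator
skill = rng.normal(0, 1, n) + 0.4 * group    # work-relevant attribute; real mean difference
test = skill + rng.normal(0, 0.3, n)         # selection-test score (no built-in measurement bias)
criterion = skill + rng.normal(0, 0.5, n)    # later job-performance measure

# Moderated regression: criterion ~ intercept + test + group + test*group
X = np.column_stack([np.ones(n), test, group, test * group])
coefs, *_ = np.linalg.lstsq(X, criterion, rcond=None)
print(dict(zip(["intercept", "test", "group", "test_x_group"], np.round(coefs, 3))))
# Sizable group or interaction coefficients would suggest differential prediction;
# here the group difference in test scores mirrors a real difference in the predicted attribute.
```

Evidence of measurement bias (for example, from differential item functioning analyses) would, by contrast, bear directly on the meaning of the scores.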
Conclusions and the Harm in the Status Quo
In summary, it is plausible that the error of incorporating consequences has led to several undesirable outcomes, including discontent in the field, theoretical confusion, logical
intractabilities, inexhaustible possibilities for evidence to consider, absence of criteria for excluding potential sources, and a pervasive apathy regarding inclusion of consequences information in technical documentation; in addition, the incorporation may well be a contributing factor to less than enthusiastic validation efforts. Even if evidence about consequences of test use were gathered, there is no way that information can be synthesized with other sources of evidence to result in an overall, coherent judgment about appropriate score inferences. It has been demonstrated in this chapter that the flaws associated with considering consequences of testing as evidence of validity are so substantial that the notion must clearly be rejected. Naming consequential validity as an essential source of validity evidence logically conflicts with using a test in any way that has consequences, thus preventing such evidence from ever being collected and arguing against even using the test in the first place. Ironically, the incorporation of consequences of test use into the concept of validity—perhaps originally intended to strengthen validation efforts—may have actually served to weaken validation efforts generally. So what should be done? As Kane has suggested regarding the combined interpretation-and-use model of validity, “in the best case, if the interpretation model was implemented rigorously, the interpretations of test scores would be validated, and then, in a separate process, the uses of the test scores would be evaluated” (2015, p. 8, emphasis added). In another place, Kane also observed that “In developing [the Standards], the organizations have put the requirement for evaluation of proposed interpretations and uses under the heading of validity. We can change that and put some or all of these issues under some other heading … but if we do so, we will have to reformulate much of the standard advice provided to test developers and test users” (Kane, 2009, p. 62). Amen. That is precisely the work that needs to be done. Modern validity theory has a smaller problem that must be confronted. It is insufficient merely to point out the error of including consequences of test use as an aspect of validity. It is unacceptable to simply jettison concern about consequences of testing altogether, but it is equally untenable for consequences of test use to maintain their uneasy coexistence with other sources of evidence in support of intended score meaning. Consequences must be incorporated into a comprehensive framework for defensible testing. Modern validity theory has a greater problem, however: the conflation of test score meaning and use. Clinging to an unworkable conglomerate conceptualization has yielded no benefits. Continued wrangling about semantic or epistemological nuances won’t accomplish the ultimate goal of ensuring that test information is dependable, fair, accurate and useful. A comprehensive framework for defensible testing is needed. Such a framework must address two equally fundamental, important and separable concerns: validation of score meaning, and justification of test use. Ultimately, it is hoped
that such an alternative framework will not only redress the comparatively lesser concern regarding the controversial place of consequences, but will advance clarity regarding the concept of validity and, ultimately, address the more substantial need to engender greater enthusiasm for searching and coherent validation efforts.
4
A COMPREHENSIVE FRAMEWORK FOR DEFENSIBLE TESTING PART I Validating the Intended Meaning of Test Scores
The quality of validation evidence is of primary importance. (Society for Industrial and Organizational Psychology, 2018, p. 5)
The purpose of this chapter is two-fold. The first part provides a rationale and explication of the two components of a comprehensive approach to defensible testing: (1) validation of intended test score inferences; and (2) justification of intended test score uses. The second part of the chapter provides detail on the first half of the framework: the processes and evidentiary sources for addressing the validation of intended test score inferences.
A Comprehensive Framework for Defensible Testing
A full, comprehensive framework for defensible testing must address two essential aims: the validation of intended test score meaning and the justification of intended test score use. These aims can be thought of as two different, important, and related but distinguishable research questions. In generic form, they can be represented as follows:
Research Question 1: “What do these scores mean?”
Research Question 2: “Should these test scores be used for Purpose X?” (where X is a specific test use).
Many variations of these questions flow from the variety of specific testing contexts that exists. Table 4.1 lists pairs of interrogatories to illustrate this essential difference between the equally essential tasks of validating score meaning and justifying test use.
Table 4.1 The different research questions addressed in validation and justification

Biology
  Research Question 1 (validation of intended test score inference): “Do these end-of-course biology test scores reflect knowledge mastery of the high school biology content standards?”
  Research Question 2 (justification of intended test use): “Should these end-of-course biology test scores be used for awarding high school diplomas?”

Clerical skill
  Research Question 1: “Can these test scores be interpreted as an examinee’s level of skill and accuracy in performing clerical tasks?”
  Research Question 2: “Should these scores be used as part of our company’s screening or hiring process for clerical staff?”

Depression
  Research Question 1: “Do scores on this instrument reflect adult test takers’ levels of depression?”
  Research Question 2: “Should this instrument be used as a pre- and post-intervention measure to evaluate intervention effectiveness?”

College admissions
  Research Question 1: “Do these ACT/SAT scores measure high school preparation for success in college?”
  Research Question 2: “Should these ACT/SAT scores be used for college admission decisions?”

Self-harm
  Research Question 1: “Can these test scores be interpreted as indicators of middle school students’ risk of self-harming behaviors?”
  Research Question 2: “Should these scores be used as part of an overall middle school mental health screening?”
There are at least four important conclusions to be drawn from the entries in the table. First, the range of contexts illustrated in the table provides some sense of the broad applicability of the framework, extending across testing applications in psychology, personnel selection, achievement testing, and other contexts. Second, a comparison of the entries in the Research Question 1 and Research Question 2 columns reveals just how different the two essential aims of defensible testing are. These differences support the notion that validity cannot be considered to be a single concept incorporating both test score meaning and use involving a synthesis of information on two such disparate research questions. Rather, the distinct questions addressed regarding validation of intended test score meaning and justification of intended test score use require separate investigations. And, as will be seen, answering the two qualitatively different questions requires that different sources of evidence be gathered. Evidence gathered to bear on the question of score meaning would not typically be relevant to answering the question about use; consequently, any evidence gathered on one of the questions is non-compensatory with respect to the other. For example, strong evidence based on test content for the biology end-of-course test would support the inference that the test measures biology knowledge and skill. Such evidence would be necessary to support the use of the test for making any interpretations related to students’ mastery of the biology course curriculum. However, beyond providing confidence in what those test scores mean, it would not provide support for using the test as a basis for awarding high school diplomas. Third, it should be obvious that each of the research questions for a given context shown in the table is equally valuable; that is, because they address distinctly different aims, it is not possible to say, for example, that gaining a deep understanding of the meaning of the biology test scores is more or less important than the decision about whether the achievement test
scores should be used as part of a process for awarding or denying high school diplomas. Fourth, considering only the first question in each pair—that is, the validity question—it seems evident that only the developer of a test is in a position to investigate the question and to possess the time and other resources required to engage in the kinds of research, development, and other activities necessary to answer it.
Foundations of Validating Intended Test Score Inferences As stated previously, before a test score can be contemplated for some intended use (consequential or otherwise), it must be demonstrated that the score can be confidently interpreted as intended. This notion of evidentiary support for the intended score inference is central to the concept of validity. In Chapter 5, a proposed framework for gathering evidence to justify an intended use will be presented. The focus of the present chapter, however, is on sources of evidence in support of intended score meaning. The starting point for a comprehensive reconceptualization that addresses both validation of score inferences and justification of test use is the definition of validity presented in Chapter 1: Validity is the degree to which scores on an appropriately administered test support inferences about variation in the construct that the instrument was developed to measure. That definition rests on two weak assumptions: (a) that all tests are developed with the common purpose that scores on the test will reflect variation in whatever characteristic the test is intended to measure, and (b) that all tests are developed for use in at least one specific population or context. These presuppositions compel reconsideration of the second principle identified in Chapter 2 as a point of broad agreement related to validity: namely, that validity is not a property or characteristic of an instrument, but of the scores yielded by the instrument. As it turns out, validity is situated between those two positions: it is a property of inferences about scores, but those scores are inextricably generated by and linked to a specific instrument, administered under acceptable conditions to a sample of examinees from the intended population. To elaborate on the notion that validity refers to scores on an instrument reflecting variation in the construct the instrument was developed to measure, an illustration may be helpful. Figure 4.1a portrays a hypothetical group of test takers, arrayed in a line from the lowest standing on the construct of interest (i.e., least motivated, least anxious, least skilled, least knowledgeable, etc.) to the highest standing on the construct of interest (e.g., most motivated, most anxious, most skilled, most knowledgeable, etc.). Typically, test takers do not array themselves in this way naturally; if that were the case, there would be no need for tests. The figure illustrates this reality with the cloud drawn around the test takers, indicating that their true status on the construct of interest cannot definitively be known. However, the
figure also illustrates a core tenet of measurement in the social and behavioral sciences; that is, there is variation in these examinees’ standing on the construct of interest.
Figure 4.1a Hypothetical distribution of examinees on construct of interest Tests are useful, however, to the extent that they yield scores that are valid; that is, scores that, as accurately as possible, reflect examinees’ standing on the construct of interest. Figure 4.1a also shows a test score scale at the bottom of the figure. In this case, the scale is a commonly used percent correct scale ranging from 0 to 100%. When a test intended to measure the construct is developed, administered to the group of examinees, and scores are obtained, there will also likely be variation in the examinees’ scores. We now have the two fundamental ingredients to describe validity. In nearly all measurement contexts: (1) there is variation across test takers in their standing on the construct of interest; and (2) there is variation across test takers in their observed performances. It is the relationship between the variation in the characteristic measured by a test and the variation in observed performances that is the central question probed in any validation study. Conceptual illustrations of three possibilities for that relationship are shown in Figures 4.1b, 4.1c, and 4.1d.
Figure 4.1b The “ideal” validity situation
Figure 4.1c The “worst case” validity situation
Figure 4.1d A “typical” validity situation Figure 4.1b illustrates what can be thought of as the ideal validity situation aligned to the definition of validity provided above. In this figure, the underlying variation in examinees’ standing on the construct of interest is perfectly reflected in the observed variation in their test performances. (It is important to note that Figure 4.1b is provided as a heuristic only; it is not intended to convey a literal reality in which there is an isomorphic correspondence between a quantifiable percentage of domain mastery and amount of a construct.) Those who received the information yielded by the situation portrayed in Figure 4.1b could have a high degree of confidence in the inferences made from these examinees’ test scores regarding their standing on the construct of interest. Taking this a step beyond the established confidence in the meaning of these scores, in this situation, there would be strong potential for the scores to be used in some intended manner although, as we will see in the next chapter, that case would need evidentiary support as well. At the other extreme, Figure 4.1c illustrates what might be described as the worst-case validity situation. In this figure, there appears to be no relationship between the scores examinees obtain on the test and their standing on the construct of interest. Recipients of the information yielded by such a test would have no confidence in the meaning of these scores vis-à-vis the characteristic about which they wished to make inferences. And, as will be explored later, because they lack support for their meaning, there would be no credible basis for using these scores for any purpose. The last figure in the series, Figure 4.1d, illustrates what is likely the typical validity situation—neither ideal, nor worst case, but where some confidence is afforded that variation in the scores examinees obtain on the test reflects variation in the examinees’ standing on the construct of interest.
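The three scenarios in Figures 4.1b through 4.1d can also be expressed quantitatively as the correlation between construct standing and observed scores. The sketch below is a heuristic only; the construct standings and score-generating assumptions are simulated, not drawn from any real test.

```python
# Heuristic sketch of the three validity scenarios in Figures 4.1b-4.1d:
# how strongly variation in observed scores tracks variation in construct standing.
import numpy as np

rng = np.random.default_rng(42)
n = 1000
construct = rng.normal(0, 1, n)               # examinees' (unobservable) standing on the construct

ideal = construct                             # Figure 4.1b: scores mirror the construct
worst = rng.normal(0, 1, n)                   # Figure 4.1c: scores unrelated to the construct
typical = construct + rng.normal(0, 1, n)     # Figure 4.1d: scores partly reflect the construct

for label, scores in [("ideal", ideal), ("worst case", worst), ("typical", typical)]:
    r = np.corrcoef(construct, scores)[0, 1]
    print(f"{label:>10}: correlation with construct standing = {r:.2f}")
# Approximately 1.0, 0.0, and 0.7, respectively (the last depends on the noise level chosen).
```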
Finally, as a brief aside, some readers may notice a resemblance between the concept of reliability and the conceptual definition of validity presented here (i.e., the relationship between variation in the characteristic measured by a test and variation in observed performances). This definition of validity calls to mind the classical test theory definition of reliability, ρ_XX′:

ρ_XX′ = σ_T² / σ_X²

where σ_T² is the variance of true scores and σ_X² is the variance of observed scores.
The validity scenario in Figure 4.1c illustrates a situation in which the variation in standing on the construct (i.e., analogous to “true” variation in classical test theory) is the same as the variation in observed scores on the test—a situation that, in classical test theory terms, would represent nearly perfect reliability but, in validity terms, is clearly a “worst-case” scenario as represented. On the one hand, the noticeable resemblance can be thought of as merely another way in which the long-recognized relationship between reliability and validity can be expressed. That is, it is possible for test data to be perfectly reliable while having little to no validity. It also reinforces the aphorism regarding reliability of test data being a necessary but insufficient condition for valid test data. On the other hand, it also suggests a more profound relationship between reliability and validity than is possible to examine in this volume. To be sure, what is referred to in this volume as a “comprehensive framework for defensible testing” is likely not as comprehensive as it might be, given that it fails to capture the essential component of reliability. Some additional attention will be given to the location of reliability in a comprehensive framework for defensible testing later in this chapter when the sources of validity evidence are reconsidered. For now, the possibility of incorporating reliability more completely will be tabled, and attention will be directed toward how the conceptualization of validity presented here is related to the applied endeavor of validation.
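The point that test data can be highly reliable yet carry little validity can be illustrated with a small simulation. The sketch below assumes a hypothetical stable nuisance factor dominating the scores; it is a heuristic, not an analysis of real test data.

```python
# Sketch: scores that are highly consistent across two administrations (reliable)
# yet nearly unrelated to the intended construct (invalid).
import numpy as np

rng = np.random.default_rng(7)
n = 1000
construct = rng.normal(0, 1, n)    # intended construct (e.g., dental knowledge and skill)
nuisance = rng.normal(0, 1, n)     # stable construct-irrelevant factor (e.g., computer familiarity)

# Scores dominated by the stable nuisance factor, plus a little occasion-specific error
form1 = nuisance + rng.normal(0, 0.2, n)
form2 = nuisance + rng.normal(0, 0.2, n)

reliability = np.corrcoef(form1, form2)[0, 1]    # consistency across administrations
validity = np.corrcoef(form1, construct)[0, 1]   # relation to the intended construct
print(f"reliability ~ {reliability:.2f}, validity ~ {validity:.2f}")
# Roughly 0.96 and 0.00: reliable test data are a necessary but insufficient
# condition for valid test data.
```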
Validity and the Validation Process
Returning to the definition of validity presented earlier in this chapter, we now ask: “What are the implications of the definition of validity for how a validation effort should proceed?” In Chapter 1, it was noted that the fundamental starting point for thinking about validity and for all aspects of validation is a clear, specific statement about the intended meaning of scores that will be obtained from a test. Beyond that, there are well-established traditions for
gathering validity evidence. Figure 4.2 provides the first part of a comprehensive model of defensible testing. Reading the model from left to right, it can be seen that the validation process begins with the necessary prerequisite of a clear statement of the inferential claim or score meaning intended by the test developer. This statement guides the validation effort and gathering of evidence which is then evaluated with respect to the support provided for the claim.
Figure 4.2 Comprehensive model of defensible testing (I): Steps in validating score meaning After the intended inferential claim is asserted, the flowchart process then turns to validation of the intended score inference. The bidirectional arrow between the Intended Test Score Meaning and Validation of Score Meaning parts of the model reflects the recursive nature of the validation effort in which the gathering of validity evidence prompts reexamination and refinement of the intended inferential claim, which in turn suggests alternative validation strategies and new or additional sources of evidence. The Sources of Evidence supporting validation of the intended inferential claim are indicated below the Validation of Score Inference step. Greater detail on these sources is provided in the next section. Figure 4.2 then shows that, following collection of evidence related to the intended test score inference, Evaluation of the Validity Evidence occurs. This step involves what Messick (1989) referred to as the integrated, evaluative judgment of the extent to which the available empirical evidence and theoretical and logical rationales provide support for the claimed
meaning of the test scores. The integrated evaluation of the validity evidence is then expressed along a continuum with respect to whether the evidence tends to support the intended test score interpretation (positive) or is disconfirming (negative). Finally, at the bottom of Figure 4.2, it is shown that the validation portion of a comprehensive model of defensible testing must recognize the inherent presence and influence of values in the validation process; the importance of the explicit consideration of this aspect cannot be overstated. Values necessarily underlie the entirety of the testing enterprise (and, as will be shown in Chapter 5, value considerations are equally present in both the validation of intended score meaning and in the justification of intended test score uses). Focusing for now on the validation portion of the model, it is important to recognize the pervasive influence of values and for the values underlying the psychometric processes that guide test development and validation of score inferences to be explicitly considered. Most often, these values are likely to be based on implicit beliefs and assumptions (Longino, 1990), but they are nonetheless present from the outset. As Messick has observed: “Value considerations impinge on measurement in a variety of ways, but especially when a choice is made in a particular instance to measure some things and not others” (1975, p. 962), and “values influence not only our choice of measure or choice of the problem, but also our choice of method and data and analysis. Values pervade not only our decisions as to where to look, but also our conclusions as to what we have seen” (1975, p. 963, citing Kaplan, 1964). For example, all K–12 statewide student achievement testing programs in the United States include mandated, formal, every-pupil assessments in reading and mathematics. None include comparable assessments in the arts. Why not? At least partially, it is likely due to the fact that reading and mathematics learning may be somewhat easier to measure than learning in the arts. However, it is also at least partially attributable to a greater consensus on the part of policy makers regarding the value of learning to read and solve mathematical problems than regarding the value of producing artistic works. In summary, values and beliefs come into play in the validation effort when decisions are made about:
• developing a test in the first place;
• what sources of validity evidence are relevant;
• what sources of evidence should be sought out (and which need not);
• what weight should be given to gathered evidence; and
• how the weighted evidence should be summarized.
Finally, whereas value considerations are present and often implicit in validation efforts, as we will see in Chapter 5, they are not only present but typically considerably more visible
in the justification effort. For example, suppose it was decided to use one test instead of another based on psychometric considerations, perhaps because the predictive validity coefficient of one test exceeded that of the other. Although it may not be recognized, value judgments underlie the privileging of this psychometric evidence over other kinds of evidence that could be used to inform the decision. However, justification deliberations are rarely conducted solely or even primarily on psychometric grounds. In contrast to validation efforts, value considerations are typically much more prominent, and often the very object of the justification effort, as when factors such as feasibility, cost, time, intrusiveness, the perceived seriousness of false positive and false negative decisions, social consequences, and myriad other considerations arise that bear on the decision to actually use a test.
Major Threats to Test Score Meaning
A primary element in the definition of validity presented previously in this chapter is the role of variation—variation in examinees’ standing on the construct of interest, and variation in their observed test performances. At the core of the definition is that validity represents the extent to which variance in examinees’ standing on the construct is reflected in variance in their test scores. Variation in examinees’ standing on a construct may be due to a single component (i.e., the measured construct is unidimensional) or to a constellation of specified factors that contributes to the variation (i.e., the measured construct is multidimensional). In either case, it is a primary goal of validation to identify and account for these intended sources of variation.
Construct-Relevant Variation and Construct-Irrelevant Variation
In the “ideal” validity situation shown in Figure 4.1b, all of the variance in the examinees’ observed test performances is shown to correspond to variance in their standing on the construct. In other words, the variance in test performances is exclusively due to the one or more factors that are relevant to the construct intended to be assessed. The standard psychometric term for this is construct-relevant variation (CRV). For example, if the construct intended to be measured by a computer-administered test was dentistry knowledge and skill, and if the only reason that the test scores of candidates for dental licensure varied was because the candidates varied in their knowledge and skill of dentistry, then all of the variation in their scores would be CRV. Or, if the construct intended to be measured by a paper-and-pencil test was depression, and if the only reason that the test scores of clients varied was because they varied in their levels of depression, then all of the variation in their scores would be construct-relevant. Again, ideal CRV situations such as those just described are highly improbable; it is more typically the case that other factors are unintentionally measured and affect the observed
variation in test performances. These other factors that are not related to the construct of interest but affect examinees’ observed test performances are called sources of construct-irrelevant variation (CIV). Given the desire of a test developer that scores yielded by an instrument can be interpreted by consumers of that information as intended, sources of CIV are particularly troublesome. CIV contributes to variation in examinees’ test performances in a way that makes it appear that examinees differ on the construct of interest, but in fact the observed difference may be illusory, leading to inaccurate interpretations of those scores. For example, let us again consider the hypothetical dental licensure examination and the test for depression. The two panels of Figure 4.3 portray CRV and CIV as proportions of the total observed variation in test scores. The first panel illustrates the sources of observed test score variation for the computer-administered test of dental knowledge and skill; the second panel does the same for the depression instrument. The sources of construct-relevant variation are shown in each panel in bold, capital letters; the sources of construct-irrelevant variation are shown in italics.
Figure 4.3 Sources of variation in dental licensure examination scores and depression test scores
The first panel of Figure 4.3 illustrates a hypothetical context that would provide somewhat supportive evidence of score validity. In this case, the greatest factor contributing to total observed variation in scores on the test is examinees’ dental knowledge and skill—the construct of interest. To be sure, other factors also contribute to variation in examinees’ test performances, as they always do. On the dental licensing examination, examinees having
greater familiarity with the use of a computer are shown to be somewhat advantaged by the fact that the test is computer-based. Computer familiarity would thus be an unintended factor contributing to score variation, or a source of CIV. A second source of CIV is also illustrated for the dental licensure examination: test-taking skills. In this hypothetical situation, examinee possession of greater test-taking skills contributes (positively, and unintentionally so) to their overall test performance. Both of these sources of CIV add “noise” to the measurement process to the extent that confidence in the intended score interpretation (i.e., examinees’ level of dental knowledge and skill) is weakened by the presence of CIV. How are these sources of CIV addressed? The ideal situation is one in which sources of CIV are avoided from the outset. In addition to the procedures and sources of evidence described earlier in this volume for building a strong body of evidence in support of an intended score meaning, the Standards contain advice for “minimizing construct-irrelevant components through test design and testing adaptations” (AERA, APA, & NCME, 2014, p. 57). Even in the presence of test development procedures that have taken it into account, the potential for CIV remains, and investigations should be conducted to detect its presence and magnitude. For example, in the situation illustrated in Figure 4.3, a test developer might discover computer familiarity as a source of CIV by administering a post-testing survey to examinees asking them to self-evaluate their technology proficiency, to rate the ease with which they were able to complete the computerized test administration, or to self-report the kinds of devices, computer software, operating systems, etc. that they regularly use; these types of evidence fall into the Standards category Evidence Based on Relationships with Other Variables. Test-taking skills as a source of CIV might be investigated by asking examinees about the content and intensiveness of any test preparation programs they had used prior to testing (Evidence Based on Relationships to Other Variables) or by observing the strategies examinees use to attack the test items in a cognitive lab or think-aloud process (Evidence Based on Response Processes). In essence, each of the four sources of validity evidence described previously in this chapter can not only serve as potential sources of evidence in support of the intended test score inference, but can also be useful in identifying threats to valid score interpretations. As regards both computer familiarity and test-taking skills, to the extent the test developer was able to identify those sources of CIV, strengthening the validity of the scores could be accomplished by, for example, producing candidate information guides that covered basic information about test-taking strategies and by providing earlier testing experiences for candidates that afforded experience with the computer-based testing platform, and practice with the procedures for accessing the test items, recording and changing responses on the computer. The second panel of Figure 4.3 shows a hypothetical context that illustrates discouraging
evidence of score validity. In this case, the greatest factor contributing to total observed variation in scores on the test is the collection of CIV factors that contributes to examinees' scores on the depression instrument and not the examinees' actual standing on the construct of interest. For example, it appears that scores on the depression measure are strongly influenced by the test taker's gender and reading ability. A test developer faced with this information would almost certainly seek to have the test items reviewed and revised to eliminate stereotypical wording, contexts, or other factors that contributed to score differences among gender groups. In addition, the test developer would likely seek to reduce the reading load of the instrument to reduce the impact of respondents' reading ability on their test scores. The putative test of depression also appears to be measuring other constructs related to depression (e.g., anxiety, mood), but is not squarely focused on the intended construct. This means that clients who have higher or lower standing on those related constructs are mistakenly interpreted as having higher or lower levels of depression.

The last source of CIV for the depression test indicated in the second panel of Figure 4.3 is ethnicity. The interpretation of this source of CIV is the same as that for the gender, reading ability, and other sources; namely, that observed scores on the instrument are higher or lower depending on the test taker's ethnicity and not exclusively due to the test taker's real level of depression. The common label for this concern is bias or differential test functioning. However, it should be recognized that all sources of CIV are, in essence, sources of bias; they are factors that cause variation in test scores unrelated to the construct of interest. Thus, what is sometimes called "test bias"—when test results are affected by factors such as ethnicity, gender, first language, reading ability, or any factor that is not the focal construct—is most appropriately considered a specific manifestation of CIV. Alternatively, bias can be defined in terms of validity. Recalling that validity refers to the extent to which test scores can be confidently interpreted to have the meaning they are intended to have, bias exists when that meaning differs for examinees scoring at the same score level.
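Returning to the dental licensure example, the brief Python sketch below illustrates the kind of relationship a developer might probe for between observed scores and a construct-irrelevant variable such as self-reported computer familiarity. The data, variable names, and effect sizes are entirely hypothetical; in practice a developer would rely on actual survey or demographic data and on more rigorous methods (e.g., differential item functioning analyses) rather than a single correlation.

# Hypothetical illustration: probing for CIV by relating scores to a
# construct-irrelevant variable (self-reported computer familiarity).
# All data below are simulated for demonstration purposes only.
import numpy as np

rng = np.random.default_rng(seed=1)
n = 200
dental_knowledge = rng.normal(0, 1, n)       # intended construct (not directly observable in practice)
computer_familiarity = rng.normal(0, 1, n)   # post-test survey rating (candidate source of CIV)

# Observed scores driven mostly by the construct, but partly by the CIV factor
observed = 0.8 * dental_knowledge + 0.3 * computer_familiarity + rng.normal(0, 0.5, n)

# A nonzero correlation between scores and the irrelevant variable, beyond what
# chance would produce, signals possible CIV that warrants further investigation.
r = np.corrcoef(observed, computer_familiarity)[0, 1]
print(f"Correlation of scores with computer familiarity: {r:.2f}")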
Construct Misspecification and Construct Underrepresentation

A second major threat to validity pertains to how the construct intended to be measured by a test is defined and operationalized. It is possible to make at least two kinds of errors along these lines in the development of a test. One error is construct misspecification. Construct misspecification occurs when the target of a test is not well aligned with the theory that underlies the construct of interest. A somewhat humorous illustration of such misalignment is found in the sarcastic praise a high school student gave to his chemistry teacher: "She was a very thorough teacher: Everything not covered in class was covered on her exam."
Construct misspecification is a serious matter, however. Referring back to the depression test described in the previous section, it can be seen that at least some of the variation in the observed depression test scores is attributable to characteristics that may have some similarity to depression, but are not depression—namely, anxiety and mood. To the extent that the depression measure was developed to include questions that tapped these characteristics, the construct of depression was misspecified. That is, the test was developed to an operationalization of depression that did not align well to accepted theory about the construct. Of course, a test developer may have intended to align to a novel conceptualization of depression. In such a case, the test developer would have the responsibility of providing information about the construct intended to be measured and the scientific, theoretical, or practice-based rationale for doing so. Additionally, the test developer should be cautious in selecting a name for the new test so as not to imply that it measures the construct in the same way as other contemporary, accepted depression instruments, and should provide clear guidance to users as to the intended meaning of the scores.

Another example of construct misspecification can be seen in the example provided by Guion (1980; cited in Chapter 3) in which male and female participants in an experiment were measured on a construct that might be called "golf ball packing ability." Results of the experiment revealed an unexpected finding: female participants had significantly greater levels of that ability. As it turned out, the greater facility of the females was due to the fixed distance at which participants' seats were placed from the golf ball packing assembly line: the short distance advantaged the females, who, on average, had shorter arms than the males, for whom the working conditions were too cramped.

In fact, the golf ball example can be seen in two ways. First, the distance of participants' seating from the assembly line can be viewed as a source of CIV. That is, if a packing company were most interested in measuring participants' packing ability, then the distance of seating from the assembly line represents a factor that influences performance but is unrelated to the construct of interest. Second, if distance of seating from the assembly line was considered to be relevant to the construct a company wished to assess, then construct misspecification had occurred: the construct could not accurately be called "golf ball packing ability," but should be further specified as "golf ball packing ability at a distance of X inches" or a similar modification of the label used to identify the construct, so that it more closely represents the skill or ability being assessed and fosters more accurate interpretations of scores yielded by the procedure.

A second threat to the validity of score meaning is called construct underrepresentation. Construct underrepresentation occurs when what is assessed by a test covers some but not all of the critical attributes of the construct of interest. To provide two examples of construct underrepresentation, we refer back to the dental knowledge and skill examination described
previously. First, suppose that the specifications for the examination called for specified numbers of test questions in seven broad subareas: anatomy, microbiology, biochemistry, physiology, pathology, research, and professional ethics. Now, further suppose that a subsequent job analysis for entry-level dentistry (although it would most appropriately be conducted a priori) indicated that knowledge of the subarea of pharmacology was necessary for safe and effective entry-level practice. The fact that this content was absent from the licensure examination would be an instance of construct underrepresentation.

As another example, let us suppose that the area of pharmacology was added to the specifications and questions on that topic were included in the licensure examination. Although this addition would at least partially address the first concern about construct underrepresentation, it remains questionable whether the fullness of the intended construct, dental knowledge and skill, has been adequately incorporated into the examination as developed and administered. Even given the advantages of a computer-based mode of administration that might allow for some technology-aided assessment formats, it is not likely that candidates can actually demonstrate their skills in intervention, evaluation, and other clinical actions in the computer-based format. Thus, an examination intended to support inferences about dental knowledge and skill would likely still suffer from construct underrepresentation to the extent that an important part of the intended construct remained unaddressed in the licensure examination.
Reconsidering Sources of Validity Evidence

We now return to the Sources of Evidence component of the framework described previously and illustrated in Figure 4.2, which shows the first half of the comprehensive framework for defensible testing. The Sources of Evidence component of the model captures all of the sources of evidence that could be brought to bear on the intended score meaning. These sources include—but are not limited to—the sources of evidence listed in the Standards and described in Chapter 2: Evidence Based on Test Content, Evidence Based on Response Processes, Evidence Based on Internal Structure, and Evidence Based on Relationships to Other Variables. In this section, two modest revisions to these sources are proposed.
Relationships among Variables

One obvious revision to the sources of validity evidence as included in the Standards is that the sources Evidence Based on Internal Structure and Evidence Based on Relationships to Other Variables should be combined into a single source. A critical analysis of these sources reveals that they are essentially both examining relationships among variables, with the only difference between them being a cosmetic and ignorable one: what the Standards refer to as Evidence Based on Internal Structure comprises (see Chapter 2) analyses where all of the variables examined in any analysis are internal to the test itself (e.g., factor analysis, dimensionality analysis, internal consistency analysis, etc.); what the Standards refer to as Evidence Based on Relationships to Other Variables comprises analyses where one or more of the variables examined are external to the test (e.g., group membership, treatment status, other test scores, and so on). Put simply, the variables typically studied under the current heading of "internal structure" are the items that comprise a measure and examinees' responses to those items within a given test; the variables typically studied under the current heading of "other variables" are test takers' responses to the items in a test and their responses obtained on other measures. Clearly, both of these existing sources listed in the Standards represent examinations of relationships among variables, with the only difference being that under one heading the variables are internal to the test and under the other heading the variables are external to the test. It seems both reasonable and unifying to collapse these sources into a single source. That source might be called simply "Evidence Based on Relationships among Variables"; however, an additional revision seems warranted.

It was asserted previously that validation activities should be planned and conducted in a theory-driven or logic-driven manner. Willy-nilly inclusion of variables in a validation effort simply because they are easy to collect, traditionally included, or have some surface appeal should be avoided. Whatever variables are included in the course of amassing validity evidence should be guided by theoretically grounded positions and research-, logic-, or evidence-based expectations about how those variables should be related. To highlight and promote this principle, it is suggested that the combined source of validity evidence be relabeled "Evidence Based on Hypothesized Relationships among Variables."
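As a purely illustrative sketch of the point that both categories reduce to relationships among variables, the simulated example below computes one "internal" relationship (coefficient alpha among items) and one "external" relationship (a correlation between total scores and scores on another measure). The data, sample sizes, and the single-trait assumption are invented for the illustration and are not a prescription for how a validation study should be designed.

# Illustrative only: "internal structure" and "other variables" evidence both
# examine relationships among variables. Simulated data, single underlying trait.
import numpy as np

rng = np.random.default_rng(seed=2)
n_examinees, n_items = 300, 10

trait = rng.normal(0, 1, n_examinees)
items = (trait[:, None] + rng.normal(0, 1, (n_examinees, n_items)) > 0).astype(int)
total = items.sum(axis=1)

# "Internal" relationships: items with items (here, coefficient alpha)
k = n_items
item_vars = items.var(axis=0, ddof=1)
alpha = (k / (k - 1)) * (1 - item_vars.sum() / total.var(ddof=1))

# "External" relationships: total scores with scores on another measure
other_measure = trait + rng.normal(0, 1, n_examinees)
r_external = np.corrcoef(total, other_measure)[0, 1]

print(f"Coefficient alpha (internal relationships): {alpha:.2f}")
print(f"Correlation with other measure (external relationships): {r_external:.2f}")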
Test Development and Administration Procedures

The test development and administration processes are typically configured—implicitly or explicitly—to provide support for the intended test score meaning. According to Kane (2006b), the validity of test score inferences often has substantial ties to the test development process. Regrettably, however, many of the steps included in the development and administration of a high-quality test that can add to confidence in the intended score interpretation are either excluded from the current Standards, not acknowledged in the Standards chapter on validity, or scattered across a mix of chapters. For example, in its chapter on Fairness in Testing, the Standards note that prescribed test administration, scoring, and security protocols should be followed so that the validity of the resulting scores is supported (AERA, APA, & NCME, 2014, p. 65). Additional guidance related to test administration procedures is presented in the Standards chapter on the rights and responsibilities of test takers (pp. 131–138) and other locations. Overall, it seems important to acknowledge the many, varied test
development and administration procedures as a unique source of validity evidence, which might be called "Evidence Based on Test Development and Administration Procedures." A complete listing of all possible procedures that might be subsumed within this category is beyond the scope of this book. However, some examples seem appropriate to illustrate the kinds of activities that might be recognized in this new category. Table 4.2 provides some examples of the processes, activities, and guidelines that are typical of high-quality test development and administration, but are not explicitly or fully acknowledged and integrated as sources of validity evidence.

Table 4.2 Evidence sources based on test development and administration procedures

Context: Test development
• Item/task generation procedures
• Rubric development, verification procedures
• Item key reference materials documentation
• Judgmental bias/sensitivity reviews
• Age/developmental appropriateness reviews
• Human rater scoring qualifications, training, calibration, consistency checks
• Automated scoring algorithm development, cross-validation
• Test developer's on-site procedures for ensuring integrity, security of test materials

Context: Test administration
• Test speededness research; establishment of test timing guidelines; availability and adequacy of breaks/pauses
• Candidate information materials, test administrator manuals
• Test administrator qualifications, training; proctor training
• Test site requirements to ensure logistics conducive to examinee performance (e.g., space, ventilation, seating, technology, distractions, etc.)
• Test site procedures for ensuring integrity, security of testing materials and test administration
• Post hoc procedures for reviewing test administrations for potential test security violations (e.g., cheating)
It might be argued that some of the procedures and activities indicated in Table 4.2 are captured or are implied under the existing categories of validity evidence, Evidence Based on Test Content or Evidence Based on Relationships to Other Variables. However, even so, they are captured only awkwardly at best. It seems an appropriate initiative to recognize the unique and distinguishable contributions to confidence in test score meaning that are represented by these, and other, procedures in the test development and administration process and to provide guidance and best practices related to these procedures in the Standards.
A Revised Menu of Sources

The subsections above suggest two modifications to the menu of categories that comprise sources for gathering evidence relevant to the intended interpretations of test scores. Taking into account these modifications leaves a slate of potential sources of validity evidence that is very similar to that presented in the current Standards. The four sources would include:

• Evidence Based on Test Content;
• Evidence Based on Response Processes;
• Evidence Based on Hypothesized Relationships among Variables; and
• Evidence Based on Test Development and Administration Procedures.
Finally, it should again be noted that absent from this list is one of the sources of validity evidence currently identified in the Standards: Evidence Based on Consequences of Testing. On the one hand, as was shown in Chapter 3, the consequences of testing do not actually bear on the intended meaning of test scores, so they are not properly considered as a source of validity evidence. On the other hand, attention to consequences is essential in a comprehensive model of defensible testing; to that end, the role of consequences information will be considered in depth in Chapter 5 where the second half of a comprehensive framework for defensible testing is presented.
Summary and Conclusions

This chapter began by reconceptualizing the testing endeavor as presenting two primary research questions that must be addressed; namely, "What do these test scores mean?" and "Should these test scores be used [for some specified purpose]?" On the surface, such a reconceptualization might seem uncontroversial. However, because conventional thinking in validation has conflated score meaning and use, it has not been successful in accomplishing what adherents to the conventional approach have hoped—singular evaluative syntheses of evidence of both meaning and use. These two aspects of testing—that is, validation of intended test score meaning and justification of intended test score use—are distinguishable on many dimensions and cannot be combined into a single concept. However, taken together, the steps involved and evidentiary sources that are relevant to answering these questions comprise a unified, comprehensive framework for defensible testing.

The first question, "What do these test scores mean?" is the essential question of validity. The pursuit of validity is the pursuit of a result in which observed test performances can be confidently interpreted as faithful indicators of examinees' standing with respect to the construct intended to be tapped in the test. Thus, the meaning and interpretation of scores is the focus of validation efforts. A definition of validity was presented as "the degree to which scores on an appropriately administered test support inferences about variation in the construct that the instrument was developed to measure." The remainder of this chapter reviewed methods, evidence sources, and challenges to answering the first question, along
with common concerns that arise which can threaten confidence in score interpretations. Suggestions for modest revisions to standard categories and sources of validity evidence were proposed. It was also asserted that even solid evidence yielding strong confidence in the validity of scores does not provide support for actually using those scores for any particular purpose (excepting for the caveat that confidence in the meaning of a test score is a necessary precondition for even considering a particular use). Beyond the validity evidence that should be gathered, synthesized and evaluated along the lines of the model of the validation process shown in Figure 4.2—the first half of the full model of defensible testing—a separate, equally rigorous and searching investigation must be undertaken to gather and evaluate evidence that may bear on any intended use of scores obtained from administration of a test. This important concern, Justification of Intended Test Score Use, is the focus of the next chapter.
5
A COMPREHENSIVE FRAMEWORK FOR DEFENSIBLE TESTING, PART II
Justifying the Intended Uses of Test Scores
In the best case, if the interpretation model was implemented rigorously, the interpretations of test scores would be validated, and then, in a separate process, the uses of the test scores would be evaluated. (Kane, 2015, p. 8)
The comprehensive framework for defensible testing introduced in Chapter 4 comprises two, equally important, endeavors: (1) gathering and evaluating evidence bearing on an intended test score meaning and (2) gathering and evaluating evidence to justify an intended test score use. Chapter 4 provided a full description of the activities related to the validation component of the framework; this chapter provides a parallel treatment for the justification component. In terms of the two distinct research questions that must be answered, the activities related to Research Question 1 (“What do these scores mean?”) were addressed in the previous chapter. This chapter provides information on Research Question 2: “Should these test scores be used for Purpose X?”
Purposes of Testing

As Zumbo has observed, "It is rare that anyone measures for the sheer delight one experiences from the act itself. Instead, all measurement is, in essence, something you do so that you can use the outcomes" (2009, p. 66). The uses to which tests are put can vary substantially. A listing of some of the potential purposes is shown in Table 5.1. Those purposes are often taken into account during the test development process, or they may arise subsequent to test development. A few of the possibilities for the purposes of test development and use include:

• a test may be developed strictly for research purposes;
• a test may be developed for an applied purpose, but with no consequential use anticipated;
• a test developer may initiate the test development process with a specific, consequential test use in mind;
• a test developer may have vague ideas about the range of possible intended uses, but without a primary purpose in mind; and
• a test user may wish to use a test developed for a purpose that was not intended by the developer.

Table 5.1 Some purposes of testing

Selection; Planning; Research; Guidance; Program evaluation; Self-information; Diagnosis; Evaluation (e.g., grading); Placement; Classification; Accountability; Performance feedback; Employment; System monitoring
Importantly, despite the variety of conditions sampled above, the specific circumstances surrounding an intended test use do not change the reality that the two key questions must be answered—and that they demand different sources of evidence and standards, and invoke different values. The enterprise of validating an intended score meaning is necessarily separable from justifying any intended score use. Both must be done to accomplish professionally defensible testing practice.

It should also be emphasized that the concerns of validation and justification often interact. Evidence supporting the validity of intended test score inferences is a necessary but insufficient condition for recommending or sustaining a justification for test use. Validity evidence is an essential part and precursor of the justification for the use of a test—but only a part—and one that may carry greater or lesser weight in deliberations concerning the intended use. As Borsboom and Mellenbergh have stated in the context of tests used for selection, placement, or with the goal of bringing about changes in society at large, "validity may play an important role in these processes, but it cannot by itself justify them" (2007, p. 109).
Justification of Test Score Use

If the body of validity evidence and conclusions from engaging in Part I of the comprehensive model (i.e., the validation activities) afford an acceptable degree of confidence that the scores yielded by a measure can be interpreted with their intended meaning, the focus then shifts to Part II of the process—the activities of investigating,
gathering evidence, and evaluating the justification for the intended use of a test. If more than one use is contemplated, a justification effort for each use would be required. Figure 5.1 illustrates the second part of a comprehensive model of defensible testing. Reading the model from left to right, it can be seen that the activities comprising the justification process are parallel to those involved in the validation process. First, the justification process begins with the necessary prerequisite of an explicit statement regarding the intended use(s) of scores generated by a test. This statement guides the justification effort, which includes gathering evidence from various sources that is then evaluated with respect to the support it provides for the intended use. Also like the validation component, the evaluation of justification evidence results in a positive or negative overall decision regarding the proposed use. If negative, the use of the measure for the stated purpose is rejected or a more limited or different purpose might then be contemplated. If positive, and if an operational test use occurs, additional information is typically generated (e.g. anticipated benefits and consequences of testing are observed), which can provide additional evidence relevant to justifying the intended use.
Figure 5.1 Comprehensive model of defensible testing, part II: steps in justifying score use

Finally—and again parallel to the validation process—Figure 5.1 illustrates that the justification process is affected by and must also recognize the inherent presence and influence of values. A comprehensive model for defensible testing must take into account these diverse perspectives, values, and priorities in a systematic and open manner when the decision to use a test in some intended manner is being made. The answer to whether a test should be used in some intended manner is not automatically determined based on such input. For example, it might be decided to nonetheless use a test for some purpose despite the fact that it may be found to be less effective at accomplishing some desired outcome than
other alternatives, more costly, less likely to reap a benefit, accompanied by undesirable consequences, or evaluated as unfair to some persons, groups, or interests. Nonetheless, an open consideration of differing stakeholder needs, diverse perspectives, and competing values should occur.

With regard to sources of evidence for intended score meaning, there are long-standing and well-developed methodological guidelines and established sources of evidence, drawing primarily on psychometric traditions. The Standards for Educational and Psychological Testing (AERA, APA, & NCME, 2014) are perhaps the foremost example of such guidelines, having first been formulated over 50 years ago and now in their sixth edition. Unlike standards for score meaning, however, similarly long-standing and broadly endorsed methods, standards, and sources of evidence for justifying an intended test use do not exist. Although the psychometric tradition has provided a solid foundation for developing sources of evidence bearing on validity, it seems ill-suited as a framework for developing sources of evidence bearing on justification of test use. It is obvious that there is a critical need to develop sources of evidence for justification of test use to the same refined state as the current menu of sources of evidence for validation of score interpretations.
A Foundation for Justifications of Intended Test Score Use

A potentially appropriate foundation for guidance related to justification of test use is found in the theory and methods of program evaluation. The field of program evaluation seems especially appropriate for three reasons. First, the discipline of program evaluation is a relatively mature field with accepted traditions (see Shadish, Cook, & Leviton, 1991), logic and methods (see Scriven, 1995; Stufflebeam & Zhang, 2017), evidentiary sources (see Donaldson, Christie, & Mark, 2009), and professional standards of practice (see Yarbrough, Shulha, Hopson, & Caruthers, 2011). Second, program evaluation typically addresses applied questions of decision making, merit, worth, or practical significance—foci that are similar to the central questions of test use. Third, and perhaps most importantly given the often-contested potential uses of test scores, the field of program evaluation has long recognized and incorporated the necessity to identify stakeholders, the realities of differential negotiation acumen and rhetorical skill among stakeholders, the interplay of values and politics in evaluation decision making, and contested notions of desirable outcomes (see Patton, 2008; Stake, 2004)—all of which comprise factors that are clearly relevant when contemplating the use of a test for a given purpose.

Especially desirable is the fact that the field of program evaluation has professional standards that are in many ways similar to the psychometric best practices embodied in the Standards for Educational and Psychological Testing (AERA, APA, & NCME, 2014). In the area of program evaluation, the
Program Evaluation Standards (Yarbrough, Shulha, Hopson, & Caruthers, 2011) comprise a set of best practices grouped according to five main categories:

• Utility Standards—intended to increase the extent to which program stakeholders find evaluation processes and products valuable in meeting their needs;
• Feasibility Standards—addressing the effectiveness and efficiency of the evaluation effort;
• Propriety Standards—providing guidance on what is proper, fair, legal, right, and just in program evaluations;
• Accuracy Standards—aiming to increase the dependability and truthfulness of evaluation activities, especially those involving interpretations and judgments about quality; and
• Accountability Standards—describing appropriate documentation of evaluations and procedures for evaluating the evaluation processes and products themselves.

An example of the kinds of established traditions in the field of program evaluation is found in the model for conducting an evaluation developed by the Centers for Disease Control and Prevention (1999). The CDC model, shown in Figure 5.2, comprises six steps that can be recognized not merely as essential elements of the program evaluation model, but as critical steps in justifying the intended uses of test scores. The steps include:

• engaging stakeholders;
• describing the program;
• focusing the effort;
• gathering credible evidence;
• justifying conclusions; and
• ensuring use and documenting lessons learned.
Figure 5.2 CDC model of program evaluation Source: Centers for Disease Control and Prevention (1999); used by permission.
The model shown in Figure 5.2 depicts a process explicitly centered on the Program Evaluation Standards (Yarbrough, Shulha, Hopson, & Caruthers, 2011), and one that is recursive—continually seeking to identify and engage relevant stakeholders, to gather and evaluate evidence, to justify conclusions, and to document and evaluate eventual decisions and uses.

Regarding specific methodologies, the field of program evaluation has developed systematic strategies for investigating the economic, social, and organizational issues that often underlie justification of test use. For example, specializations in program evaluation include needs assessment (Altschuld & Watkins, 2014), cost-effectiveness investigations (Levin, 1987; Levin & McEwan, 2000), cost-benefit approaches (Mishan & Quah, 2007),
and cost-utility analyses (Robinson, 1993). As will be shown, each of these activities can be relevant to obtaining evidence justifying an intended test score use. Finally, it is recognized that the concept of fairness is central to modern validity (see Camilli, 2006), although it is also recognized that the concept of fairness is one that has many meanings and has evolved over time with respect to the classification of examinees based on test performances (see Zwick, 2017; Zwick & Dorans, 2016). However, the concept of fairness has a long tradition of explicit consideration within the field of program evaluation. For example, as noted above, the Program Evaluation Standards explicitly include the concept of fairness in its Propriety standards generally (see above) and in specific Propriety standards statements (see Standard P4).
Sources of Evidence for Justifying an Intended Test Score Use

Referring back to the second part of the comprehensive framework for defensible testing shown in Figure 5.1, it can be seen that justification of test use begins with a clear articulation of the intended test score use to be scrutinized. This is analogous to the first step in validation, which comprises clear articulation of the intended score meaning. Adapting the program evaluation model shown in Figure 5.2 to the field of assessment, subsequent critical steps include gathering and evaluating credible evidence that bears on the intended test score use.

The remainder of this section lists and describes the sources of evidence that might be brought to bear in justifying an intended test use. Table 5.2 provides a list and examples of these sources. The organization of the sources is purposefully similar to the sources of evidence for validation. The first column in the table lists four categories of possible evidence: Evidence Based on Consequences of Testing, Evidence Based on Costs of Testing, Evidence Based on Alternatives to Testing, and Evidence Based on Fairness in Testing. The second column provides examples of each type of evidence. Additional information on each source is presented in the following sections.
Table 5.2 Sources and examples of evidence for validation of score meaning and justification of test use

Sources of evidence for justifying test use (with examples):

Evidence Based on Consequences of Testing
• Evaluation of anticipated benefits
• Consideration of negative consequences
• Consideration of false positive, false negative rates

Evidence Based on Costs of Testing
• Overall cost of testing
• Cost-benefit, cost-effectiveness, cost-utility analyses
• Consideration of opportunity costs

Evidence Based on Alternatives to Testing
• Evaluation of relative value of alternative testing methods, formats, or procedures
• Evaluation of non-test options to accomplish intended goals

Evidence Based on Fairness in Testing
• Evaluation of stakeholder inclusion
• Investigation of opportunity to learn
• Provision of due notice
• Examination of disparate impact across groups
Evidence Based on Consequences of Testing

It is equally clear that consequences of testing are not typically a source of validity evidence as it is that consequences must be accounted for in a comprehensive model of defensible testing. A critical distinction is also relevant here: when speaking of consequences of testing, it is not necessarily the consequences of testing, per se, that are the primary concern, but rather the consequences of a policy that involves testing. Simply requiring a test may have one set of consequences (e.g., it may reduce available instructional time in elementary or secondary schools, it may inconvenience candidates who must drive to a distant test center, it may increase anxiety in test takers of any age or in any context, etc.). However, it is often the comparatively more serious consequences associated with test performance and associated policies or uses of test scores that are of greatest concern. For example, some policy consequences of testing can include promotion/retention decisions in schools, selection for desirable/less desirable residency placements, qualification for scholarships, identification for career or educational opportunities, and so on. Importantly, whether one considers the comparatively less serious consequences associated with requiring a test or the comparatively more serious consequences associated with a policy involving test data, evidence based on consequences of testing can provide strong justification for the decision to even have a test in the first place or for any specific intended use of test scores.

Intended and Unintended Consequences

Although not phrased in terms of consequences, the chapter on validity in the current Standards includes a strong admonition related to consequences of testing, noting that:

When it is clearly stated or implied that a recommended test score interpretation for a
given use will result in a specific outcome, the basis for expecting that outcome should be presented with relevant evidence. (AERA, APA, & NCME, 2014, p. 14)

Put perhaps more simply, when a claim is made about an intended consequence of testing, evidence should be provided that justifies the claim. In such cases, evidence based on consequences of testing can take many forms, and the consequences, benefits, or outcomes may be intended ("stated or implied" in the wording of the Standards) or they may be unintended.

On the one hand, examination of anticipated positive consequences or purported benefits may provide justification for the use of a test. For example, in medical professions contexts, public protection is often asserted as an intended consequence of requiring testing for licensure or certification. To the extent that patients receive care that is safer and more effective, an intended benefit of testing accrues. In education contexts, tests administered as part of accountability systems are sometimes mandated with the explicit or implied expectation that their presence will enhance student engagement and motivation, promote increased educator effort, or stimulate increased learning and achievement. Even if the results are used only by legislators and policy makers to monitor expenditures and their effectiveness in providing appropriate public education opportunities, a benefit is accrued. In both contexts, such consequences would represent desirable benefits and would provide positive evidence in support of the use of a test.

On the other hand, potential negative consequences of test use might be anticipated. For example, the use of a test—a gatekeeping mechanism—in the health professions may discourage members of underrepresented demographic groups from pursuing licensure or certification. Credentialing examinations can limit access to a profession for reasons unrelated to candidate knowledge and skill. In education, the use of a test as a high school graduation requirement can alter student course-taking patterns, increase test anxiety in students, lead to increased rates of dropping out of school, or prompt unethical behavior (e.g., cheating) on the part of examinees and educators in both educational (see Cizek, 1999; Cizek & Wollack, 2017) and credentialing (see Wollack & Cizek, 2017) contexts. Negative consequences such as these that can be anticipated represent disincentives to the use of a test, should be examined, and could weigh strongly in a decision not to use a test for an intended purpose.

In addition to the consequences of testing that can be anticipated, there is the potential for unintended consequences to accrue. For example, some high school students may not apply to more selective colleges if they view their college admissions test scores as "ceilings" (an unintended negative consequence). Some candidates for advanced certification in a nursing specialty area may seek out additional professional development or pro bono service
opportunities to enhance their knowledge, experience, and clinical skills (an unintended positive consequence). Although, by definition, unintended consequences cannot be studied in advance, to the extent that they arise over the course of a testing program, information on those consequences can be gathered, and that information may be a valuable source of evidence bearing on the justification of the test's use.

Consequences of Classification Errors

A third source of evidence based on consequences of testing centers on the inevitable result that classification errors will occur. That is, testing often results in the assignment of test takers to performance categories such as Pass/Fail, Admit/Reject, Certify/Deny Certification, and so on. Ideally, candidates who truly have the necessary knowledge or skill would obtain test scores that meet or exceed the criterion (true positive decisions) and those who truly lack the judged level of knowledge or skill would fail to meet the criterion (true negative decisions). In reality, however, classification errors are sometimes made. Candidates who do not truly possess the level of knowledge, skill, or ability deemed necessary may nonetheless pass a test (often due to random measurement error, cheating, or other factors); such classification errors are often referred to as false positive decisions. Candidates whose knowledge and skill truly exceed the criterion for certification but whose performance falls below the established criterion (again, often due to random errors or other factors) may be improperly denied a license or credential; such classification errors are called false negative decisions.

In considering the consequences of using a test, it is necessary to explicitly consider the estimated frequencies, relative proportions, and values related to the relative seriousness of false positive and false negative classifications. The justification for using test scores for a given purpose may be called into question when the ratio of false positive to false negative classification decisions is judged to be unacceptable, when the seriousness of one type of error is judged to be incommensurate with the benefits of testing, or when reasonable estimates of the frequency and relative proportions of the two kinds of errors cannot be obtained.

To illustrate these kinds of classification errors and the value considerations that must be weighed when considering a test use, let us consider two contexts—one involving educational achievement testing, the other medical professions licensure testing. In the educational testing context, suppose that a statewide student achievement testing program mandated that students must score at a certain level in reading in order to be promoted from third to fourth grade. For various reasons, some students who actually do possess the level of reading skill required may fail to obtain the required score on the test—a false negative classification decision. For such students, what are the impacts of the false negative classification (and potential retention) on their motivation, self-concept, peer relationships, persistence, and
other social and academic outcomes? Other students who do not possess the level of reading skill required may pass the test—a false positive classification decision. For those students, what are the consequences of being promoted to fourth grade lacking the judged level of reading skill required to be successful in that grade—and beyond? And which classification decision error is more serious: requiring a student to repeat third grade who does not truly need the remedial year, or allowing a student to progress to fourth grade who is underprepared?

Even when classification decisions are correct, there are implications for the justification of using a test for a given purpose. For example, for students who failed to obtain a passing score and who truly lacked the necessary level of reading skill—true negative classification decisions—would retention likely provide them the needed remediation? Would they be equally likely to progress in reading skill if promoted? What would be the consequences for the school system in terms of the additional staff and space resources necessary to provide remediation or expand the number of third grade enrollment slots because of retentions? What would be the consequences for families if failing students were required to attend summer school? Even the very notion of summer school placement as a consequence can have a different valence to different stakeholders: it is possible that parents may tend to see requiring a student to attend summer school as punitive, whereas educators may tend to view it as a valuable educational opportunity that will prove beneficial to students' long-term educational success. In short, consequences themselves are not by their nature positive or negative; their valence depends on the values and perspectives that vary across persons and groups.

Turning briefly to the medical licensure context, a false positive classification decision would involve awarding a license to practice medicine to a candidate who lacked the knowledge and skill for the safe and effective practice of medicine—a serious concern, given the harm that an unqualified physician could inflict on his or her future patients. A false negative classification decision would involve denying a license to practice medicine to a candidate who would truly be a safe and effective practitioner—a serious concern given the years and expense that the candidate has invested in a career that he or she is now prevented from entering. And which of these classification errors is more serious? How much more serious is one than the other? And how should it be taken into account if the use of a test results in one type of classification error judged to be very serious occurring far less frequently than another, less serious, type of misclassification that occurs more frequently?

Conclusions about Consequences

Overall, four conclusions seem warranted regarding consequences of testing. First, it is important to restate that all of these eventualities outlined in the educational and medical scenarios just described are not consequences of the test per se—it may be a psychometrically valid and dependable reading comprehension measure or medical licensure
examination—but the question remains as to whether it is justifiable to use those tests as specified by the particular educational or social policy that requires them.

Second, as the examples of the use of a test to inform promotion/retention decisions or medical licensure illustrate, the use of a test can have serious consequences for students, families, organizations, and the public. Research to gather evidence related to those consequences—intended and unintended, negative and positive—and the weighing and evaluation of that evidence must take place when deciding whether it is justifiable to use a test for a specific purpose.

Third, it is not enough to make assumptions about potential consequences of testing—positive or negative. Rather, anticipated consequences should be examined to the extent possible in advance of using test scores in some consequential manner; unanticipated consequences should be investigated as they arise over the lifetime of a test to provide evidence justifying (or not) its continued use.

Finally, because all testing is susceptible to errors, misclassifications can occur. As the consideration of false positive and false negative classification decisions illustrated, some classification errors are more serious than others, the relative proportions of false negative and false positive decisions may differ, and the frequency of any type of classification error may be greater or lesser. These aspects of classification errors must be formally considered with the realization that different stakeholders not only bring different perspectives on the seriousness of the various kinds of classification error but can also view the same outcome as a positive or a negative consequence.
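One way to make such deliberations concrete is to combine estimated misclassification rates with stakeholder-judged seriousness weights, as in the brief Python sketch below. All of the rates, weights, and candidate counts are hypothetical placeholders; the weights in particular would need to be elicited from stakeholders as part of the justification effort rather than assumed.

# Hypothetical sketch: combining estimated misclassification rates with
# judged seriousness ("cost") weights. All numbers are invented.
false_positive_rate = 0.04   # e.g., unqualified candidates who nonetheless pass
false_negative_rate = 0.06   # e.g., qualified candidates who nonetheless fail
n_candidates = 5000

# Stakeholder-judged seriousness weights (arbitrary units)
weight_fp = 10.0   # judged harm of licensing an unqualified practitioner
weight_fn = 3.0    # judged harm of delaying a qualified candidate

expected_fp = false_positive_rate * n_candidates
expected_fn = false_negative_rate * n_candidates
weighted_burden = expected_fp * weight_fp + expected_fn * weight_fn

print(f"Expected false positives: {expected_fp:.0f}")
print(f"Expected false negatives: {expected_fn:.0f}")
print(f"Seriousness-weighted misclassification burden: {weighted_burden:.0f}")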
Evidence Based on Costs of Testing

Developing and administering a test can be costly. Estimates vary, but the cost of developing a single multiple-choice test item for a high-quality licensure examination can be $1,000–$2,000 per item or higher. The cost of testing center time for administration of a certification test runs approximately $15–$20 per hour. Using the upper end of those estimates, the total cost to develop and administer a single four-hour, 300-item credentialing test to 5,000 examinees is in the neighborhood of $1 million. That total is considerably less than the national financial outlay for accountability testing in K–12 education. A recent estimate of annual expenditures on mandated educational achievement testing pegged the total at $1.7 billion (Chingos, 2012). However, even an accurate accounting of monetary costs is a piece of evidence that must be contextualized and ascribed with some value. For example, a national expenditure of $1.7 billion on K–12 educational achievement testing might seem like a large amount, but it certainly seems smaller when compared to U.S. spending for K–12 education generally, which in 2019 tops $700 billion, making testing expenditures only one-quarter of one percent of the total. The cost per student to take a summative achievement test offered by the Smarter Balanced or Partnership for Assessment of Readiness for College and
Careers (PARCC) consortia of states seems downright tiny at approximately $22–$29 per student (Gewertz, 2013).

Of course, those monetary expenditures are only one type of cost that might be considered. There are other personnel costs in terms of time that are not included in those figures. Test takers spend time studying for tests; instructors spend time helping them prepare for tests; systems and procedures must be put in place to oversee testing and evaluate test results. Overall, it is perhaps impossible to accurately estimate the total monetary, human, time, organizational, and societal costs even for a modest testing program.

Are these costs worth it? The data collections necessary for answering that question provide a source of evidence that can justify (or not) the use of a test. The overall cost of testing includes not only an accurate accounting of monetary expenditures, but also a full accounting of all other costs such as candidate and testing personnel time, physical space, and so on. Would it be "worth" administering a medical licensure test if it cost only $100 to develop each item and only $10 per hour of testing center seat time (less than half the current cost)? Would it be worth it at twice the current cost? Would mandated K–12 student assessment programs be "worth" it if they consumed twice as much (or half as much) instructional time? Even with reasonable estimates of overall cost, the question of worth must be answered, where worth is defined in terms of some good that is accrued. And, notably, even if a good could be identified and quantified, it would still be subject to differential valuing. That is, different stakeholders may view one outcome as an unequivocal good; others may view the same outcome less enthusiastically. Even stakeholders who evaluate the relative value of the identified good in the same way might weigh it differently in comparison to the identified costs.

Here, too, the traditions and methods of program evaluation can be helpful in that they provide established procedures for identifying, quantifying, and comparing the costs and benefits associated with testing. Although there are many useful approaches to accomplish this in the program evaluation literature, three selected approaches are described in the following sections: cost-benefit analysis, cost-effectiveness analysis, and cost-utility analysis.

Cost-Benefit Analyses

Cost-benefit analyses answer the question, "What is the ratio of the costs of testing to the benefits of testing?" In cost-benefit analyses, both the costs and the benefits must be identified and expressed in dollars. For example, estimates of the costs per student such as those identified for the Smarter Balanced or PARCC testing programs could be used to obtain an overall monetary cost for statewide testing. If deemed appropriate, local district personnel time to administer the tests and state personnel time to oversee the system could be estimated to arrive at a total cost. In order for cost-benefit analysis results to provide strong justification evidence, however, there must be agreement among stakeholders on the costs to be included
and those costs must be accurately estimated. That is difficult enough. Beyond this, cost-benefit analyses also require that the benefits of testing be identified and monetized. It may be challenging to identify the potential benefits that might accrue from testing, including many intangible benefits (e.g., heightened taxpayer support of the state's educational system, formative use of data for instructional decision making by educators, enhanced employability of the state's high school graduates, increased motivation for students and families to take responsibility for educational goal-setting and progress monitoring, and so on); it is likely to be even more difficult or impossible to quantify those benefits in dollars.

Table 5.3 provides a hypothetical illustration of a cost-benefit analysis. In the first column of the table, three testing strategies that a small school district is considering are listed: a traditional paper-and-pencil test, a computer-adaptive test, and individual student oral or performance examinations. For the sake of the illustration, let us assume that the district has obtained concrete estimates of all equipment, material, personnel and other costs involved from a testing vendor; these cost estimates are shown in the second column. The third column of the table gives the monetized estimates of benefits. These estimates may have included benefits in terms of personnel time savings, ease of integration with existing systems, increased student achievement or other outcomes. The Net Benefit column is simply the dollar amount listed in the Benefit column minus the amount listed in the Cost column. A positive amount in the Net Benefit column reveals an overall benefit of implementing a given testing strategy; a negative amount indicates a net loss; and zero indicates that the costs and benefits associated with a given strategy are equal. Finally, the C/B column provides the cost/benefit ratio. A value of 1.0 indicates that the costs and benefits are the same; values greater than 1.0 indicate lesser overall benefits relative to costs; values less than 1.0 reveal greater "bang for the buck." In general, the results for Net Benefit and C/B permit consideration of which alternative provides the greatest net benefit (in dollars), and which has the lowest cost-to-benefit ratio.

Table 5.3 Example of cost-benefit (C/B) analysis

Testing strategy                                           Cost ($)   Benefit ($)   Net benefit ($)   C/B ratio
Paper-and-pencil multiple-choice testing                    200,000       200,000                 0        1.00
Computer-based adaptive testing                             350,000       225,000          −125,000        1.56
Individualized oral examinations or performance testing     500,000       750,000           250,000        0.67
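The arithmetic behind Table 5.3 is simple enough to express directly; the short Python sketch below reproduces the net benefit and C/B ratio calculations using the hypothetical figures from the table.

# Cost-benefit calculations for the hypothetical figures in Table 5.3
strategies = {
    "Paper-and-pencil multiple-choice testing": (200_000, 200_000),
    "Computer-based adaptive testing": (350_000, 225_000),
    "Individualized oral examinations or performance testing": (500_000, 750_000),
}

for name, (cost, benefit) in strategies.items():
    net_benefit = benefit - cost   # positive values favor the strategy
    cb_ratio = cost / benefit      # values below 1.0 indicate more "bang for the buck"
    print(f"{name}: net benefit = {net_benefit:+,}, C/B ratio = {cb_ratio:.2f}")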
In the hypothetical illustration shown in Table 5.3, computer-based adaptive testing presents greater costs than the benefits it realizes; costs and benefits are equal for traditional paper-and-pencil testing; and individualized examinations provide somewhat greater benefits despite their substantially greater cost. When used as part of evidence gathering to justify the use of a test, cost-benefit analyses can provide cost-based comparisons to evaluate the relative value of a test.
Figure 5.3 Complete comprehensive model of defensible testing

Cost-Effectiveness Analyses

Cost-effectiveness analyses provide another potential source of evidence to justify the use of a test. Like cost-benefit analyses, the cost(s) of some intervention—such as developing and administering a test—must be estimated. However, unlike cost-benefit analyses, in which the outcome must also be monetized, cost-effectiveness analyses consider other types of outcomes. The key question answered by a cost-effectiveness analysis is, "How much does it cost to obtain a given amount of x?" where x is the valued outcome being studied. Such an investigation allows comparison of the costs of various alternatives for obtaining the desired outcome; an advantage of cost-effectiveness analyses is that they do not require cost estimates for variables that are often of greatest interest, but for which it is also most likely to be difficult to ascribe monetary value (e.g., achievement, motivation, effort, attendance, persistence, collaboration, ethical practice, and so on).

Table 5.4 provides an illustration of a cost-effectiveness analysis. In this hypothetical situation, a health professions training program is considering various approaches for aiding its candidates in preparing for the profession's credentialing examination. Listed in the first column of the table are the five potential methods under consideration: paying for candidates to attend a commercial test preparation program; assigning a faculty member to lead group review sessions; assigning all faculty members to conduct individual review sessions with each candidate; providing a conference room for and organizing peer review sessions of small groups of candidates; and purchasing a site license for review software specific to the profession. The second column indicates the cost per candidate of each option. Values in the third column (obtained from vendor pricing, from past experience with the options, or from a review of literature) provide the effect sizes in terms of the average raw test score increase associated with each option. The final column provides the cost-effectiveness ratio—that is,
the cost to obtain a one-unit increase in the outcome (i.e., a one-point test score increase).

Table 5.4 Example of cost-effectiveness (C/E) analysis

Method                                   Cost per candidate ($)   Effectiveness (average raw test score increase)   C/E ratio ($)
Commercial test preparation                              500.00                                                 5          100.00
Guided group review sessions                           5,000.00                                                 8          625.00
Individual candidate review sessions                  50,000.00                                                12        4,166.67
Peer review sessions                                      50.00                                                 6            8.33
Site-licensed review software                          2,500.00                                                 8          312.50
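The C/E ratios in the final column of Table 5.4 are simply cost per candidate divided by the average score gain. A minimal sketch, using the table’s hypothetical values (Python is used here only for illustration):

```python
# Cost-effectiveness ratio: dollars spent per one point of average score gain.
# Costs and average gains are the hypothetical estimates from Table 5.4.
options = [
    ("Commercial test preparation", 500.00, 5),
    ("Guided group review sessions", 5_000.00, 8),
    ("Individual candidate review sessions", 50_000.00, 12),
    ("Peer review sessions", 50.00, 6),
    ("Site-licensed review software", 2_500.00, 8),
]

for name, cost_per_candidate, avg_gain in options:
    ce_ratio = cost_per_candidate / avg_gain
    print(f"{name}: C/E ratio = ${ce_ratio:,.2f} per point of average gain")
```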
As a source of justification evidence, it can be seen that, although the peer review sessions provide, on average, a relatively low return in terms of test score increases, they do so at a cost that is substantially lower per unit of increase than most other options. Although the hypothetical situation illustrated in Table 5.4 is not strictly relevant to justifying a test use, it is easy to see how the methodology could be applied in the same way to such situations. For example, suppose a school district was considering five different commercial interim testing programs as a means of aiding its students in preparing for the state’s end-of-year summative accountability test. Revising Table 5.4 only slightly, the Method column would list the five options; column two would list the vendors’ costs per student; the third column would provide the average summative score gains; then the cost-effectiveness ratios would be listed in the far right column. Importantly, such an analysis assumes that each of the options being considered provides equally valid information about student learning; given that assumption, information on cost-effectiveness would aid the district in justifying the use of one of the options over the others.

Cost-Utility Analyses

A final approach from the program evaluation literature that can be adapted for justifying an intended test score use is cost-utility analysis. Cost-utility analysis answers the question, “What is the relative likelihood and value of doing A versus B?” To examine whether the use of a test for a given purpose is justified, let A represent using a test under consideration, and let B represent a second, non-test option. Cost-utility analyses are comparatively less data-intensive than other approaches, requiring only accurate estimates of the cost of the options and relying on qualified judgments about the probability and value of specified outcomes. Table 5.5 provides an illustration of a cost-utility analysis. In the hypothetical situation presented in the table, a high school is considering using a system of interim tests or hiring several “catch-up coaches” as potential options for increasing student achievement in reading and mathematics. To conduct a cost-utility analysis, the school obtains the costs of purchasing the interim tests and hiring the coaches (shown in the Cost row of the table) and then must gather judgments on two key issues: (1) judgments about the probability that each
of the options will result in, on average, at least a one-achievement-level increase in reading and mathematics for students, and (2) the value (i.e., “utility”) of those increases if they were obtained. Such information is often obtained in a survey of qualified persons (likely experienced mathematics and reading teachers) who are asked to indicate the probabilities for each option on a scale of 0.0 to 1.0 and the utilities on a scale of 0 to 10. The survey responses are averaged to obtain the entries for the respective cells in the table. The judgments about probabilities and utilities are then combined as shown in the Expected Utility row of the table. When the cost of each option is divided by its expected utility, a Cost-Utility Ratio results. This ratio provides a monetized estimate of the cost of obtaining each outcome, weighted by its probability and utility. As shown in the table, the more defensible option for the school would be the use of catch-up coaches: although their total cost is slightly higher, they provide the greater expected utility and the lower cost-utility ratio.

Table 5.5 Example of cost-utility analysis

Options                                                                    Interim testing           “Catch-up coaches”
Probability of raising mathematics performance by one achievement level                .5                         .3
Probability of raising reading performance by one achievement level                    .5                         .8
Utility of raising mathematics performance by one achievement level                     6                          6
Utility of raising reading performance by one achievement level                         9                          9
Expected utility                                                         [(.5)(6)+(.5)(9)] = 7.5   [(.3)(6)+(.8)(9)] = 9.0
Cost ($)                                                                               375                        400
Cost-utility ratio ($)                                                               50.00                      44.44
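The Expected Utility and Cost-Utility Ratio entries in Table 5.5 combine the averaged survey judgments as a probability-weighted sum of utilities and then divide cost by that sum. A minimal sketch, assuming the hypothetical values from the table:

```python
# Cost-utility sketch: expected utility is the probability-weighted sum of judged
# utilities; the cost-utility ratio is cost divided by expected utility.
# Probabilities, utilities, and costs are the hypothetical values from Table 5.5.
options = {
    "Interim testing":  {"prob": {"math": 0.5, "reading": 0.5},
                         "util": {"math": 6, "reading": 9},
                         "cost": 375},
    "Catch-up coaches": {"prob": {"math": 0.3, "reading": 0.8},
                         "util": {"math": 6, "reading": 9},
                         "cost": 400},
}

for name, d in options.items():
    expected_utility = sum(d["prob"][k] * d["util"][k] for k in d["prob"])
    cu_ratio = d["cost"] / expected_utility
    print(f"{name}: expected utility = {expected_utility}, cost-utility ratio = ${cu_ratio:.2f}")
```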
The usefulness of cost-utility analyses for justifying an intended test use is clear. Cost-utility analyses provide a method for comparing costs of various options on the basis of their judged value. In the hypothetical situation illustrated in Table 5.5, the school would obtain evidence that it was not as justifiable to purchase the interim testing solution as it would be to hire the intervention personnel. Just as clearly, perhaps, the analysis relies on accurate estimates of cost and sound human input regarding the probabilities and utility judgments.

Opportunity Costs

A final topic that must be considered as a source of Evidence Based on Costs of Testing is the concept of opportunity costs. When evaluating whether it is justified to use a test for a given purpose, the following concern often arises: Because funds are always limited, if an entity allocates funds for the purchase or development and administration of a test, what other potentially valuable expenditures can no longer be made? In general terms, opportunity costs refer to the losses in potential gains from other alternatives that are not chosen when one alternative is chosen. For example, a credentialing board may decide to implement a
recertification program. The funding that the board allocates to the development and administration of a recertification examination could have been allocated to other initiatives. Among the nearly unlimited alternative allocations would be:

• providing professional development programming for its members;
• developing and distributing a new professional journal for practitioners;
• reconfiguring its entry-level examination from a fixed-form to a computer-adaptive format;
• lowering the cost for attendance at its annual professional meeting for candidates in training as a means of broadening participation; or
• raising salaries for board staff members as a mechanism for attracting and retaining the best human resources for the organization.

In short, when gathering evidence based on the costs of testing as part of the effort to determine whether it is justifiable to implement a test for some intended use, it is possible that other cost analyses such as those described in the preceding section may reveal that it is highly effective, beneficial, or useful to do so. However, even under those conditions, the costs allocated to the implementation of a test could have been allocated in other ways unrelated to testing, and those options are precluded when funding is allocated to testing. Careful identification of the lost opportunities and the extent to which they are judged to be more or less valuable than an intended test use should be conducted as part of the justification effort.

Conclusions about Costs of Testing

In nearly all cases, gathering and evaluating Evidence Based on Costs of Testing relies as much or more on human judgments about worth or value as it does on the accurate estimation of any other quantities involved in an analysis. This reality is captured in the justification portion of the model of defensible testing illustrated in Figure 5.1 and highlighted as the pervasive influence of values represented at the bottom of the figure. Answers to questions about whether a cost is “worth it” depend on stakeholders’ perspectives, investments in the testing process, priorities regarding organizational, professional, societal or other goods, and myriad other factors. At minimum, the potentially differing views on worth or value should be captured and considered in the justification process.
Evidence Based on Alternatives to Testing

A third general source of information that can be brought to bear when evaluating whether some intended test use can be justified is Evidence Based on Alternatives to Testing. A first and most basic approach to this question was suggested by Messick (1988) who, in
contemplating how to proceed with the evaluation of social consequences of testing, suggested that a beginning point was to consider “the potential social consequences … [of] not testing at all” (p. 40). Surely this is a legitimate source of evidence. If the negative consequences associated with testing are judged to be more severe than the negative consequences of not testing, then not testing at all would seem justified. Of course—and Messick recognized this—not only is it typically difficult to estimate all of the negative consequences of testing, it is perhaps impossible to estimate the negative consequences of not testing. An example can be seen in the administration of intelligence tests, which was formerly widespread but has substantially diminished. In the recent past, such tests were routinely used to aid in the identification of talented students in academic settings, to provide information relevant to selection in personnel matters, and as a factor in promotion decisions in some occupational contexts. Cognitive measures that are similar to traditional intelligence tests are still used by many organizations to help them make informed hiring decisions, improve employee retention, and predict job performance (see Wonderlic, Inc., 2019). Although large-scale experiments have not been conducted, it would be possible to conduct research that compares outcomes such as academic achievement, employee productivity, and clinical decision making for samples identified based on test results versus samples selected without such testing.

Another source of Evidence Based on Alternatives to Testing is the evaluation of non-test options designed to achieve the same goals as administration of a test. In nearly all situations, various non-test options can be identified. Here, the term “test” is used to refer to a formal, standardized instrument, such as a multiple-choice test or set of performance tasks. When considering whether it is justified to use such tests for a given purpose, many non-test alternatives exist, such as interviews, evaluations of past performance, supervisor ratings, clinical observations, and self-reports. If such non-test options provide valid and dependable information, or if they provide credible information that can be obtained more easily, in a less intrusive, less costly, or more efficient manner, then it may be difficult to justify the use of a more intrusive, burdensome, difficult, or inefficient approach. For example, the costs and time burden associated with a series of tests conducted as part of health care screening for sleep apnea during an overnight clinical stay may not be justified if an alternative self-report of patients’ symptoms were equally effective.
Evidence Based on Fairness in Testing

The fourth source of evidence that should be brought to bear when considering whether it is justified to use a test for an intended purpose is Evidence Based on Fairness in Testing. As mentioned previously, there is no consensus on the meaning of the term fairness, in general, or regarding whether any particular intended use of a test is “fair” (see Zwick, 2017; Zwick
& Dorans, 2016). Indeed, because the term “fairness” is often considered in relation to the performances of differing ethnic, social, gender or other group demographic characteristics, fairness analyses are often problematic when the very definitions of group membership are not well specified or agreed upon. Nevertheless, fairness must be a primary consideration in a comprehensive model of defensible testing in at least two ways. First, the Standards (AERA, APA, & NCME, 2014) and other references typically—and rightfully—describe fairness in terms of validity. This comprises a first aspect of Evidence Based on Fairness in Testing: a test cannot justifiably be considered for use without a strong case affirming that scores generated by the test can confidently be interpreted to have their intended meaning. Beyond this, in some cases it may be important to demonstrate that what consumers of test scores believe a test score represents is consistent with what test developers intend for the test scores to mean (O’Leary, Hattie, & Griffin, 2017) as a source of information justifying a proposed test use.

Fairness concerns in testing also routinely focus on relationships between test performance and group membership. That is, a test is “fair” when the same inferences about examinee knowledge, skill, or ability are made for examinees scoring at the same level on a test, regardless of the examinees’ group membership. However, fairness concerns can arise beyond the scope of validity when considering whether a particular intended test use is justifiable. The use of even psychometrically valid scores with strong support for their intended interpretations may nonetheless be considered to be unfair and a proposed use deemed unjustified.

The effort to promote fairness in testing is a two-pronged endeavor. First, fairness should be considered in test design, development, and administration—efforts that promote validity. As Camilli (2006) notes, “it is generally agreed that tests should be thoughtfully developed and that the conditions of testing should be reasonable and equitable for all students” (p. 221). Second, fairness should be considered in evaluation of test score use and in test impact. There are many specific statistical techniques for investigating fairness depending on the definition of fairness that one adopts, including differential item and test functioning analyses, adverse impact analyses, differential prediction analyses, and others. Readers are referred to Chapters 2 and 4 of this volume for information on some procedures to investigate score validity and to the chapter by Camilli in Educational Measurement, fourth edition (Brennan, 2006) for more detailed information on statistical approaches. However, evaluations of fairness in test score use and impact go beyond statistical considerations and “are inevitably shaped by the particular social context in which they are embedded” (Camilli, 2006, p. 221). An introduction to four fairness concerns is presented in the following sections.

Stakeholder Input
A first fairness concern when examining whether it is justified to use a test for an intended purpose is the extent to which relevant stakeholders were included in the deliberations and had effective input into the decision. In the field of program evaluation, stakeholders are defined as persons who fund a program, who have decision-making authority regarding a program, who implement a program, who have direct responsibility for a program, who are the intended beneficiaries of a program, or who may be disadvantaged by a program. Clearly, this listing captures many groups, interests, levels of knowledge about a program, and degrees of power or influence. Kane (2006a) has also recognized that the potential stakeholders in any testing process are diverse and numerous, and he notes that “any consequences that are considered relevant by stakeholders are potentially relevant to the evaluation of how well a decision procedure is working” (p. 56). Additionally, Kane has observed that stakeholders should be polled for their input because “many different kinds of evidence may be relevant to the evaluation of the consequences of an assessment system … and many individuals, groups, and organizations may be involved” (2001, p. 338).

There are at least three key decision points related to stakeholder inclusion; among them are: (1) identifying relevant stakeholders; (2) obtaining input from identified stakeholders; and (3) addressing power (economic, academic, social, rhetorical) differentials among stakeholders. It is safe to say that current practice often does not routinely identify diverse stakeholder groups, does not actively seek out diverse stakeholder input when the decision is made to use a test for a specified purpose, and does not consider stakeholder power differentials. For example, it would not ordinarily be the case that marriage therapy clients or journal editors would be consulted as to whether a therapy effectiveness instrument should be used as a pre- and post-measure; rather, that decision would ordinarily be made by a comparatively narrow set of stakeholders including, perhaps, clinicians who would likely use the instrumentation in consultation with those developing it.

On the other hand, sometimes broader stakeholder input is sought when contemplating an intended test use. Legislatures, governmental agencies, or other regulatory bodies may hold public hearings on, for example, whether passing an exit examination should be required for high school graduation, with parents, educators, PTA representatives, college admissions officers, prospective employers and even students afforded an opportunity to provide input on the decision. Even in situations where these more inclusive processes are in place, there are rarely, if ever, procedures in place to ensure that the perspectives presented are explicitly considered in arriving at a decision about
test use, nor are there explicit guidelines available for how to accomplish an equitable synthesis. To be sure, in some cases there may be a sound rationale for not incorporating some perspectives into a test use decision. However, even in such cases, it would seem incumbent on those making a test use decision to explicate the basis on which the decision was made, how stakeholder input was incorporated, and rationales for why any perspectives were discounted.

To be clear, the justification process and the consideration of Evidence Based on Fairness in Testing are not exclusively—and sometimes perhaps not at all—within the purview of psychometric expertise or primarily led by those responsible for the validation effort. Regarding the fairness concerns that can arise in the course of justifying an intended test use, Kane has noted that “the measurement community does not control the agenda; the larger community decides on the questions to be asked” (2006a, p. 56). Although Kane appears to allow for even the intended score interpretation to be negotiated, stating that “agreement on interpretations and uses may require negotiations among stakeholders about the conclusions to be drawn” (p. 60), formal responsibility for and control over intended score interpretations should likely fall squarely and exclusively on the test developer and those who conduct the validation effort. By contrast, the entity in the best position to organize and conduct the justification effort is the entity that proposes or has the authority to specify a particular test use. That entity will need to attend to the issues of stakeholder identification, solicitation of stakeholder input, and power differentials among stakeholders. Surely, conflicts will arise. Next steps in gathering and considering evidence based on test fairness to justify a test’s use will be the development of methods for conducting and arbitrating what can be (at least in high-stakes contexts) contentious negotiations and for avoiding and addressing conflicts that are likely to arise in the presence of contending constituencies or competing stakeholder interests.

Opportunity to Learn

A rhetorically powerful concept in educational research and testing is the notion of opportunity to learn (OTL). McDonnell has labeled OTL as a “generative concept” (1995, p. 305) alongside such other powerful concepts as individual differences and differentiated curriculum; her work traces the introduction of the OTL concept to the First International Mathematics Survey in the early 1960s. In that research, and subsequent research on mathematics and science learning, it was demonstrated—perhaps predictably—that student achievement in an area was related to the exposure to and opportunities to learn the tested content. As regards OTL as a source of Evidence Based on Fairness in Testing, at least on the surface, nothing would seem more unfair than requiring an examinee to take a test covering knowledge or skills that he or she has not had the opportunity to learn. As it turns out,
justification for the use of a test must approach OTL cautiously, however. For example, suppose it was proposed to use a counseling skills inventory as a pre- and post-test measure in a counselor training program. Presumably, the needed skills would not be possessed by candidates prior to their exposure to those skills in the associated courses and internship experiences. It would not seem to be a source of unfairness to identify that, indeed, the in-training counselors lacked those skills prior to beginning the program (and presumably acquired the skills subsequently). Perhaps somewhat controversially, the same logical analysis applies to the case of tenth graders who must pass a state-mandated civics test in order to graduate from high school. On the one hand, from the perspective of validation of intended test score meaning, it would actually be supportive validity evidence if a group of tenth graders whose civics course teacher had not covered the prescribed content were to perform poorly on the test. On the other hand, from the perspective of justification of the intended test use, it would seem to raise fairness concerns if students were not eligible for graduation because of the failure of their teacher to cover the required civics content.

In summary, OTL is not necessarily a matter of unfairness. However, when the lack of OTL has the potential to deny benefits, rights, or status because of factors outside the control of examinees, it is a serious threat to fairness in testing. In all such cases, rigorous studies should be conducted to ascertain the extent to which OTL has occurred. Such studies might include review of course syllabi, analyses of the content of field experiences, observations of instruction, or surveys of enacted curriculum. Also relevant, of course, is the extent to which examinees took advantage of opportunities to learn that were provided. For example, the fact that some examinees may have had high absenteeism, may have opted not to take relevant courses, may have not fully participated in internship opportunities, etc., would need to be considered when evaluating the extent to which a lack of OTL represents a source of unfairness that argues against justifying an intended test use.

Due Notice

Perhaps one of the most recognizable cases in testing law is that of Debra P. v. Turlington (1979). In 1978, the Florida legislature amended the Educational Accountability Act of 1976 to condition the receipt of a standard high school diploma on passing a functional literacy test. Students who met all other requirements for graduation but who did not pass the Florida Functional Literacy Examination (FFLE) were ineligible to receive a standard diploma, but instead received a certificate of completion. On the first administration of the test in October 1977, 41,724 of the 115,901 Florida high school students (36%) taking the FFLE failed one or both sections of the test. Debra P. was one of a larger class of students who asserted that the test violated various constitutional protections. One of the issues in the trial that the court addressed in its 1979 opinion was a clear concern regarding OTL: the court held that the state of Florida had not made any effort to
make certain whether the test covered material actually studied in the classrooms. Beyond the OTL issue, plaintiffs represented in Debra P. claimed that the requirement to pass the test was fundamentally unfair because a due process protection of adequate due notice of the requirement had not been provided. Under the 14th amendment to the U.S. Constitution, no state shall “deprive any person of life, liberty, or property, without due process of law.” The trial court found that Florida students, having taken all required courses and completed other relevant requirements in the state’s compulsory education system, had a legitimate expectation of receiving a standard high school diploma—i.e., a “property” to which they were entitled and which could not be denied without due process. Notably, a person can be denied life, liberty, or property, but not before the person receives a fair hearing that follows a prescribed process. The Debra P. case explicitly considered what process was due. Among other things, the court held that the implementation of the requirement to pass the FFLE was too hasty:

The Court finds the facts in the instant case compelling. The Plaintiffs, after spending ten years in schools where their attendance was compelled, were informed of a requirement concerning skills which, if taught, should have been taught in grades they had long since completed. While it is impossible to determine if all the skills were taught to all the students, it is obvious that the instruction given was not presented in an educational atmosphere directed by the existence of specific objectives and stimulated throughout the period of instruction by a diploma sanction. These are the two ingredients which the Defendants assert are essential to the program at the present time. The Court is of the opinion that the inadequacy of the notice provided prior to the invocation of the diploma sanction, the objectives, and the test is a violation of the due process clause. (Debra P. v. Turlington, 1979, p. 267)

In short, the fairness issue was one of due notice—a concept which arises frequently whenever an examination is mandated for an intended purpose, whenever changes in examination content are implemented, or whenever passing scores on examinations are established or adjusted. In such cases, examinees must be provided with adequate notice of the requirements or changes, and must have adequate time to prepare for them. For example, a credentialing board might conduct regular job analysis surveys of practitioners to maintain currency in terms of what entry-level candidates should be expected to know and be able to do. This would provide validity evidence based on test content, but would not be sufficient to justify requiring candidates to be proficient in the new content on a licensure examination. Justification for basing licensure on scores from the new content specifications would require evidence that the new content areas to be included were communicated to examinees and
included in preparation programs sufficiently in advance of administration of the new examinations to allow candidates (and training programs) to master them. The degrees of change that might trigger concern about due notice will vary across specific circumstances, as will the length of the due notice period that would be adequate. However, any consequential testing program should assess the need and methods for providing such notice when deemed necessary.

Disparate Impact

A final source of Evidence Based on Fairness in Testing is that of differential or disparate impact. According to Camilli: “disparate impact describes group differences in test performance that result in different group proportions of candidates identified for selection or placement” (2006, p. 225). Disparate impact exists when the consequences of applying a test requirement differentially affect groups of interest. Two examples would include: (1) a knowledge test used to identify candidates for firefighter training might disproportionately identify more African Americans than Asian Americans as qualified for admission to the training program; and (2) a personality test used to select candidates for flight attendant training might disproportionately identify more female than male applicants. In practice, a “four-fifths” rule is typically used as a threshold for labeling an observed difference in impact as disparate; under the four-fifths guideline, impact is flagged as disparate when one group’s selection or placement rate is less than four-fifths of the rate for the group with the highest rate (see Uniform Guidelines on Employee Selection Procedures, 1985).

A first question that must be addressed when considering differential impact is the validity of the test scores. Sources of evidence regarding whether test scores can be confidently interpreted to mean what they are intended to mean (e.g., aptitude for training as a firefighter or flight attendant) must first be collected and evaluated. To the extent that scores are differentially valid for the groups being studied, bias (i.e., invalidity of scores) would be a primary concern. Assuming that there is adequate evidence of score validity, a second concern might still exist regarding the fairness of using a test with differential impact for a given purpose. While a thorough legal presentation of this concern is beyond the scope of this book, the justification of using a test for a given purpose when differential impact is observed depends on the degree of the differential impact and on whether there exists a compelling need to use the test as intended in the face of that impact. Additional information on disparate impact and relevant legal standards can be found in Camilli (2006) and Phillips and Camara (2006).
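A brief sketch of how the four-fifths guideline is typically applied appears below. The applicant and selection counts are invented solely for illustration; the comparison itself (each group’s selection rate against the highest group’s rate, with ratios below .80 flagged) follows the guideline as described above.

```python
# Four-fifths rule sketch: compare each group's selection rate with the rate of the
# group selected at the highest rate; ratios below 0.80 are typically flagged as
# potential disparate impact. Applicant and selection counts here are hypothetical.
groups = {
    "Group A": {"applicants": 200, "selected": 60},   # selection rate = 0.30
    "Group B": {"applicants": 150, "selected": 30},   # selection rate = 0.20
}

rates = {name: g["selected"] / g["applicants"] for name, g in groups.items()}
highest_rate = max(rates.values())

for name, rate in rates.items():
    impact_ratio = rate / highest_rate
    flag = "potential disparate impact" if impact_ratio < 0.80 else "within the guideline"
    print(f"{name}: selection rate = {rate:.2f}, impact ratio = {impact_ratio:.2f} ({flag})")
```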
Summary of Sources of Evidence for Justification of Test Use

Procedures and sources of evidence that might be gathered for justifying an intended test use
have not been formalized to the same extent as procedures and sources of evidence for validating intended test score inferences. In this chapter, four sources of evidence for justifying test use were described:

• Evidence Based on Consequences of Testing;
• Evidence Based on Costs of Testing;
• Evidence Based on Alternatives to Testing; and
• Evidence Based on Fairness in Testing.
Like the sources of evidence for validating an intended score meaning, not all of the possible sources of evidence for justifying a test use would ordinarily be pursued. The particular constellation of evidentiary sources is necessarily linked to the specific concerns or questions of interest, stakeholder concerns, and stakes involved for test takers, consumers of test information, and the entity responsible for the testing program. Finally, if more than one test use is contemplated, a justification effort for each use would be required.
Comparing Validation and Justification

Having described the separate processes of validating intended test score meaning and justifying intended test score use(s), it is now possible to compare those activities on at least seven dimensions. The seven dimensions are shown in Table 5.6. The first column of the table lists the dimensions. The second column, labeled Validation of Intended Test Score Inference, outlines how each dimension applies to validation of the intended inference. The third column, Justification of Intended Test Score Use, provides corresponding descriptions related to the justification of test use.

Table 5.6 Dimensions of validation and justification

Dimension        Validation of intended test score inference           Justification of intended test score use
Rationale        Support for intended score meaning, interpretation    Support for specific implementation or use
Timing           Antecedent; primarily prior to test availability      Subsequent; primarily after test is made available,
                 and use                                               put into use
Focus            Primarily evidence-centered                           Primarily values-centered
Tradition        Primarily psychometric, argument-based                Primarily policy and program evaluation,
                                                                       argument-based
Warrants         Primarily technical, scientific                       Primarily ethical, social, economic, political,
                                                                       and rhetorical
Duration         Typically ongoing investigation to support            Potentially recurring negotiated decision-making
                 substantive claims                                    process
Responsibility   Primarily test developer                              Primarily test user, policy maker
Two notes regarding the table and the elaboration to follow in this section are warranted. First, the presentation in Table 5.6 of the two endeavors, validation and justification, is
purposefully parallel, intended to give equal priority to the distinct and equally important tasks of gathering evidence in support of an intended test score interpretation and gathering evidence in support of justifying an intended test score use. Second, although the following paragraphs highlight how validation and justification differ on the listed dimensions, the differences apply in general and are not necessarily universal. That is, there are no bright-line boundaries between the validation and justification efforts; specific instances will be observed in which the relative orientations on the dimensions will be reversed. Nor are there clean bifurcations with respect to the responsibilities of test developers and test users regarding validation and justification. In general, the elaborations related to the dimensions presented in Table 5.6 apportion greater responsibility for validation to test makers and greater responsibility for justification to test users and policy makers. Even so, it seems essential (or at least desirable) that meaningful collaboration involving all parties should occur when either effort is undertaken. For example, those involved in test development and validation should keep potential consequences in mind as test goals, formats, reporting procedures, audiences, intended inferences, etc. are determined. Those involved in gathering and evaluating information to support test use will also likely rely on information gathered in validation efforts about the meaning of the test scores. Given those caveats, the following paragraphs examine the seven dimensions in greater detail.
Dimension 1: Rationale

The first dimension on which validation and justification differ is the rationale for gathering information. Concerning validity, the rationale for a validation effort is to gather evidence bearing on a specific, intended test score interpretation. Ordinarily, the burden to plan, gather, document, and disseminate this information falls on the developer of the test. However, the burden of gathering information to support test score meaning may at times fall on a test user when there is doubt as to whether the available evidence pertains to a specific, local context, or when the user contemplates an application of the test not anticipated by the test developer or supported by the available evidence. On the other hand, the rationale for information gathering for justification is to support the intended use. The justification effort may be conducted to examine the role of the test with respect to a policy decision that incorporates use of the test, to ascertain the extent to which anticipated benefits or costs are realized, or to investigate intended or unintended consequences of testing.
Dimension 2: Timing

The second dimension on which validation and justification differ is the timing of the inquiry. As regards gathering of evidence to support the intended test score meaning, this effort is primarily antecedent to test use. Although, as described in Chapter 2, validation is a
continuing effort and evidence continues to be gathered following test use, a substantial and adequate portion of this work must occur before the operational use of a test; that is, prior to using the test in consequential ways. As regards justifying a test use, some empirical evidence may be available, and some arguments may be formulated in advance of test use. However, the greater part of the justification effort typically occurs following the availability and operational use of a test. When a test is used primarily because of anticipated benefits, a good deal of the evidence justifying the use of a test cannot be evaluated until after the anticipated benefits would be expected to materialize. And, because the contexts of test use evolve, and the justification for a test’s use in one setting, time, and population does not automatically generalize to others, the justification effort—like the validation effort—is also an ongoing endeavor.
Dimension 3: Focus

A third dimension on which the validation and justification efforts differ is their focus. Although both endeavors involve information gathering, validation is primarily data-driven and evidence-centered, whereas justification examines application and highlights differences in values brought to bear in the decision-making process. A theory/application dichotomization for this dimension is an oversimplification, but the distinction is useful. On the one hand, validation efforts are grounded in the desire to operationalize a specific theoretical orientation toward a construct, to deepen understanding about a characteristic, or to aid in refining the meaning of a particular construct. It is in this sense that the aphorism “all validity is construct validity” is meaningful. On the other hand, justification efforts are—or at least can be—agnostic as to whether an instrument advances basic knowledge in a discipline, extends theory, or fosters understanding of a particular construct. Justification efforts seek primarily to determine if a particular application yields anticipated benefits or promotes an outcome deemed to be desirable apart from any theory-building benefits.
Dimensions 4 and 5: Traditions and Warrants

Flowing from distinctions in focus are the fourth and fifth dimensions on which validity and justification differ: the traditions brought to bear and the warrants for conclusions. The validation process invokes primarily psychometric traditions for evidence gathering and interpretation; the warrants for summary evaluative judgments about the adequacy, synthesis, and interpretation of the validity evidence are primarily technical. These traditions and warrants are often—tacitly or explicitly—endorsed with near unanimity by a disciplinary community (see, e.g., Kuhn, 1962; Longino, 2002). For example, the Standards for Educational and Psychological Testing (AERA, APA, NCME, 2014) represent the endorsed position of dozens of organizations regarding the sources and standards of evidence for the
validation effort. Psychometric traditions to guide validation have been developed, formalized, documented and disseminated for nearly the last 50 years. Similarly accepted and long-standing traditions and warrants for justification of test use do not yet exist; Chapter 5 is a beginning attempt to suggest what such sources and standards of evidence might look like for the justification effort.
Dimension 6: Duration

The sixth dimension that distinguishes the validation and justification efforts is related to the temporal sequencing and duration of the efforts. As was indicated previously in Chapter 2, an accepted tenet of modern validity theory is that the case for validity is never truly considered to be closed. Efforts aimed at gathering evidence to support an intended test score interpretation should be ongoing; judgments about the appropriateness and accuracy of the intended inference are continually buttressed or threatened by the accumulation and analysis of new information, diverse applications, additional research, theoretical refinements and other factors that yield evidence after the initial validation effort.

In contrast, although information gathered in justification of a test use should also be gathered continuously, the effort is typically considerably more decision-oriented and time-critical. The decision to use a test for a specific purpose is a process that incorporates the input and perspectives of those who allocate resources, policy makers, constituencies, stakeholders, and others affected by the use of a test. The process has as its immediate aim the practical outcome of deciding if and how a test should be used in a specific way. Thus, information gathering in support of test justification is more goal-oriented and focused on the decision at hand than the information gathering of a validation effort. The temporal aspect of justification of test use is also apparent in that the case for a specific use may be deemed weaker, stronger, or demand reconsideration to the extent that there is a change in context, policy, political goals, stakeholders, or a shift in power relations. Reconsideration of test use may not be undertaken under stable conditions, but may be demanded with changes in resources, when alternative procedures become available, when different policy aims prevail, when unintended consequences are discovered, or when questions arise about whether test results are being used in a manner that meets a perceived need or goal.
Dimension 7: Responsibility

The final dimension on which validity of score inferences and justification of test use differ concerns the ultimate responsibility for the effort. As was suggested at the beginning of this chapter, because evidence of the validity of test scores should be developed prior to making a test available, it is the test developer who has the potential to engage in this analysis and who bears singular responsibility for establishing the level of confidence supporting intended test
score inferences. On this count, the framework presented here is especially powerful toward the goal of improving the practice of validation. The process for articulating a series of related, logical claims about the score inference—what Kane (1992, 2006a, 2009) has described as an argument-based approach—provides a straightforward starting point for contemplating the appropriate sources of evidence and judging whether the claims are adequately supported. As regards justification of test use, the responsibility for deciding upon appropriate sources of evidence to justify a particular use, for gathering that evidence, and for developing an integrated, evaluative judgment concerning test use is likely to be apportioned in different ways. In some cases, a test developer will have in mind not only a particular score interpretation, but might also intend a specific use for which the instrument is to be commended or marketed. In such cases, both the validation and justification burdens would fall on the test developer. In other cases, a test might be considered for a use that the test developer never envisioned. In such cases, the responsibility for justification of test use would fall squarely on the user. Even in situations where a test use might reasonably be anticipated by a test developer, the burden of justifying a specific test use would seem to fall more directly on the test user. It is the test user who decides to use a test in the first place; it is the test user who, when options exist, chooses one test over another; and it is the test user who typically associates consequences, rewards, sanctions or decisions with test performance. As Kane has indicated, “an argument can be made for concluding that the decision makers (i.e., the test users) have the final responsibility for their decisions … and they are usually in the best position to evaluate the likely consequences in their contexts of the decisions being made.” (2001, p. 338). Finally, in still other cases, a test developer may not have any specific interest (except, perhaps, a pecuniary one) in a test use, and would not ordinarily be in a position to aid decision makers with the policy calculus they must perform. Nonetheless, because the perspective of the test developer and the insights gained in the validation effort can provide valuable information bearing on a proposed test use, a collaborative justification effort between test developer and test user would seem advantageous.
Critical Commonalities

The seven dimensions just described illustrate that validation of intended test score inferences can—and should—be distinguished from justification of intended test uses. Although the preceding sections highlighted essential differences between the validation and justification efforts, they also share common characteristics. First, the validation and justification efforts share a common design—one that requires an integrated evaluation of information to arrive at a conclusion. One conclusion is a judgment
about how adequately the evidence supports the intended test score interpretation; the other is a judgment about how compelling the case is for a specific use of those scores. Second, both the validation and justification efforts are not sets of mechanistic steps, but the sources of evidence, as well as their relevance and value, depend on the particular validation or justification situation at hand. The essential sources of evidence to support the validity of an intended score inference and the sources of evidence to justify an intended test use vary across test purposes, contexts, policies, and risks. The mix of evidence that is appropriate for one situation may be inappropriate for another. The determination of appropriate sources of validation evidence depends on the specific inference(s) that scores are intended to yield; the determination of appropriate sources of justification evidence depends on the specific use to which the scores will be put. For example, it is possible to imagine a validation effort for a hypothetical test claimed to support inferences about a multifaceted construct, with subscores formed on the basis of constellations of items purporting to measure different aspects of the construct. At minimum, the validation effort would seem to require empirical evidence about the internal structure of the instrument. Factorial validity evidence would buttress claims about the intended inferences; an absence of such evidence would pose a serious threat to claims about what the instrument measures. Of course, factorial validity evidence alone would be insufficient because variation in test performance may still be attributable to something other than the construct of interest; other sources of evidence would need to be mined, based on the specific claims (see Kane, 2006a). Necessary evidence would likely include documentation that the items were created and the subareas were formed to reflect the theory underlying the construct (i.e., Evidence Based on Test Content; Evidence Based on Test Development and Administration Procedures) and evidence that scores on the instrument do not reflect covariation with another, different construct (i.e., Evidence Based on Hypothesized Relationships among Variables). In assembling the case to justify a specific use of this hypothetical test, the appropriate evidentiary sources would depend on the particular context, intended outcomes, values, and constituencies involved in the evaluation. For example, suppose that the test was determined to have adequate validity evidence to support the intended score inferences. Justification evidence would now be required. Such evidence might include information on the personnel time required to appropriately administer and interpret the test results, alternative measures that could be used, consideration of other priorities for spending the resources allocated to testing, or information about the usefulness of the subscore information to clinicians or the test takers themselves. A fourth commonality is that differences in resources will characterize both the validation and justification endeavors. Regarding the validation effort, researchers and test developers —ranging from graduate students to commercial publishers—differ in the resources they
bring to investigations of support for an intended inference. These disparities can result in stronger or weaker support for an intended score inference that may be unrelated to the quality of an instrument. Regarding justification of test use, disparities also exist; they may include resource differentials, but also involve differentials in the power, position, rhetorical skill and other characteristics of those involved in—or excluded from—the negotiation and evaluation process. These disparities can function as gatekeepers in determining what information related to test use is brought to bear. Importantly, any disparities in power and position that are manifested in the course of justifying test use do not affect support for the intended inference (i.e., the validity argument), although they can influence how arguments about legitimate or inappropriate test uses are framed and decided.
Some Common Threats to Confident Use of Test Scores

Just as there are threats to the confident interpretation of test scores that must be addressed in the validation effort, there are threats to the confident use of test scores that should be addressed in the course of justifying an intended test use. Overall, major concerns related to an intended test use can be discerned from a review of the sources of evidence for justification provided in the preceding sections. A concise list would include:

• insufficient evidence of score validity;
• the discovery of unintended consequences following test use;
• failure to collect evidence related to unintended positive consequences;
• differing values brought to bear in the justification effort (e.g., regarding what evidence is considered relevant to justifying a test use, differing weighting of the evidence, differing perspectives on the relative seriousness of false negative and false positive classification decisions);
• the use of differing justification processes that can lead to potentially conflicting conclusions regarding an issue, question, claim or target of an investigation; and
• assumed generalizability of the results of justification evidence in one context to other persons, places, settings, times, cultural contexts, etc.
The Comprehensive Framework for Defensible Testing

Having now separately presented and described the sources of evidence and nature of validation, and the sources of evidence and nature of justifying an intended test use, it is now possible to consider both of these parts as a comprehensive framework for defensible testing. The complete framework is presented in Figure 5.3, and it has broad applicability. It is equally applicable to commercial tests, published measures, surveys, scales developed for
research—indeed, to any context in which an instrument is developed to tap a specific construct and when the administration and use of the instrument can be anticipated to affect the persons, systems, or contexts in which the scores will be used. The reconceptualization of defensible testing into the dual, distinct emphases of validation and justification illustrated in Figure 5.3 recognizes the distinct but interrelated relationship between intended test score interpretations and uses. The defensibility of test use surely depends on many factors, not the least of which is evidence supporting the intended score meaning; just as surely, value implications infuse all aspects of test development, validation, and use. Although interrelated, the differing purposes of validation and justification necessarily require the collection, synthesis, and evaluation of evidence from differing sources. Defensible testing is supported only when there is evidence that test scores can be confidently interpreted as intended and there is evidence that intended uses of the scores are justified.

The distinction between validation of score inferences and justification of test use also highlights that discrepant outcomes from those investigations are entirely possible. A test with abundant support for score validity may be completely lacking in evidence justifying its use (or the justification evidence might present a strong case that an intended use should be proscribed). Or, a test may be judged as useful in a case where the evidence only weakly supports the intended inference. Indeed, the rationale for using a test with weak validity evidence may be that even a test with weak evidence improves decision making or that its use produces some ancillary desirable outcome. Examples of this abound, from political “push surveys” that are only nominally intended to gauge candidate preference but have the effect of sensitizing voters to candidates or issues, to the increased use of constructed-response formats in achievement testing, which has been found to stimulate desirable instructional practices.
A Note on Consequences of Test Use as a Source of Validity Evidence

A central thesis of Chapter 3 was that it is assessments, decisions, and results of testing that are consequential, not the validity. By implication, a common misconception can be corrected: the validity or meaning of the score with respect to the intended construct does not depend on the specific use of the test. However, although validation of intended test score inferences and justification of intended test score uses have different aims and sources of evidence, they can also be related. For example, it was noted in Chapter 4 that, in some circumstances, evidence uncovered in the course of using a test can inform the meaning of scores and, in some cases, compel test developers to revise intended score interpretations or reconsider the specification of the construct that is the object of measurement. How might such “cycling back” occur? Information gathered after a test has been administered can be mined for evidence that the
construct the test purports to measure may have been inadequately specified. The two mechanisms by which this occurs are measurement errors of omission and commission described in Chapter 4; namely, the concern regarding construct misspecification (i.e., when important aspects of the construct affecting performance are not included in the measurement process) and construct-irrelevant variation (when additional characteristics beyond those intended to be measured affect performance). When identified, information about construct misspecification or construct-irrelevant variation can be used to refine theory about the construct and improve the measurement instruments designed to tap it. This concern is reflected in the Standards:

When unintended consequences result from test use, an attempt should be made to investigate whether such consequences arise from the test’s sensitivity to characteristics other than those it is intended to assess or from the test’s failure to fully represent the construct. (AERA, APA, & NCME, 2014, p. 30)

It should perhaps be noted that regardless of whether a consequence has been anticipated, intended, unanticipated, or unintended, those consequences may provide important information related to construct misspecification or construct-irrelevant variation. To incorporate this possibility, Figure 5.3 includes two paths following from a positive decision regarding use of a test. The solid line indicates that the consequences of test use clearly comprise a source of evidence that must be considered in justifying the continuing use of a test. In addition, the dotted line indicates that the same results of test use can, in some cases, serve to inform the intended score interpretation.
Conclusions

Validity theory has advanced appreciably, and validity continues to be an evolving concept. Modern psychometrics has moved far from the notion that “a test is valid for anything with which it correlates” (Guilford, 1946, p. 429) to a more sophisticated paradigm with broadly accepted, fundamental tenets regarding validation of intended score meaning. However, modern testing practice has lacked broadly accepted fundamentals and guidelines for evaluating intended test uses. The framework presented here builds on the contemporary tenets of validity theory and addresses contemporary controversies. The comprehensive framework described here encompasses both of the important concerns that must be addressed—concerns about the validity of score meaning and concerns about the justifiability of an intended test use—while also differentiating these related inquiries and describing two parallel endeavors to gather evidence bearing on conclusions about them.
Beyond the procedures for ensuring that there is evidence that test scores can be confidently interpreted as intended by a test developer (i.e., validated), procedures and standards must be followed for ensuring that there is evidence that test scores can be confidently used as intended by test users (i.e., justified). A test score that has an established, singular, intended, validated meaning may nonetheless have numerous, diverse, intended or unintended uses, each of which must be justified. The first part of a comprehensive framework for defensible testing comprises the validation effort. The second part of the framework presented in this chapter comprises the justification effort. The complete framework presented here provides not only clarity regarding validity, but also practical guidance and sources of evidence regarding justifying test use; in addition, the framework addresses lingering issues and orphan concepts in testing such as OTL, fairness, and consequences of testing—concepts that are unrelated to score meaning but critical to consider in relation to defensible score use.

Of course, reconceptualization alone will not resolve the oft-noted gap between validity theory and validation practice: Greater alacrity in validation and justification efforts is still required. One hoped-for consequence of differentiating between validity of score inferences and justification of test use is that rigor regarding both efforts will be enhanced. To the extent that the concept of validity is more clearly focused and justification efforts are stimulated, a revised framework can help foster the goals of facilitating more complete and searching validation practice, enhancing the quality and utility of test results, and enabling those who develop and use tests to improve the outcomes for the clients, students, organizations, and others that are the ultimate beneficiaries of high-quality test information.

Finally, although compiling and describing potential sources of evidence for the validation and justification efforts is necessary, when conducting both efforts, the practical questions remain regarding how much evidence is enough. The considerations and potential answers to these questions are explored in Chapter 6.
6 HOW MUCH IS ENOUGH?
Just can’t get enough. (The Black Eyed Peas, 2010)
The comprehensive framework for defensible testing presented in Chapters 4 and 5 comprises two equally important endeavors: (1) validation—the gathering and weighing of evidence bearing on an intended test score interpretation; and (2) justification—the gathering and weighing of evidence bearing on an intended test score use. Previous chapters have focused on the “gathering” part of those activities, addressing questions of what evidence is relevant and procedures for collecting and analyzing that information. This chapter focuses on the “weighing” part—addressing the question of how much evidence is enough to support confidence in intended test score meaning and use. The Standards and other professional literature on validity appear to recognize the importance of weighing the evidence, but they provide little guidance on the question of how much evidence is sufficient. For example, Kane has indicated that “different interpretations/uses will require different kinds of and different amounts of evidence” (2009, p. 40), but he provides no guidance on the nature of those differences, kinds, or amounts. The question of “How much evidence is sufficient?” represents a continuing dilemma in validation; the question also pertains to justification. The question of sufficiency of evidence is critical to defensible testing, and the lack of guidelines for the evaluation of evidence may well contribute to the generally anemic state of validation (and justification) efforts. The state of affairs has been summarized by Professor Edward Haertel of Stanford University, past president of the National Council on Measurement in Education, former member of the National Assessment Governing Board, and member of the National Academy of Education, who observed: “If no amount of evidence is sufficient, then any amount of evidence will suffice” (personal communication, May 14, 2017). On a more positive note, and contrary to the Black Eyed Peas song lyrics, it is possible to get enough evidence to support validation of intended test score meaning and justification of an intended test use. First, a caution. Although guidance is desirable, any time that guidelines are proffered, there is a danger. In research methodology courses and dissertation research, graduate students often want to
know, “How many participants do I need in my research?” In testing contexts, those who plan to use an instrument often want to know, “What level of reliability is acceptable?” or, “How many observations on each variable are necessary?” In statistical analysis, the question often arises regarding what statistical significance level should be used or what qualitative label should be used to describe an observed effect size. Too often, what were initially offered as reasonable guidelines for specific circumstances become set in stone: 100 participants are needed; a minimum level of reliability is .80; 10 observations per variable are required; the .05 level of significance should be used; an effect size of .50 is “moderate.” Yet, all of the aforementioned guidelines lack context and judgment. Cowles and Davis (1982) describe the somewhat arbitrary identification of .05 as a guideline for statistical significance, tracing it to Sir Ronald Fisher’s notions of convenience and personal preference. In the course of developing his effect size guidelines, Cohen (1988) expressed concern about the “many dangers” (p. 12) associated with formalizing them, and admitted that the guidelines he proposed were devised “with much diffidence, qualifications, and invitations not to employ them if possible” (p. 532). Glass et al. (1981) have also expressed reservations about adopting strict guidelines. In the context of effect sizes, they commented:

There is no wisdom whatsoever in attempting to associate regions of the effect size metric with descriptive adjectives such as “small,” “moderate,” “large,” and the like. Dissociated from a context of decision and comparative value, there is little inherent value to an effect size of 3.5 or .2. Depending on what benefits can be achieved at what cost, an effect size of 2.0 might be “poor” and one of .1 might be “good.” (p. 104)

The wisdom of Glass et al.’s observation applies equally to consideration of how much evidence is sufficient to claim confidence in the intended meaning of test scores or that an intended test use is justified. Thus, this chapter will provide neither quantitative indices of sufficiency nor qualitative descriptors for specified levels of adequacy. Instead, the aspects briefly mentioned by Glass et al. will be explored in greater depth: decision making and comparative value. In addition to those two aspects, it is also important to recall a third aspect—values. The answers to questions about how much evidence is enough to support an intended score meaning or an intended score use surely involve technical considerations—but not solely technical considerations. The influence of values, perspective, and priorities applies as much to the gathering of validation and justification evidence as it does to evaluations of the sufficiency of that evidence. In the following sections, some primary factors that should be taken into account in support of conclusions about the adequacy of evidence for test score meaning and use are discussed. In addition to the factors presented here, the realities of local
contextual constraints, budgets, needs, and other relevant factors cannot be treated in detail here, but for a specific proposed test use they may need to be examined as much as, or more than, the factors addressed in the following sections.
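To make concrete the point that such numerical guidelines cannot be interpreted apart from decision and comparative value, the brief Python sketch below computes a standardized effect size (Cohen’s d) for wholly hypothetical data; the figures and variable names are illustrative assumptions, not values drawn from any study cited above.

import math

def cohens_d(mean_treatment, mean_control, sd_treatment, sd_control, n_treatment, n_control):
    # Standardized mean difference using the pooled standard deviation.
    pooled_variance = (((n_treatment - 1) * sd_treatment ** 2
                        + (n_control - 1) * sd_control ** 2)
                       / (n_treatment + n_control - 2))
    return (mean_treatment - mean_control) / math.sqrt(pooled_variance)

# Hypothetical study: treatment group scores 2 points higher on a scale with SD = 10.
d = cohens_d(52.0, 50.0, 10.0, 10.0, 200, 200)
print(f"Cohen's d = {d:.2f}")  # 0.20, conventionally labeled "small"
# Whether d = 0.20 is "poor" or "good" cannot be read off the number alone;
# it depends on the costs, benefits, and decisions attached to the result.

The arithmetic yields the same value of d whether the intervention studied is costly or nearly free; only the surrounding context of decision and comparative value can determine how that value should be described or acted upon.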
How Much Evidence is Enough to Support an Intended Test Score Meaning?

Consonant with the viewpoints endorsed above regarding the inappropriateness of hard, quantitative criteria for nuanced substantive matters, the position taken here regarding validation evidence is that it is not possible to provide simplistic answers to the question of how much evidence is enough. Answers such as “One needs at least two pieces of validity evidence” or “Test scores can be deemed valid only when a validity coefficient is greater than or equal to .80” will rarely be adequate for a given context and could well lead to less thorough validation and justification efforts. The ineluctable conclusion regarding defensible validation practice is that it is a matter of professional judgment. That does not imply, however, that one judgment is equal to any other judgment. Rather, some judgments about the validity of intended score meaning are better than others; it is the factors considered in making those judgments and the evidence brought to bear that constitute the basis for commending one conclusion over another. Answering the question about how much evidence is sufficient to support conclusions about the intended meaning of a test score depends on consideration of the five factors described in the following sections, which generate five principles to guide validation efforts.
1. The Purposes of Testing

In the previous chapter, Table 5.1 provided a list of some of the diverse purposes served by testing. Along those lines, a first factor to consider in answering the question of how much evidence is sufficient for a given validation or justification effort is the purpose(s) of the test. Taking testing purpose into account comprises two facets: the nature of the intended test score inference and the seriousness of the consequences associated with errors regarding those intended inferences. The intended inferences to be made from test scores vary in their complexity and in the seriousness of arriving at incorrect conclusions about examinees regarding their standing on a construct. Accordingly, a first guideline regarding validation is: The extent of evidence to be collected and evaluated to support the intended inference should vary proportionally to the nature of the inference and the consequences of inferential errors.
As a simple illustration of this principle, it is instructive to compare the following five situations involving differing intended inferences: (1) a quiz to determine if a kindergarten student knows the ordinal numbers one through ten; (2) a survey of senior citizens regarding opinions about national defense spending; (3) an adult personality assessment of emotional engagement; (4) an adolescent screening for risk of self-harm; and (5) a licensure examination covering knowledge and skill for the safe and effective practice of radiology. The preceding list begins with the least complex intended inference and the least serious nature of an incorrect inference; each of the following entries in the list presents a situation intended to reflect increasing complexity of the measured construct and increased seriousness of an incorrect inference. In the first case, the intended inference would appear to be fairly straightforward and limited, as would the procedures for measuring the ability and the consequences of an inaccurate inference. Simply asking a kindergartener to count to 10 might suffice and, assuming that promotion to first grade, differentiated instructional planning, or some other more serious results were not associated with that knowledge, an incorrect inference would seem fairly benign. In the second situation, constructing items for a public opinion poll regarding senior citizens’ views on controversial national issues is perhaps somewhat more complicated than assessing counting skill. And, although pervasive errors could provide misguided input for national defense spending policy, any individual error (e.g., concluding that a survey respondent had a positive opinion when he or she actually had a negative opinion) would not seem terribly consequential. The third situation, personality assessment, presents a substantially more challenging context, both in terms of the complexity of the construct being assessed (e.g., personality factors; see Cattell, 1946; Cattell & Mead, 2008) and the development of instrumentation to assess the construct. In terms of the seriousness of errors, the ramifications of a mistaken test score inference would seem modestly more consequential. If a client were interested in greater self-awareness of personality—and assuming that hiring, job advancement, or other career or social opportunities did not hinge on the results—the most serious consequences of an incorrect inference (i.e., the client was not truly outgoing but more reserved in terms of emotional engagement) would seem to be dissonance, self-doubt, or other personal, emotional, or psychological discomfort on the part of the client. The fourth situation listed above regarding measurement of self-harm potential in
adolescents presents not only the same complexities of theory, operationalization, and instrument development as the personality assessment context, but also more serious consequences of incorrect inferences. On the one hand, there may be social stigma, misspent resources, and psychological harm associated with erroneously identifying an adolescent as at risk of self-harm (i.e., a false positive classification error). On the other hand, there could be literal life-and-death implications of a false negative classification error; that is, making an inference that an adolescent poses no risk of self-harm when such a risk is truly present. The fifth situation is provided to illustrate not only increased complexity of the construct at hand (i.e., competence for safe and effective practice of radiology), but also substantial consequences associated with errors regarding inferences about candidates’ standing on that construct. In this situation, the identification of critical, discriminating areas of knowledge and skill would be a necessary and complicated undertaking, but classification errors (e.g., false positive licensure decisions) would pose a broad threat to the health of a potentially large number of patients. These differing contexts provide a basis for deriving a two-fold principle related to the question of how much validity evidence is enough. First, more evidence supporting the validity of an intended inference is necessary when the construct being assessed is more complex: a more straightforward inference—that is, conclusions that require less of an inferential leap—requires less evidence; validation of inferences about a more multifaceted, theoretically complex, or practically challenging construct requires a more robust evidence collection effort. Second, the amount of evidence required is directly proportional to the consequences of an inaccurate inference: More evidence is needed when an inaccurate inference is judged to have more severe consequences for persons, organizations, or groups; less evidence is needed when the consequences of inaccurate inferences are less severe or even non-existent.
2. Quantity vs. Quantity

The second principle that guides how much validity evidence is sufficient to support an intended test score inference centers on the sheer amount of evidence that can (or should) be collected. This issue of quantity is two-fold, however. It may be recalled from Chapters 2 and 4 that some traditional categories exist for the sources of evidence that can be mined in support of an intended inference. In addition, there are diverse methodological options within each of those categories for the kinds of evidence that might be collected. Thus, there are two different issues related to quantity that must be examined, and a general principle can be articulated: Whenever appropriate and feasible, multiple sources of validity evidence should be gathered.
A corollary to this principle is that an abundance of validity evidence from one source does not compensate for diversity of evidence from multiple sources. An example from the area of achievement testing can illustrate this principle. Suppose that a test was developed to measure knowledge of algebra in the population of high school juniors intending to pursue post-secondary education. Further suppose that the inference proposed by the developers of the test was that scores reflect the algebra competence necessary for success in college-level mathematics courses. As stated, this intended inference suggests, at minimum, that two distinct sources of evidence must be investigated: (1) Evidence Based on Test Content, as a means of supporting inferences that performance on the test provides an indication of examinees’ standing vis-à-vis their algebraic knowledge and skill; and (2) Evidence Based on Hypothesized Relationships among Variables, in particular, perhaps, predictive validity evidence that scores on the test were related as predicted to the intended criterion (i.e., success in college-level mathematics courses). In the situation described above, the issue of Quantity versus Quantity becomes clear. An abundance of evidence based on test content would be desirable. For example, evidence might be gathered to demonstrate that the content of the test was tightly aligned to the curricula of high school algebra courses. Evidence might also be gathered that all of the important subareas of algebra (e.g., variables, operations and functions, types of algebraic equations, graphical representations, etc.) were covered by the test. Further evidence might document the qualifications of the test’s item writers and might demonstrate that the questions on the test had been adequately screened for substantive accuracy and clarity and for absence of bias, editorial errors, or unnecessary linguistic complexity. This information would bear strongly on conclusions about the content-based evidence for the meaning of scores on the test. However, even an overwhelming amount of positive evidence based on test content would not compensate for an absence of evidence related to the other critical component of the intended inference; namely, that the test was predictive of success in college-level mathematics courses. Such Evidence Based on Hypothesized Relationships among Variables might come from correlational studies of test performance and grades in subsequent mathematics courses, test performance and college instructor evaluations of students’ levels of preparedness, and so on. As this situation illustrates, there are two different quantities of evidence that are relevant. First, it is perhaps obvious that a greater quantity of evidence from a single source (e.g., Evidence Based on Test Content) provides greater support of the intended meaning of a test score than does only a single piece of evidence. Second, and typically more compelling as regards the intended test score meaning, comparatively modest evidence from a number of different sources would ordinarily provide stronger support than narrow and perhaps redundant evidence from only a single source. As feasible and appropriate to the intended inference, the question of how much evidence is enough is best answered by evidence that
spans diverse sources and converges on a coherent conclusion about the meaning of scores yielded by an instrument or procedure.
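As a minimal sketch of what the criterion-related strand of evidence in the algebra example might look like in practice, the following Python fragment computes a predictive validity coefficient as the correlation between test scores and subsequent college mathematics grades. The data, variable names, and sample size are hypothetical assumptions for illustration only; an actual study would involve far larger samples and attention to range restriction and other design issues.

from scipy.stats import pearsonr

# Hypothetical records for ten examinees: algebra test score and the grade
# (0-4 scale) earned in a subsequent college-level mathematics course.
test_scores = [610, 540, 720, 480, 650, 590, 700, 520, 630, 560]
course_grades = [3.3, 2.7, 3.9, 2.0, 3.5, 3.0, 3.7, 2.3, 3.4, 2.9]

r, p_value = pearsonr(test_scores, course_grades)
print(f"Predictive validity coefficient: r = {r:.2f} (p = {p_value:.3f})")
# Criterion-related evidence of this kind complements, but does not replace,
# content-based evidence such as alignment studies and item reviews.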
3. Quantity vs. Quality

The third principle for answering questions about the adequacy of evidence to be collected for a validation effort concerns evaluations of the comparative quantity and quality of the evidence gathered. Here, also, two different issues must be considered: the amount of evidence that can be gathered and the quality of each of those pieces of evidence. In planning a validation effort, it may be easier, less costly, or more efficient to collect some kinds of evidence than others. It would be of particular concern if the easiest, least costly, and most efficient evidence-gathering activities, which may have the potential to provide at most tepid support for an intended inference, were given preference as part of a validation effort, and more difficult, more costly, or less efficient sources of evidence were avoided. This concern can be articulated as a third principle: Evidence gathering that provides the strongest, direct support is preferred over evidence that is less directly relevant to the intended test score inference. There is also some additional guidance related to this principle. In validation efforts, one of the strongest pieces of evidence in support of an intended test score meaning can be derived from searching investigations for what has been called disconfirming evidence. The value of disconfirming evidence, a term that will be defined shortly, is especially noteworthy given the tendency in human behavior—including scientific endeavors—toward confirmation bias. In the field of psychology, confirmation bias is defined by Nickerson (1998) as “the seeking or interpreting of evidence in ways that are partial to existing beliefs, expectations, or a hypothesis in hand.” Nickerson goes on to offer the opinion that, “If one were to attempt to identify a single problematic aspect of human reasoning that deserves attention above all others, the confirmation bias would have to be among the candidates for consideration” (p. 175). Evans (1989) similarly described confirmation bias as “the best known and most widely accepted notion of inferential error to come out of the literature on human reasoning” (p. 41). Of course, inferential errors are a particular concern in testing, and many validity theorists have cautioned that confirmation bias can exert powerful effects in the planning and conduct of validation studies, subtly limiting validation efforts to sources of evidence most likely to support an intended inference and excluding sources that would present challenges to that inference. According to Kane, “most validation research is performed by the developer of the test, creating a natural confirmationist bias”
(2004, p. 140). To combat confirmation bias in testing, it has been routinely recommended that alternative, plausible inferences be considered as part of the validation effort. Again, according to Kane, “the basic principle of construct validity calling for the consideration of alternative interpretations offers some protection against opportunism, but like many validation guidelines, this principle has been honored more in the breach than in the observance” (2004, p. 140). Kane traces the recommendation—and the lack of alacrity in adherence to it—to Cronbach (1989), who observed that “despite many statements calling for focus on rival hypotheses, most of those who undertake [validation] have remained confirmationist. Falsification, obviously, is something we prefer to do unto the constructions of others” (p. 153). What Kane, Cronbach, and other sources—including this one—recommend is the gathering of what has been called potentially disconfirming evidence. Gathering potentially disconfirming evidence consists of asserting one (or more) of the most plausible rival hypotheses for the meaning of a test score and amassing evidence bearing on those potential alternative score meanings. Importantly, evidence gathering vis-à-vis the rival hypotheses should be conducted with the same enthusiasm as would be expended in collecting evidence in support of the intended meaning. Conceptualizing the possibilities for such evidentiary sources and mining them is among the primary obligations of those engaged in validation efforts. Cronbach (1980) has described the importance of disconfirming evidence in this way: “The job of validation is not to support an interpretation, but to find out what might be wrong with it. A [proposed score interpretation] deserves some degree of trust only when it has survived serious attempts to falsify it” (p. 103). Clearly, success in uncovering disconfirming evidence can reduce confidence in the intended score interpretation and may dictate reconsideration of the proposed interpretation—or substantial revisions to the test development process. Alternatively, the failure to support a rival hypothesis of an alternative score interpretation via the search for disconfirming evidence actually constitutes some of the strongest validity evidence in support of the intended inference.
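One way such a search for potentially disconfirming evidence might be operationalized is sketched below. The example is hypothetical in every respect (the rival hypothesis, the measures, the simulated data, and the comparison are assumptions for illustration, not a prescribed procedure): scores on a mathematics test under validation are correlated both with an intended criterion and with a measure implicated by a rival hypothesis, namely that the scores mainly reflect reading speed rather than mathematics ability.

import numpy as np

rng = np.random.default_rng(7)

# Hypothetical data for 200 examinees: scores on the test under validation,
# an intended criterion measure, and a construct-irrelevant measure posited
# by the rival hypothesis (reading speed).
math_test = rng.normal(50, 10, 200)
criterion = 0.7 * math_test + rng.normal(0, 7, 200)        # intended relationship
reading_speed = 0.2 * math_test + rng.normal(0, 10, 200)   # rival relationship

r_intended = np.corrcoef(math_test, criterion)[0, 1]
r_rival = np.corrcoef(math_test, reading_speed)[0, 1]
print(f"Correlation with intended criterion: {r_intended:.2f}")
print(f"Correlation with rival, construct-irrelevant measure: {r_rival:.2f}")
# A rival correlation approaching or exceeding the intended one would challenge
# the proposed score meaning; a weak rival correlation, obtained from a genuine
# attempt at falsification, counts among the stronger pieces of supporting evidence.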
4. Resources

It is unrealistic to answer the question of how much validity evidence is enough without reference to the extent of resources that are available for conducting the validation effort. Large, commercial test publishers may have existing personnel capacity to conduct large-scale validation efforts and the financial resources to search out abundant confirming and potentially disconfirming information. At the other extreme, a graduate student in psychology seeking to develop and use an
instrument as part of his or her dissertation may have access only to the professional assistance of an advisor, with available resources limited to a small research grant or no additional funding at all. It seems equally unrealistic for the commercial publisher to engage in only the extent of validation efforts that would be expected of the graduate researcher as it would be to hold that researcher to the same level of expectation as the commercial publisher. It is obvious that validation resources vary substantially across the diversity of test development and evaluation contexts, and the guideline suggested here might be reminiscent of a Marxist admonition: “From each according to his ability.” However, it should also be noted that a number of other factors must be considered, of which resources is only one—and in most cases likely not the most important factor. Torturing the Marxist allusion only slightly, a companion guideline might also be: “From each according to the need.” That is, in situations where serious consequences would follow from the inferences made from test scores but in which only scant resources can be allocated to the validation effort, the most defensible course of action might be to delay the validation effort until adequate resources are available, or to reconsider development and use of the test in the first place. In brief, the relevant principle related to sufficiency of validation evidence and resources is: Those responsible for designing and conducting the validation activities should allocate resources that are proportional to the complexity of the construct assessed, the consequences of incorrect inferences, and the entity’s capacity for conducting a rigorous validation effort. In summary, answering the question of how much evidence is enough involves consideration of validation resources. The process involves several elements, including analyses of what resources are available to support the validation effort, the relative capacity of an entity to provide the necessary resources, and the extent to which the available resources are judged to be adequate to yield the validity evidence needed to support the intended test score inferences.
5. Burden

A final consideration that comes into play in the evaluation of the extent to which a validation effort is sufficient is the burden associated with the effort. Engaging in the work of validating intended test score inferences comes at a cost. Beyond the element of resources just described, there are other burdens that come into play, and foremost among these are the burdens placed on individuals and organizations that participate in the validation effort. Validation efforts are bothersome to organizations. Time must be diverted from some organizational activities to aid researchers in the identification of appropriate validation
samples. Time and space must be allocated for researchers to administer tests under development to those samples and perhaps to administer criterion measures that may be essential to the validation effort. An organization may need to provide lab space or other suitable arrangements for the conduct of cognitive interviewing. In most circumstances, those conducting validation research may need access to demographic variables or other information maintained by the organization; facilitating secure access to such information also requires a commitment on the part of the organization. Finally, in many cases, collaboration will be necessary between an organization and those engaged in the validation effort to comply with the requirements of an Institutional Review Board (IRB), to ensure adequate protection of the rights of human participants. Validation efforts also place burdens on individuals. Some individuals may be asked to participate by responding to a test, survey, interview protocol, or other object of the validation effort. Such individuals are sometimes compensated for their time; in other situations, they may be asked to participate without compensation, as a professional service or in response to an appeal to civic or social betterment. Some individuals may be asked to participate by contributing their time and expertise to review test items, survey questions, scoring protocols, or other test materials for age/developmental appropriateness, bias/sensitivity, theoretical alignment, and so on. Even if compensated for their efforts, the most qualified individuals to perform such activities are often in high demand, and consent to assist the validation effort comes at a cost of having to balance other potential or existing obligations. In summary, the final principle for evaluating whether enough evidence will be gathered when engaging in validation of intended test score meaning concerns the burden that the evidence gathering will place on those involved in the effort: Defensible validation research balances the potential sources of evidence that might be mined and their value to the validation effort with the burden on individuals and organizations to participate in gathering that evidence.
Conclusions about Validation

Answering the question of how much evidence is sufficient for a given validation effort is not a matter of applying straightforward criteria. As Glass et al. (1981) have stated, any such criteria would necessarily be dissociated from a context of decision and comparative value and would have little inherent value. However, it is possible to provide guidelines that can be used to aid in the design and conduct of validation activities, as well as to provide the basis for evaluating a validation effort for sufficiency. In the preceding sections, those broader concerns of decision and comparative value were examined and five recommendations were provided for defensible validation practice (see Table 6.1). The recommendations explicitly incorporate attention to the context of the validation effort, the nature of the construct being
measured, the implications of inferential errors, and the interplay among these elements.

Table 6.1 Guidelines for evaluating the sufficiency of validity evidence

Purposes of testing: The extent of evidence to be collected and evaluated to support the intended inference should vary proportionally to the nature of the inference and the consequences of inferential errors.

Quantity vs. quantity: Whenever appropriate and feasible, multiple sources of validity evidence should be gathered.

Quantity vs. quality: Evidence gathering that provides the strongest, direct support is preferred over evidence that is less directly relevant to the intended test score inference.

Resources: Those responsible for designing and conducting the validation activities should allocate resources that are proportional to the complexity of the construct assessed, the consequences of incorrect inferences, and the entity’s capacity for conducting a rigorous validation effort.

Burden: Defensible validation research balances the potential sources of evidence that might be mined and their value to the validation effort with the burden on individuals and organizations to participate in gathering that evidence.
As is evident, that interplay necessarily involves trade-offs, values, and the exercise of professional judgment. For example, the validation of intended inferences regarding a more complex construct may require a greater diversity of evidentiary sources and more burden, but the particular sources may need to be determined with respect to an organization’s available resources and with effort to minimize the associated burdens. The answer to the question of “How much is enough?” must even allow for the conclusion that the validation effort should not be conducted at all. This may be particularly true when sufficient resources are not available to adequately support an intended inference that will have serious implications for examinees, organizations, or decision makers. In such a circumstance, the decision not to pursue a test development effort may well be preferable to an inadequate validation effort that provides a patina of psychometric propriety for a measurement instrument or procedure that misinforms test takers and test score users.
How Much Evidence is Enough to Support an Intended Test Use?

As with the answer to how much evidence is sufficient for a validation effort, no hard criteria exist—or should be formulated—to prescribe the kind and amount of evidence that should be gathered as part of the process of justifying an intended test use. Rather, answering the question of how much evidence is enough to support an intended test use also involves consideration of several factors, and of interplay among those factors, values, and professional judgment. The following sections present considerations to guide justification efforts; guidelines related to these considerations are brought together in Table 6.2.
Validity Evidence

In Chapter 2, the familiar dictum regarding the relationship between reliability and validity was cited: evidence of the reliability of test scores is a necessary prerequisite for considering the validity of those scores. It makes no sense to talk about the meaning of scores that may, in the worst-case reliability situation, reflect only random measurement error and no real differences in examinees’ standing on a construct of interest. Until evidence can be provided that a test measures something with dependability, claims that it measures anything are specious. A parallel situation exists with respect to validation and justification evidence, and a first guideline for evaluating the adequacy of a justification effort follows: The justification for an intended test use must first consider the strength of the validation evidence for the intended score meaning. Extending the relationship between reliability and validity to the assessment of whether it is justified to use a test for some intended purpose, it makes no sense to talk about the utility of test scores for which, in the worst-case validity situation, there is inadequate evidence regarding what those scores mean—or even disconfirming evidence that they cannot be interpreted to mean what they are claimed to mean. If the evidence in support of the intended meaning of test scores (i.e., their validity) is judged to be sufficient, and if it is concluded that the scores can be interpreted with confidence with respect to the construct of interest, then it is at least possible to contemplate using scores from the test for some intended purpose. Lacking such evidence, their use would not be defensible. This principle is illustrated in the comprehensive model for defensible testing shown in Figure 5.3. When the evaluation of the validity evidence is negative—that is, the evidence fails to adequately support the intended inference—then the test developer is faced with a “back to the drawing board” situation. That is, as shown in the model, the test developer must reconsider the intended score meaning and revise the test development process and validation activities accordingly. A related validity concern that should be taken into account when evaluating the strength of the case for justifying an intended test use is a comparison of the validity evidence for the alternative measures that could be used for the same testing purpose. In Chapter 5, it was noted that, when a specific test is proposed for a certain use, there are often other tests and non-test alternatives that might be used, and the alternative of not testing at all should also be evaluated. Each of these alternatives has (or should have) some body of validity evidence collected and evaluated for it, and evaluation of that evidence allows comparison of the alternatives in terms of the confidence they permit regarding the intended inferences to be
made. When alternatives exist, the extent of validity evidence supporting each of the alternatives provides one criterion for evaluating which, if any, of the alternatives is more justifiably used.
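The familiar classical test theory result behind this dictum can be made concrete: the observed correlation between test scores and a criterion cannot exceed the square root of the product of their reliabilities. The sketch below uses invented reliability figures to show how that ceiling might inform a comparison of alternative instruments; the instrument names and values are assumptions for illustration only.

import math

def validity_ceiling(reliability_test, reliability_criterion):
    # Classical test theory bound: the observed test-criterion correlation
    # cannot exceed sqrt(reliability_test * reliability_criterion).
    return math.sqrt(reliability_test * reliability_criterion)

# Hypothetical reliabilities for two candidate instruments and one criterion measure.
criterion_reliability = 0.85
for name, reliability in [("Alternative A", 0.92), ("Alternative B", 0.65)]:
    ceiling = validity_ceiling(reliability, criterion_reliability)
    print(f"{name}: maximum attainable validity coefficient = {ceiling:.2f}")
# Low score reliability caps how strong any criterion-related validity evidence
# can be, which is one reason reliability evidence is a prerequisite for, not a
# substitute for, validity evidence.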
Resources and Burden

The preceding section identified the psychometric criterion of validity as a necessary condition for justifying an intended test use. However, it is perhaps useful to consider that criterion as analogous to an ante for a hand of poker: it is enough to begin the process of justification, but it is insufficient to sustain the effort. Greater investments in the justification effort must be made. Like the validation effort, it is unrealistic to answer the question of how much justification evidence is enough without reference to the extent of resources that are available for conducting that effort. A review of the methodological menu for justification indicates that, as with validation, there can often be substantial costs involved and substantial resources may need to be allocated to activities such as:
• investigation of intended and unintended positive and negative consequences of test use;
• consideration of fairness issues;
• investigations into the appropriateness of alternatives to the proposed test; and
• the conduct of cost-benefit, cost-effectiveness, or cost-utility studies, including the costs to individuals or organizations if anticipated benefits of testing are delayed or overestimated.
Finally, and again similar to the validation effort, the burden created by the justification effort for individuals and organizations must be considered, along with the capacity of those engaged in the justification effort to gather the needed evidence, and the trade-offs that must be made when some sources of justification evidence are prioritized over others. Guidelines for the justification effort related to resources and burden are necessarily parallel to those for the validation effort:

Those responsible for designing and conducting the justification activities should allocate resources that are proportional to the consequences of test use for individuals or organizations, and the entity’s capacity for conducting a rigorous justification effort.

and

Defensible justification research balances the potential evidence that might be gathered and its value to the justification effort with the burden on individuals and organizations to participate in gathering that evidence.
Input

A fourth factor to be considered when evaluating whether the justification effort has been sufficient pertains to the adequacy of input obtained. As noted in Chapter 5, the identification of stakeholders is an essential element relevant to justifying the proposed use of a test. Several questions related to stakeholder input must be considered:
• Were all appropriate stakeholders identified?
• Were procedures designed and followed to solicit and obtain their input?
• Were those procedures effective?
• Was stakeholder input conscientiously considered in deciding on test use?
• Were evaluations conducted to ascertain stakeholders’ satisfaction with their ability to provide input and the extent to which their input was considered in the final use decision?

In brief, a fourth guideline for evaluating the adequacy of a justification effort is: The adequacy of justification evidence depends on the inclusiveness of identification of relevant stakeholders affected by the proposed use, and the conscientiousness of efforts to obtain and incorporate their perspectives on the use decision.
Need

It is often the case that a test is developed to support decisions that need to be made. Mehrens and Cizek (2001) have articulated the inescapable need to make categorical decisions in many diverse contexts, observing that:

Categorical decisions are unavoidable in many situations. For example, high school music teachers make such decisions as who should be first chair for the clarinets [and] college faculty members make decisions to tenure (or not) their colleagues. Each of those types of decisions … is unavoidable; each should be based on data; and the data should be combined in some deliberate, considered fashion. (p. 479)

They conclude that, “If categorical decisions must be made, it is arguably more fair, more open, more wise, more valid and more defensible when the decisions are based on explicit criteria” (p. 479). Accordingly, a justification guideline related to need can be formulated: In evaluating whether sufficient evidence is gathered in support of an intended test use, the need motivating that use and the relative value of alternatives must be considered.
Previously, the context of a selection procedure for an entering class of firefighter trainees was mentioned. In addressing the question of whether it is justifiable to use a newly developed test to aid in the selection, the issue of need must be addressed. That is, are there more applicants than training spots available? If only 28 candidates have applied for the 30 available training slots, there may be no reason to use a test as part of the selection process. That is, an alternative that should be investigated is the added value of using a specific test under consideration over using no testing mechanism at all. In other situations, it may be the case that no other selection mechanisms have been developed or the choice is limited to instruments that have not been validated for the specific intended use but for some related use. In such cases where there is a need to make decisions, the absence of alternatives may (weakly) provide justification support for the test use under consideration.
Consequences of Use

Alas, consequences. Perhaps no other element bears the same level of importance in justifying an intended test use as the consequences of using the test for the proposed purpose. A guideline for judging the adequacy of the body of evidence justifying a test use touches on the many, diverse aspects of this element that must be examined: Adequate evidence justifying an intended test use considers the likelihood, severity, and malleability of intended and unintended positive and negative consequences of the intended use. The consequences of a proposed test use may be positive or negative, intended or unintended. For example, consider a newly developed scale to measure depression; the scale was developed using an innovative method of item selection. An intended positive consequence of using the newly developed scale might be that it yields greater accuracy in identification of persons in need of intervention; an unintended positive consequence might be that it stimulates greater attention to depression among clinicians, the relevant client population, or public health policy makers. It is doubtful that in many cases a test user would desire an intended negative consequence to occur. However, unintended negative consequences can be detected following test use; for example, an unintended negative consequence of using the depression measure might be increased health care costs or the increased need for additional interventions related to feelings of stigma associated with the diagnosis. The severity and frequency of these unintended negative consequences should be included as part of the body of evidence for justifying the use of the test. Although, understandably, the greatest level of concern accompanies unintended negative consequences, the weighing of the body of evidence regarding justification of a test use is likely incomplete if unintended positive consequences are not also considered. Positive
consequences—even positive intended consequences—are often overlooked because of (understandable) motivations to identify and minimize negative consequences of test use (see Cizek, 2001). Extending the consideration of consequences of use for the depression scale, it may be that the observed clinical benefits of using the new scale prompt development or revision of scales for other constructs using the same innovative item selection procedures—an unanticipated positive consequence of the test’s use. The body of justification evidence is insufficient if the seriousness of classification errors is not also accounted for as part of the justification effort. For example, the purpose of testing and the need to consider the seriousness of classification or decision errors were mentioned earlier in this chapter in relation to the adequacy of validity evidence. Estimation of the proportions of false negative and false positive classifications not only bears on the confidence that can be sustained regarding the meaning of scores (i.e., their validity), but the same estimated proportions of decision errors must be taken into account as a factor in deciding whether that extent of errors can be justified if the test is used as intended. Notably, it is possible that, as part of the validation effort, the proportion of false negative and false positive errors might be judged to be small from a psychometric perspective. However, when justifying a test use, that small proportion of errors may be deemed acceptable for one proposed use, whereas the same small proportion may be deemed unacceptable for a different proposed use. For example, consider a test that yields scores evaluated to be valid indicators of the reading comprehension of third graders, with a criterion established above which students are classified as having a “Proficient” level of reading comprehension and below which they are classified as having a “Deficient” level of reading comprehension. Suppose that, in evaluating whether the scores are valid indicators of students’ standing on the construct, the relative likelihoods of false negative and false positive classifications were estimated and judged to be small. However, that degree of classification error might be evaluated differently when addressing the question of whether it is justifiable to use the scores, and answering the question of justifiability depends on the intended use. A modest frequency of classification errors might be judged to be tolerable and use of the test judged to be justifiable if the intended use was to screen students for additional reading intervention; the same frequency of classification errors might be judged as unacceptable—and, hence, the proposed use not justifiable—if the test was proposed for use as a third-to-fourth grade promotion requirement. Similar to consideration of the frequency of false negative and false positive classification decisions when evaluating whether it is justifiable to use a test is consideration of the kind of decision being made. For example, less justification evidence would typically be required for a test that is intended for self-information or proposed for formative, diagnostic purposes than for a test that is summative or used for selection or accountability purposes (see Table 5.1).
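To illustrate how estimated classification errors might be summarized for such an evaluation, the sketch below tallies false positive and false negative proportions from an invented cross-classification of 1,000 students; the counts, the cut score context, and the convention of treating a “Proficient” classification as the positive decision are all assumptions for illustration, not results from any actual study.

# Hypothetical cross-classification of 1,000 third graders, treating a
# "Proficient" classification as the positive decision.
proficient_classified_proficient = 700   # correct classifications
proficient_classified_deficient = 40     # false negatives
deficient_classified_proficient = 35     # false positives
deficient_classified_deficient = 225     # correct classifications

total = (proficient_classified_proficient + proficient_classified_deficient
         + deficient_classified_proficient + deficient_classified_deficient)

false_negative_proportion = proficient_classified_deficient / total
false_positive_proportion = deficient_classified_proficient / total
print(f"False negatives: {false_negative_proportion:.1%} of all examinees")
print(f"False positives: {false_positive_proportion:.1%} of all examinees")
# The same 3-4% error proportions might be acceptable for a low-stakes screening
# use but unacceptable for a grade-promotion requirement; that judgment belongs
# to the justification effort, not to the arithmetic alone.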
In addition to consideration of the kind of decision is the extent to which the decisions are malleable. That is, the same modest body of evidence could be considered to be adequate if the justification effort concerned the use of a test in situations where decisions were only tentative, but insufficient if the decisions were not subject to reconsideration. In general, to the extent that the decision is not irrevocable, to the extent that decisions are not made exclusively based on performance on a single test (i.e., they are based on multiple measures or multiple sources of information are considered), or to the extent that processes are in place to allow for decision appeals, a more modest body of justification evidence may suffice because the risks associated with use are less consequential. Finally, a factor that must be considered along with the intended or unintended consequences of a proposed test use is the probability of any unintended uses and the consequences associated with those uses. Some unintended uses may be reasonably innocuous when the same body of validity evidence is relevant to both uses or when a strong similarity exists between the intended use and the unintended use. In such cases, the body of evidence justifying the intended use might reasonably apply to the unintended use. However, in many cases, such “off-label” uses are not defensible and admonitions against using a test for unsupported purposes should be documented as part of the justification effort. These principles are expressed in the commentary accompanying Standard 1.3 of the Standards for Educational and Psychological Testing: If past experience suggests that a test is likely to be used inappropriately for certain kinds of decision or certain kinds of test takers, specific warnings against such uses should be given. Professional judgment is required to evaluate the extent to which existing validity evidence supports a given test use. (AERA, APA, & NCME, 2014, p. 24)
Table 6.2 Guidelines for evaluating the sufficiency of justification evidence

Validity: The justification for an intended test use must first consider the strength of the validation evidence for the intended score meaning.

Resources: Those responsible for designing and conducting the justification activities should allocate resources that are proportional to the consequences of test use for individuals or organizations, and the entity’s capacity for conducting a rigorous justification effort.

Burden: Defensible justification research balances the potential evidence that might be gathered and its value to the justification effort with the burden on individuals and organizations to participate in gathering that evidence.

Input: The adequacy of justification evidence depends on the inclusiveness of identification of relevant stakeholders affected by the proposed use, and the conscientiousness of efforts to obtain and incorporate their perspectives on the use decision.

Need: In evaluating whether sufficient evidence is gathered in support of an intended test use, the need motivating that use and the relative value of alternatives must be considered.

Consequences: Adequate evidence justifying an intended test use considers the likelihood, severity, and malleability of intended and unintended positive and negative consequences of the intended use.
Commonalities and Conclusions

Although different guidelines are relevant to validation and justification efforts, they share many commonalities. One commonality is that, regardless of whether the inquiry concerns validation or justification, marshaling as much expertise and information as possible provides the best foundation for decisions about intended inferences or uses. Further, the defensible practice of testing affirms that conclusions about score meaning or use are never truly final. It is also recognized that resources, timelines, and other practical realities dictate that the validation and justification efforts must result at some point in at least tentative judgments about test score meaning and use. However, even when tentative conclusions about meaning and decisions about use are made, best practice requires that those engaged in the validation and justification efforts be ever-vigilant for the emergence of additional relevant evidence bearing on those conclusions and decisions, and open to disconfirming evidence and alternative interpretations of the evidence. Also common to both the validation and justification efforts is that they are grounded in the priorities, policies, power, resources, and ethical considerations of those involved, and both require the application of expertise, evaluation of evidence, and values. Because of this, rigid standards specifying the sources or extent of evidence that are “enough” are not realistic or perhaps even desirable. Nonetheless, reasonable and appropriate guidelines can still be discerned that provide guidance on the factors relevant to making judgments about the sufficiency of validation and justification evidence. Defensible validation and justification efforts explicitly recognize the factors that should be considered and the contextual differences that determine whether greater or lesser amounts of evidence are needed.
As noted previously, another commonality is the recognition that conclusions about validation and justification are never truly final. As the theory regarding a construct develops, as the procedures for its measurement are revised, as the populations in which the instrumentation is used change, as the intended inferences are refined, as the stakeholders, stakes of testing, and intended test uses evolve, and as the consequences of those uses are observed, so must the bodies of evidence supporting those inferences and uses be revisited and supplemented. Finally, framing the questions of adequacy for the validation and justification efforts in terms such as those in the title of this chapter (“How much is enough?”) might connote that it is the quantity of evidence that is the primary consideration. Surely, the greater the amount of evidence that can be brought to bear, the more confidence can be had in conclusions about intended test score meaning or use. However, the quality of that evidence is not a secondary consideration. For example, concrete evidence on a single, severe, negative consequence of testing may be more dispositive than numerous cost-benefit studies, evidence of stakeholder input, documentation of due notice, and evidence of superiority vis-à-vis other alternative testing methods, formats, and procedures. In conclusion, validation of intended score meaning and justification of intended test use both require the gathering and synthesis of the bodies of evidence that bear on the different questions addressed by those efforts. Invoking Messick’s (1989) view, evaluations of the adequacy of evidence for drawing conclusions about the validity of test score interpretations or the justification of test use are ultimately professional, evaluative judgments informed by those bodies of evidence—that is, by the synthesis and critical appraisal of the relevant logical, theoretical, and empirical information related to the key questions of meaning and use. For both efforts, it is important to assess whether those bodies of evidence are sufficient to support an intended score meaning or a proposed test use. This chapter has described guidelines to assist test developers and test users in determining if the body of evidence gathered is adequate for the conclusions to be made. Although the guidelines are intended to provide clarity and direction for the validation and justification efforts, they will surely require revision and reconsideration in light of future developments in the theory and practice of testing. In the final chapter of this volume, some overall conclusions related to the comprehensive approach to defensible testing are presented, and reflections on additional research and development work needed to further advance validity theory and practice are provided.
7 CONCLUSIONS AND FUTURE DIRECTIONS
Validation has changed appreciably in the past 60 years … It is difficult to know what validation will be like in the future, but changes in the concept of test validation are likely to parallel dynamically the continuing maturation of our science. (Geisinger, 1992, p. 219)
For over half a century, validity has uniformly been afforded the highest place among concepts central to the theory and practice of testing. Among other accolades, it has been labeled “the cardinal virtue in assessment” (Mislevy, Steinberg, & Almond, 2003, p. 4). For a concept of such esteem, it is remarkable that, in comparison to unanimity about its place, there has been such disagreement about its meaning. In the broadest terms, this volume represents an attempt to reconcile those disagreements. To that end, key areas of agreement discernible in contemporary thinking about validity have been reviewed, and a recommendation for reconciling areas of disagreement in the form of a comprehensive framework for defensible testing has been presented. The following sections of this concluding chapter correspond to three goals: (1) to briefly summarize key aspects of the comprehensive framework; (2) to describe the benefits of renewed alacrity regarding validation of score meaning and justification of score use; and (3) to reflect on additional research and development work needed to further advance validity theory and practice.
A Comprehensive Approach to Defensible Testing

For testing of any sort to be most defensible, a complete and coherent effort that gathers and evaluates two bodies of evidence must be conducted. That effort must address the two essential and distinguishable questions: (1) “What do these scores mean?” and (2) “Should these test scores be used for X?” The first question—the validation question—is answered by gathering and evaluating evidence regarding intended test score inferences. The second question—the justification question—is answered by gathering and evaluating evidence regarding intended test score uses. The approach presented in this volume differentiates between these two equally important questions, describes the different sources of evidence related to each question, and details a coherent process for answering both questions as part of a unified, comprehensive approach to defensible testing (see, e.g., Figure 5.3).
At least one consequence of the comprehensive approach described here is that some modest reconsideration of the sources of evidence for validity is necessary. The extant guidelines in the current Standards are the result of decades of refinement and comprise a valued psychometric tradition for establishing validity. Recommendations for further refinement of those guidelines were presented in Chapter 4. Perhaps more importantly, the framework presented in this volume highlights the absence of parallel traditions or formalized guidance for gathering and evaluating evidence to justify a proposed test use. A beginning proposal for such a framework, grounded in some of the established traditions in the field of program evaluation, was described in Chapter 5. Only when the two key questions above are taken together and addressed adequately via the collection and evaluation of evidence relevant to each concern are claims of defensibility supported.
The Benefits of a Comprehensive Approach

A comprehensive framework for defensible testing provides a much-needed solution to what have been intractable problems in modern validity theory. The framework resolves the lingering problems first introduced by the conflation of the two key questions—a forced marriage that was doomed from the outset. It provides a theoretical home for issues such as consequences of testing, opportunity to learn, and the contributions of test development and administration processes. Fundamentally, it provides two things that modern validity theory has lacked: (1) a conceptually sound definition for and guidelines regarding what is acknowledged to be the most essential concern in testing—validity; and (2) a conceptually aligned definition and guidelines for what must be viewed as an equally essential concern—justification for the use of a test.

Beyond the theoretical benefits of the comprehensive framework are substantial practical benefits. If not sufficiently highlighted in this volume, it is important to acknowledge here: actually doing the work of validating an intended test score meaning and justifying an intended test use is a daunting task. As the saying goes, “If validation and justification were easy, everyone would be doing it.” At least one intended pragmatic benefit of the comprehensive framework is that it will provide a path that actually is easier than that demanded by much of modern validity theory. The chapters in this book have suggested a sound, practical course that can be followed—one that yields stronger evidence that rigorous validation and justification efforts have been conducted, and greater confidence in the meaning and use of test scores. It is hoped that adoption of the comprehensive framework presented here will have the applied result of actually facilitating everyone doing it.

Finally, over the course of the history of validation research, much of the theoretical and practical work on validity has been conducted as a somewhat balkanized enterprise; the concepts and practices of validity are sometimes represented differently according to context,
whether student achievement testing, industrial/organizational psychology applications, or professional credentialing examinations. The framework and guidelines described here have intentionally been devised as unifying and applicable to a diversity of applications, including measuring educational achievement, personnel guidance and selection, licensure and certification decisions, and other areas.
Future Research and Development in Validity Theory and Practice

One certainty about a volume on validity is that it will soon be outdated. As the six editions of the Standards for Educational and Psychological Testing published over the past 60 years demonstrate, validity theory is continually evolving. This section contains both speculations and recommendations for what that evolution might look like, and what additional research and development efforts will be needed in support of those changes.
Justification of Test Use

Chapter 5 of this volume presented an initial framework for considering the second key question that must be answered as part of a sound approach to defensible testing: “Should scores yielded by a test be used for a given purpose?” That is, what is the evidence justifying the proposed use? It is asserted here that formal procedures and accepted standards for relevant evidence bearing on justification are lacking. Surely, much work remains to be done beyond what has been offered in this book. It is likely that the greatest opportunity for advances in an overall approach to defensible testing lies not in refinements to procedures or expanded evidentiary sources for validation, but in the establishment of broadly acceptable traditions and warrants for the justification of test use. To that end, a trite admonition seems to capture all that should be said: “More research is needed.”
Validity and Classroom Assessment

At least implicitly, the lion’s share of validity research and development has been conducted with large-scale testing (e.g., college admissions testing, licensure and certification testing, personnel selection testing, K–12 achievement testing, etc.) in mind. That focus has yielded necessary and important contributions to what is known about the validity of scores obtained from consequential measurement procedures including cognitive ability tests, job knowledge tests, personality tests, integrity tests, skills assessments, emotional intelligence tests, and physical ability tests that are routinely used in diverse contexts. Just a few obvious examples of these large-scale applications include SAT and ACT tests, structured job interviews, cognitive ability tests such as the Wechsler intelligence scales (Wechsler, 2008; 2014), and statewide every-pupil accountability testing programs mandated by the No Child Left Behind
(2002) and Every Student Succeeds (2015) acts. Considerably less attention has been afforded to the validity of classroom assessments.

A review of the Standards strongly suggests the conclusion that they were written primarily—or exclusively—to apply to large-scale testing. Only a few authors have provided guidance related to the validity of classroom assessments in K–12 contexts (see, e.g., Brookhart & McMillan, 2020; Cizek, 2009; Popham, 2017). Regarding post-secondary contexts, the topic of assessment has received much attention (see, e.g., Brown, Bull, & Pendlebury, 2013; Heywood, 2000), but that attention has typically given little or no consideration to validity. The peer-reviewed journal Assessment & Evaluation in Higher Education routinely includes articles related to validity in post-secondary contexts, although much of that work has traditionally been centered on the validity of student ratings of instructional effectiveness and much less so on the validity of tests, quizzes, assignments, projects, examinations, and grades in college and university settings.

Moreover, as highlighted in the comprehensive approach presented in this volume, defensible testing at any level or scale must attend to both validation and justification. The modest work to date relevant to classroom assessment has focused almost exclusively on the former. Justification for classroom assessment has typically been narrowly defined by the need to assign grades or, more recently, by an interest in obtaining formative information about learning for teachers and students (see Andrade, Bennett, & Cizek, 2019; Andrade & Cizek, 2010). Research and development related to the validity of classroom assessment information should continue. However, given the importance of classroom assessment for promoting student motivation, for providing formative feedback, for informing decisions regarding student placement and referrals for special services, and for broadening educational and career opportunities for students, it is easy to see that greater attention to justification of classroom assessment use should be a top priority in future research and development.
Group vs. Individual Validity

Another characteristic of nearly all theoretical and applied attention to validity has been a focus on the validity of scores for groups of examinees. Much good work in this area has been done, although future attention to other aspects of validity for groups of examinees is needed. For example, reporting on validity can benefit from more description of the samples used for gathering validity evidence, such as the cultural, organizational, and political contexts in which the evidence was gathered, and more complete description of the characteristics of the examinees: their levels of motivation, their preparedness for testing, their understanding of the purposes of the testing, and the ways in which the test results will be used. Surely, overall conclusions—supported by evidence—regarding the general validity of
scores yielded by a measurement process are valuable. Indeed, even this volume has considered validation and justification primarily with an implicit focus on supporting appropriate interpretations and uses in groups of examinees. However, given the disproportionate attention to validity in the aggregate, and given how many factors can have substantial effects on the validity of individual scores, it seems warranted that increased attention be given to those factors and to methods for identifying and responding to sources of invalidity in those contexts.

It appears that the Standards are beginning to take note of this issue. Although much of that document actually focuses on the validity of group scores, the Standards include some encouragement for testing specialists to attend to validity of individual scores as an issue of fairness. According to the Standards: “It is important to keep in mind that fairness concerns the validity of individual score interpretations … ” (2014, p. 53).

Increasingly, the issue is also garnering the attention of researchers, and there are several examples of work that has focused on the validity of individual scores. For one, a somewhat recent area of specialization in the field of measurement is that of data forensics (see Cizek & Wollack, 2017; de Klerk, van Noord, & van Ommering, 2019). Wollack and Fremer (2013) describe data forensics as the statistical analysis of test takers’ response data with the aim of detecting aberrances. Data forensics applied to testing has focused primarily on analyses to reveal aberrances attributable to test security breaches and cheating. Such analyses—particularly analyses targeting answer copying—have been used to investigate the validity of individual scores. Other applications of data forensics to investigate the validity of individual scores beyond situations in which unethical behaviors are suspected are not yet common; however, there is reason to believe that a data-forensic orientation might also be helpful in detecting other sources of invalidity.

A second example of attention to the validity of individual scores is suggested by the work done to promote accurate measurement for examinees with disabilities or native speakers of languages other than the language used on a test. Chapter 1 of this volume provides a brief introduction to the concept of testing accommodations, which are routinely considered for individual examinees in both educational and credentialing assessment contexts. It is more than just sound testing practice: the Americans with Disabilities Act requires that “testing entities must ensure that the test scores of individuals with disabilities accurately reflect the individual’s aptitude or achievement level or whatever skill the exam or test is intended to measure” (U. S. Department of Justice, 2014, p. 4). Much work remains to be done, though, with respect to the precise accommodations that are relevant across the range of specific disabilities and to the range of severity that exists within a specific disability category. For example, it is often recommended that examinees diagnosed as having a learning disability be afforded an accommodation such as one-and-a-half or twice the prescribed time to complete an examination. However, it is certain that such
a blanket accommodation is insufficient for some examinees and inappropriately advantages others. Accommodations research has not yet advanced to a state that allows more precise matching of the most common accommodation (i.e., a testing time accommodation) to specific disabilities, much less matching of other types of accommodation to a range of severity of other disabilities.

Finally, the lines of research pursued by Wise (2014, 2017) and others have great potential for improving confidence in the validity of individual scores. This research represents a relatively new approach to validity and includes the development of methods to detect unmotivated test taking, lack of effort or engagement, rapid guessing, and other examinee behaviors that can signal that a test score may not be an accurate reflection of the examinee’s level of knowledge, skill, or ability.
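To make the kind of screening just described more concrete, the short sketch below illustrates one simplified way that response-time data might be used to flag possible rapid guessing. It is only an illustration under assumed conditions, not a description of Wise’s actual procedures: the fixed three-second cutoff, the data layout, and the function names are hypothetical, and operational methods typically derive item-specific thresholds rather than using a single global one.

```python
# Minimal illustration (not Wise's published procedure): flag responses answered
# faster than an assumed cutoff and summarize an examinee's apparent effort.

RAPID_GUESS_CUTOFF_SECONDS = 3.0  # assumed cutoff; operational methods use item-specific thresholds


def flag_rapid_guesses(response_times_seconds):
    """Return True for each response answered faster than the cutoff."""
    return [t < RAPID_GUESS_CUTOFF_SECONDS for t in response_times_seconds]


def response_time_effort(response_times_seconds):
    """Proportion of responses NOT flagged as rapid guesses (higher suggests more apparent effort)."""
    flags = flag_rapid_guesses(response_times_seconds)
    return 1.0 - sum(flags) / len(flags)


# Hypothetical examinee: the last three items were answered in about a second each.
times = [42.1, 37.5, 55.0, 1.2, 0.9, 1.1]
print(flag_rapid_guesses(times))    # [False, False, False, True, True, True]
print(response_time_effort(times))  # 0.5 -- half the responses look like rapid guesses
```

In practice, an individual score with apparent effort below some defensible level might be flagged for further review rather than interpreted at face value.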
Validation, Justification, and Theories of Action

A topic of recent attention in assessment is that of theories of action. It is unclear when the concept was first introduced. A 1999 National Research Council report providing guidance on educational testing identifies theories of action as being common in the field of business and industry and describes a theory of action as a “big picture” in educational endeavors that “animates the entire system” (p. 15). Among others, Bennett (2010) and Sireci (2015) have advocated for theories of action as essential validity evidence for both summative and formative assessments. In essence, according to Bennett (2010), a theory of action comprises the four components listed below. Following each component mentioned by Bennett, the concern addressed by the component is indicated in brackets: (1) a description of the components of the assessment system and a logical, theoretical, or empirical rationale for the inclusion of each component [system coherence]; (2) clear statements regarding the claims that will be made from test scores [validation]; (3) the intended effects of the assessment system and the mechanism accounting for the asserted causal relationships [consequences of use/justification]; and (4) an accounting of the likely unintended negative consequences of testing and what is done to mitigate them [justification of test use].

Certainly, increased explicitness regarding the intended effects of a testing program and a description of plans for detecting and avoiding negative consequences would appear to be reasonable; the question, of course, is how those are related to validity. Regrettably, however, much of the guidance to date concerning theories of action and testing has perpetuated confusion between validation and justification and conflation of meaning and use. The reference to theories of action in the current Standards is particularly troublesome:
It is important to note that the validity of test score interpretations depends not only on the uses of the test scores but specifically on the claims that underlie the theory of action for these uses. (pp. 19–20; see also Standard 1.6)

From the perspective of the two key questions that must be answered to promote defensible testing, this commentary in the Standards is perplexing. It is actually important to note that the validity of test score interpretations does not depend on their use. Referring back to an example provided previously: if an end-of-year test covering a fourth grade mathematics curriculum is developed with the intention to support claims about students’ mastery of the content of fourth grade mathematics (e.g., fractions, operations, geometry, etc.), and assuming that adequate evidence is collected to support that intended inference, then that meaning is unaffected by whatever use is made of the scores. The validity of those interpretations depends on the evidence gathered in support of those claims, not on a theory of action regarding their use.

What is important is that support must be gathered for the intended uses of test scores in addition to support for the score meaning, and that these distinct concerns require different evidence-gathering activities. The mathematics test scores might be justifiably used as some percentage of students’ final grades. They might be used for determining eligibility for special mathematics interventions. They might be used as part of a process for promotion to fifth grade. Whatever use is intended requires evidence justifying that use, but whatever use is contemplated, the meaning of adequately validated scores is unchanged; the scores still only tell us something about the level of students’ mastery of the mathematics curriculum.

Although the phrasing is muddled in the current Standards, a recent policy statement adopted by the National Council on Measurement in Education (NCME) provides a somewhat clearer picture of how explicit statements about theories of action can be helpful. The NCME policy recommends formulation and documentation of formal statements of theories of action whenever “program sponsors, designers, or users intend a testing program to effect change” (2018, p. 1). Such statements and documentation should include “evidence in support of the expected (causal) relationships among the program’s constituent parts, the implementation actions, and the intended outcomes, and plan[s] for ongoing evaluation to detect and mitigate unintended, negative consequences” (p. 1). From a program evaluation perspective, such guidance seems wholly appropriate. It clearly behooves an organization responsible for a testing program to understand whether the program is accruing intended benefits or prompting anticipated changes. To the extent that the intended benefits or changes occur, that may be a source of evidence for justifying the use of the test. However, it is also possible that the benefits and changes are unrelated to the
presence of the testing program. Herein lies the two-fold caution regarding theories of action. First, as indicated previously, theories of action are relevant only to the aim of justifying a test use, not to validating intended score inferences. To be most useful, it is advisable that adoption and dissemination of a theory of action be focused on the justification aim. Second, the parenthetical insertion of the term “causal” in the NCME position statement on theories of action highlights a reality that is too often overlooked: adoption of a theory of action requires a conscientious, two-part research agenda. The first part of such an agenda is focused on careful study and identification of the factors (including, but not limited to a test) that may contribute to an intended benefit or change. Second, subsequent research is required to support any claim that a test, if purported to be a causal agent, is in fact causally linked to the outcome. Establishing such a causal relationship is typically challenging. There are frequently not only multiple candidates for the causal factor but, as Messick has noted, any single result “almost certainly is multiply determined” (1975, p. 955). That is, organizations typically adopt more than a testing program to achieve their aims. Implementation of professional development opportunities, institution of professional recognition initiatives involving honors or awards, delivery of innovative instructional interventions, changes in organizational leadership or structure, changes in incentives or other consequences associated with test performance: each of these actions (and many others) might individually account for a benefit or change that could be wrongly attributed to a test. Even more likely is that some combination of several of these factors—including a test or not—may be causal. Thus, beyond merely asserting a theory of action and assuming that any changes are attributable to administration of a test, it is a responsibility of the organization to rigorously test that hypothesis. Two implications of the causal nature of theories of action follow. First, some current approaches to a theory of action would appear to make implicit assumptions about singular causation. Thus, it would seem that theories of action are likely to be most feasible, powerful, and useful when they are applied to systems that are the simplest. They are likely to be less useful—or at least considerably more challenging to examine—in complex systems. As applied to testing and its effects on public education systems, a theory of action is sometimes suggested that asserts a positive relationship between a test use and improvement of the public education system or outcomes. In such a case, a first challenge is to identify all of the factors that could be contributing to improvement; a second challenge is isolating the effect of one of those factors (i.e., the test). Given that system-level effect sizes in education are generally very small, and given that there are likely to be many other variables beyond testing that contribute more powerfully to that small effect size, in many cases it may be reasonable to conclude that the research and resources required to confirm the theory of action are not justifiable. A second implication is that those responsible for testing programs should be cautious
about any claims that are made related to outcomes. For example, it may be sufficient to claim that the fourth grade end-of-year mathematics test results in accurate information about student learning in that subject without making additional claims that the presence of the test will improve mathematics instruction, better prepare those students for post-secondary careers, or enhance the global economic competitiveness of the United States; similar claims have surely been asserted by those not responsible for designing and conducting the validation work to support them. Along the same lines, it may be sufficient to make two claims about a licensing examination for physicians: (1) that it accurately measures the knowledge, skills, and abilities deemed essential for the practice of medicine and (2) that it provides protection to the public against unsafe or ineffective practitioners—without asserting an additional claim that the exam improves medical education. Should a driver’s license test promote more positive attitudes toward the conservation of natural resources, or is it enough to claim that it aids in identifying potential drivers who don’t know the rules of the road?

In summary, theories of action may be useful in some contexts, but caution and further development are advisable. Specifying a theory of action may be useful to researchers interested in identifying the variables (of which testing would be one) associated with an outcome of interest, and potentially isolating any causal contribution of one or more of those variables. Formulation of a theory of action should not be done cavalierly, but should be done with regard for the validation implications of any claims beyond the primary claim of what the scores on a test are intended to mean.
Toward a Truly Comprehensive Framework

The comprehensive framework for defensible testing provided in this volume has paid attention primarily to the validation of intended test score meaning and justification of intended test use. Along the way, another important characteristic of test scores—the dependability of those scores—was considered only briefly. Mention of reliability consisted mainly of noting the position that, before consideration of the validity of test scores can even begin, some evidence is first necessary that the test produces scores with an adequate degree of reliability. An entirely separate volume could be written—indeed, much has already been written—on the topic of the dependability of test scores (see, e.g., Haertel, 2006; Shavelson & Webb, 1991; Traub, 1994). A truly ambitious conceptualization would address the topic of reliability; it would fully and formally assimilate reliability into a comprehensive framework for defensible testing. The most likely pathway for incorporating reliability would involve subsuming it under the umbrella of validity. The potential for locating reliability within the superordinate concept of validity was alluded to only obliquely in this volume. For example, it was noted in Chapter 2
that some indices routinely used to express reliability (e.g., coefficient alpha, KR-20) can provide evidence based on internal structure—a source of evidence noted in the Standards as bearing on validity. Other aspects of traditional reliability estimation portend the possibility that a stronger and more unified theoretical relationship to validity exists. For one example, weak-to-modest estimates of test/retest reliability suggest that changes in examinee populations or temporal issues may affect interpretations of scores—a validity issue. As another example, parallel forms or equivalent forms reliability estimation procedures that produce weak-to-modest coefficients might signal concerns about the test development process or hint that the forms may not yield scores with the same meaning—again, a validity concern. A third example supporting the possibility of a more unified framework that incorporates both reliability and validity is the well-known formula expressing the upper bound of a criterion-related validity coefficient as the square root of the product of the reliabilities of the predictor and criterion measures (the formula is restated in a brief note at the end of this chapter).

Overall, there are many signs that validity theory will continue to evolve, with one of the most attractive possibilities being a truly “unified” view of measurement—one that fully integrates attention to the value of replication, validation, and justification. Full consideration of such a grand theory is (far) beyond the scope or intention of this book, but desirable nonetheless.

The main hope for this volume has been to provide a sound alternative to an unworkable conceptualization of validity that has conflated the key questions of test score meaning and use. The only apparent results of that conceptualization have been continuing controversy and academic hand-wringing about arcane semantic or epistemological nuances that are largely unrelated to the real goals of testing: to obtain information about test takers that is dependable, fair, accurate, and useful. Real progress will only be made when theoretical confusion is reduced and controversies subside, allowing improved validation and justification efforts to occur. The comprehensive framework described here was motivated by a single overarching goal: to provide a coherent approach and guidelines for gathering the information needed by those who use test scores to inform important decisions. The future direction of research and development in validation and justification should be judged mainly by how well that goal is accomplished.
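A note on the upper-bound relationship mentioned above: the brief sketch that follows restates it formally. The notation is a standard classical test theory convention assumed here rather than drawn from this volume, with the observed predictor-criterion correlation and the reliabilities of the predictor and criterion measures written as shown in the comments.

```latex
% Correction for attenuation: the correlation between true scores on predictor X
% and criterion Y equals the observed correlation divided by the square root of
% the product of the two reliabilities.
\[
  \rho_{T_X T_Y} = \frac{\rho_{XY}}{\sqrt{\rho_{XX'}\,\rho_{YY'}}}
\]
% Because no correlation can exceed 1, the observed criterion-related validity
% coefficient is bounded above by the square root of that product:
\[
  \rho_{XY} \le \sqrt{\rho_{XX'}\,\rho_{YY'}}
\]
% Worked example: with predictor reliability .81 and criterion reliability .64,
% the validity coefficient cannot exceed sqrt(.81 * .64) = .72.
```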
REFERENCES
Ainsworth, M. A., Rogers, L. P., Markus, J. F., Dorsey, N. K., Blackwell, T. A., & Petrusa, E. R. (1991). Standardized patient encounters: A method for teaching and evaluation. Journal of the American Medical Association, 266(10), 1390– 1396. doi:10.1001/jama.1991.03470100082037 Altschuld, J. W., & Watkins, R. (2014). A primer on needs assessment: More than 40 years of research and practice. New Directions for Evaluation, 144, 5–18. American Educational Research Association, American Psychological Association, & National Council on Measurement in Education [AERA, APA, & NCME]. (1999). Standards for educational and psychological testing. Washington, DC: American Educational Research Association. American Educational Research Association, American Psychological Association, & National Council on Measurement in Education [AERA, APA, & NCME]. (2014). Standards for educational and psychological testing. Washington, DC: American Educational Research Association. American Psychological Association. (1954). Technical recommendations for psychological tests and diagnostic techniques. Washington, DC: Author. American Psychological Association. (1966). Standards for educational and psychological tests and manuals. Washington, DC: Author. Andrade, H., Bennett, R., & Cizek, G. J. (Eds.) (2019). Handbook of formative assessment in the disciplines. New York: Routledge. Andrade, H., & Cizek, G. J. (Eds.). (2010). Handbook of formative assessment. New York: Routledge. Andrich, D., & Marais, I. (2019). A course in Rasch measurement theory: Measuring in the educational, social and health sciences. New York: Springer. Austin, J. T., & Villanova, P. (1992). The criterion problem: 1917–1992. Journal of Applied Psychology, 77(6), 836–874. doi:10.1037/0021-9010.77.6.836 Bax, S. (2013). The cognitive processing of candidates during reading tests: Evidence from eye-tracking. Language Testing, 30(4), 441–465. doi:10.1177/0265532212473244 Beck, A. T., Steer, R. A., & Brown, G. K. (1996). Beck depression inventory-II. Bloomington, MN: Pearson Assessments. Bennett, R. E. (2010). Cognitively based assessment of, for, and as learning: A preliminary theory of action for summative and formative assessment. Measurement: Interdisciplinary Research and Perspectives, 8, 70–91. Bock, R. D., Gibbons, R., Schilling, S. G., Muraki, E., Wilson, D. T., & Wood, R. (2008). TESTFACT 4: Test scoring, items statistics, and full-information item factor analysis. Chicago, IL: Scientific Software International. Borsboom, D. (2005). Measuring the mind: Conceptual issues in contemporary psychometrics. Cambridge: Cambridge University Press. Borsboom, D., & Mellenbergh, G. J. (2007). Test validity in cognitive assessment. In J. P. Leighton & M. J. Gierl (Eds.), Cognitive diagnostic assessment for education (pp. 85–116). Cambridge: Cambridge University Press. Borsboom, D., Mellenbergh, G. J., & van Heerden, J. (2004). The concept of validity. Psychological Review, 111, 1061– 1071. Boughton, K. A., Smith, J., & Ren, H. (2016). Using response time data to detect compromised items and/or people. In J. Wollack & G. J. Cizek (Eds.), Handbook of quantitative methods for detecting cheating on tests (pp. 177–190). New York: Routledge. Brannick, M. T., Chan, D., Conway, J. M., Lance, C. E., & Spector, P. E. (2010). What is method variance and how can we cope with it? Organizational Research Methods, 13(3), 407–420. doi:10.1177/1094428109360993 Brennan, R. L. (2006). Perspectives on the evolution and future of educational measurement. In R. L. 
Brennan (Ed.), Educational measurement (4th ed., pp. 1–16). Westport, CT: Praeger. Brookhart, S. M., & McMillan, J. H. (Eds.) (2020). Classroom assessment and educational measurement. New York: Routledge.
Brown, G. A., Bull, J., & Pendlebury, M. (2013). Assessing student learning in higher education. London: Routledge. Brown, T. A, DiNardo, P. A., & Barlow, D. H. (1994). Anxiety disorders interview schedule for DSM-IV. San Antonio, TX: Psychological Corporation. Camilli, G. (2006). Test fairness. In R. L. Brennan (Ed.), Educational measurement (4th ed., pp. 221–256). Westport, CT: Praeger. Campbell, D. T., & Fiske, D. W. (1959). Convergent and discriminant validation by the multitrait-multimethod matrix. Psychological Bulletin, 56, 81–105 Castillo, M., & Padilla, J. L. (2013). How cognitive interviewing can provide validity evidence of the response processes to scale items. Social Indicators Research, 114(3), 963–975. doi:10.1007/s11205-012-0184-8. Cattell, R. B. (1946). The description and measurement of personality. Oxford: World Book. Cattell, H. E. P., & Mead, A. D. (2008). The Sixteen Personality Factor Questionnaire (16PF). In G. J. Boyle, G. Matthews, & D. H. Saklofske (Eds.), The Sage handbook of personality theory and assessment (Vol. 2): Personality measurement and testing. Thousand Oaks, CA: Sage. Centers for Disease Control and Prevention [CDC]. (1999). Framework for program evaluation in public health. Morbidity and Mortality Weekly Report, 48(No. RR-11). Atlanta, GA: Author. Chingos, M. M. (2012). Strength in numbers: State spending on K–12 assessment systems. Washington, DC: Brown Center on Educational Policy. Cizek, G. J. (1997). Learning, achievement, and assessment: Constructs at a crossroads. In G. D. Phye (Ed.), Handbook of classroom assessment: Learning, achievement, and adjustment (pp. 1–33). New York: Academic. Cizek, G. J. (1999). Cheating on tests: How to do it, detect it, and prevent it. Mahwah, NJ: Lawrence Erlbaum. Cizek, G. J. (2001). More unintended consequences of high-stakes testing. Educational Measurement: Issues and Practice, 20(4), 19–27. Cizek, G. J. (2009). Reliability and validity of information about student achievement: Comparing the contexts of large scale and classroom testing. Theory Into Practice, 48, 63–71. Cizek, G. J. (2012). Defining and distinguishing validity: Interpretations of score meaning and justifications of test use. Psychological Methods, 17(1), 31–43. Cizek, G. J. (2016). Validating test score meaning and defending test score use: different aims, different methods. Assessment in Education: Principles, Policy & Practice, 23(2), 212–225. doi:10.1080/0969594X.2015.1063479 Cizek, G. J. (2016). Progress on validity: The glass half full, the work half done. Assessment in Education: Principles, Policy & Practice, 23(2), 304–308. doi:10.1080/0969594X.2016.1156642 Cizek, G. J., Bowen, D., & Church, K. (2010, May). Sources of validity evidence for educational and psychological tests: A follow-up study. Paper presented at the annual meeting of the National Council on Measurement in Education, Denver, CO. Cizek, G. J., Kosh, A., & Toutkoushian, E. (2018). Gathering and evaluating validity evidence: The Generalized Assessment Alignment Tool. Journal of Educational Measurement, 55, 477–512. Cizek, G. J., Rosenberg, S. L., & Koons, H. H. (2008). Sources of validity evidence for educational and psychological tests. Educational and Psychological Measurement, 68, 397–412. Cizek, G. J., & Wollack, J. A. (2017). Exploring cheating on tests: The contexts, concerns, and challenges. In G. J. Cizek & J. A. Wollack (Eds.), Quantitative methods for detecting cheating on tests (pp. 3–20). New York: Routledge. Clauser, B. E., & Mazor, K. M. (1998). 
Using statistical procedures to identify differentially functioning test items. Educational Measurement: Issues and Practice, 17(1), 31–44. doi:10.1111/j.1745–3992.1998.tb00619.x Cohen, J. (1988). Statistical power analysis for the behavioral sciences. Hillsdale, NJ: Lawrence Erlbaum. Collins, D. (2003). Pretesting survey instruments: An overview of cognitive methods. Quality of Life Research, 12, 229– 238. Cone, J. D., & Foster, S. L. (1991). Training in measurement: Always the bridesmaid. American Psychologist, 46, 653–654. Cook, D. A., Brydges, R., Zendejas, B., Hamstra, S. J., & Hatala, R. (2013). Technology-enhanced simulation to assess health professionals: A systematic review of validity evidence, research methods, and reporting quality. Academic Medicine, 88(6), 872–883. Cowles, M., & Davis, C. (1982). On the origins of the .05 level of statistical significance. American Psychologist, 37(5), 553–558. Crocker, L., & Algina, J. (1986). Introduction to classical and modern test theory. New York: Holt, Rinehart and Winston. Cronbach, L. J. (1949). Essentials of psychological testing. Oxford: Harper.
Cronbach, L. J. (1971). Test validation. In R. L. Thorndike (Ed.), Educational measurement (2nd ed., pp. 443–507). Washington, DC: American Council on Education. Cronbach, L. J. (1980). Validity on parole: How can we go straight? New directions for testing and measurement: Measuring achievement over a decade. In Proceedings of the 1979 ETS Invitational Conference (pp. 99–108). San Francisco, CA: Jossey-Bass. Cronbach, L. J. (1988). Five perspectives on validity argument. In H. Wainer & H. Braun (Eds.), Test validity (pp. 2–17). Hillsdale, NJ: Erlbaum. Cronbach, L. J. (1989). Construct validation after 30 years. In R. L. Linn (Ed.), Intelligence: Measurement theory and public policy (pp. 147–171). Urbana: University of Illinois Press. Debra P. v. Turlington. (1979). 474 F. Supp. 244. de Klerk, S., van Noord, S., & van Ommering, C. J. (2019). The theory and practice of educational data forensics. In B. P. Veldkamp & C. Sluijter (Eds.), Theoretical and practical advances in computer-based educational measurement (pp. 381–399). Cham: Springer. doi:10.1007/978-3-030-18480-3_20 Deng, N., & Hambleton, R. K. (2007). Twenty software packages for assessing test dimensionality. Amherst, MA: Center for Assessment, University of Massachusetts-Amherst. Doerfel, M. L. (1998). What constitutes semantic network analysis? A comparison of research and methodologies. Connections, 21(2), 16–26. Donaldson, S. I., Christie, C. A., & Mark, M. M. (Eds.). (2009). What counts as credible evidence in applied research and evaluation practice? Thousand Oaks, CA: Sage. Ebel, R. L. (1961). Must all tests be valid? American Psychologist, 16, 640–647. Ercikan, K., Arim, R., & Law, D., Domene, J., Gagnon, F., & Lacroix, S. (2010). Application of think aloud protocols for examining and confirming sources of differential item functioning identified by expert review. Educational Measurement: Issues and Practice, 29(2), 24–35. doi:10.1111/j.1745-3992.2010.00173.x Ercikan, K., & Pellegrino, J. W. (Eds.) (2017). Validation of score meaning for the next generation of assessments: The use of response processes. New York: Routledge. Ericsson, K. A. (2006). Protocol analysis and expert thought: Concurrent verbalizations of thinking during experts’ performance on representative tasks. In K. A. Ericsson, N. Charness, P. J. Feltovich, & R. Hoffman. (Eds.), Handbook of expertise and expert performance (pp. 223–241). New York: Cambridge University Press. Ericsson K. A., & Simon H. A. (1993). Protocol analysis: Verbal reports as data. Cambridge, MA: MIT Press. Evans, J. St. B. T. (1989). Bias in human reasoning: Causes and consequences. Hillsdale, NJ: Lawrence Erlbaum. Every Student Succeeds Act. (2015). P. L. 114–95, 20 U.S.C. 28 § 1001 et seq. Fitzpatrick, A. R. (1983). The meaning of content validity. Applied Psychological Measurement, 7(1), 3–13. Foster, T. E., Ardoin, S. P., & Binder, K. S. (2018). Reliability and validity of eye movement measures of children’s reading. Reading Research Quarterly, 53(1), 71–89. doi:10.1002/rrq.182 Frisbie, D. A. (2005). Measurement 101: Some fundamentals revisited. Educational Measurement: Issues and Practice, 24(3), 21-28. Geisinger, K. F. (1992). The metamorphosis of test validation. Educational Psychologist, 27, 197–222. Gessaroli, M. E., & De Champlain, A. F. (2005). Test dimensionality: Assessment of. In B. S. Everitt & D. C. Howell (Eds.), Encyclopedia of statistics in behavioral science (pp. 2014–2021). Chichester: John Wiley & Sons. Gewertz, C. (2013, August 7). 
States ponder price tag of common tests. Education Week, 32(37), 20–22. Glass, G. V., McGaw, B., & Smith, M. L. (1981). Meta-analysis in social research. Beverly Hills, CA: Sage. Guilford, J. P. (1946). New standards for test evaluation. Educational and Psychological Measurement, 6, 427–439 Guion, R. M. (1980). On trinitarian doctrines of validity. Professional Psychology, 11, 385–398. Haertel, E. H. (2006). Reliability. In R. L. Brennan (Ed.), Educational measurement (4th ed., pp. 65–110). Westport, CT: Praeger. Hattie, J. (1985). Methodology review: Assessing unidimensionality of tests and items. Applied Psychological Measurement, 9, 139–164. Helbig, H. (2006). Knowledge representation and the semantics of natural language. New York: Springer. Heuchert, J. P., & McNair, D. M. (2012). Profile of mood states (2nd ed.). North Tonawanda, NY: Multi-Health Systems. Heywood, J. (2000). Assessment in higher education. London: Jessica Kingsley. International Reading Association & National Council of Teachers of English. (1994). Standards for the assessment of reading and writing. Newark, DE: International Reading Association.
Jonson, J. L., & Plake, B. S. (1998). A historical comparison of validity standards and validity practices. Educational and Psychological Measurement 58, 736–753. Justia. (2019). Evidentiary standards and burdens of proof. Retrieved July 19, 2019 from www.justia.com/trialslitigation/lawsuits-and-the-court-process/evidentiary- standards-and-burdens-of-proof/ Kane, M. T. (1992). An argument-based approach to validation. Psychological Bulletin, 112, 527–535. doi:10.1037/00332909.112.3.527 Kane, M. T. (2001). Current concerns in validity theory. Journal of Educational Measurement, 38, 319–342. Kane, M. T. (2004). Certification testing as an illustration of argument-based validation. Measurement: Interdisciplinary Research and Perspectives, 2(3), 135–170. doi:10.1207/s15366359mea0203_1 Kane, M. T. (2006a). Validation. In R. Brennan (Ed.), Educational measurement (4th ed., pp. 17–64). Westport, CT: Praeger. Kane, M. T. (2006b). Content-related validity evidence in test development. In S. M. Downing & T. M. Haladyna (Eds.), Handbook of test development (pp. 131–153). Mahwah, NJ: Lawrence Erlbaum. Kane, M. T. (2009). Validating the interpretations and uses of test scores. In R. W. Lissitz (Ed.), The concept of validity: Revisions, new directions, and applications (pp. 39–64). Charlotte, NC: Information Age. Kane, M. T. (2012). All validity is construct validity. Or is it? Measurement, 10, 66–70. doi:10.1080/15366367.2012.681977 Kane, M. T. (2015). Explicating validity. Assessment in Education: Principles, Policy & Practice, 23, 198–211. doi:10.1080/0969594X.2015.1060192 Kane, M., & Bridgeman, B. (2018, April). History of validity theory 1950 to the present. Paper presented at the annual meeting of the National Council on Measurement in Education, New York. Kaplan, A. (1964). The conduct of inquiry. San Francisco, CA: Chandler. Kitchin, R. (2001). Cognitive maps. In N. J. Smelzer & P. B. Baltes (Eds.), International encyclopedia of the social & behavioral sciences (pp. 2120–2124). New York: Elsevier. doi:10.1016/B0-08-043076-7/02531-6 Krueger, R. A. (1994). Focus groups: A practical guide for applied research. Thousand Oaks, CA: Sage. Krueger, R.A. (1998). Moderating focus groups. Thousand Oaks, CA: Sage. Kuhn, T. S. (1962). The structure of scientific revolutions. Chicago, IL: University of Chicago Press. Leighton, J. P. (2017). Using think-aloud interviews and cognitive labs in educational research. London: Oxford University Press. Levin, H. M. (1987). Cost-benefit and cost-effectiveness analyses. New Directions for Program Evaluation, 34, 83–99. doi:10.1002/ev.1454 Li, Z., Banerjee J., & Zumbo B. D. (2017). Response time data as validity evidence: Has it lived up to its promise and, if not, what would it take to do so? In B. Zumbo & A. Hubley (Eds.), Understanding and investigating response processes in validation research (pp. 159–178). Cham: Springer. Linn, R. L. (Ed.) (1989). Educational measurement (3rd ed.). New York: American Council on Education, Macmillan. Linn, R. L. (1997). Evaluating the validity of assessments: The consequences of use. Educational Measurement: Issues and Practice, 16(2), 14–16. Lissitz, R. W., & Samuelson, K. (2007). A suggested change in terminology and emphasis regarding validity and education. Educational Researcher, 36(8), 437–448. doi:10.3102/0013189X07311286 Liu, X. L., Primoli, V., & Plackner, C. (2013). Utilization of response time in data forensics of K–12 computer-based assessment. 
Paper presented at the annual conference on the statistical detection of potential test fraud, Madison, WI. Loevinger, J. (1957). Objective tests as instruments of psychological theory. Psychological Reports, 3, 635–694. Longino, H. E. (1990). Science as social knowledge: Values and objectivity in scientific inquiry. Princeton, NJ: Princeton University Press. Longino, H. E. (2002). The fate of knowledge. Princeton, NJ: Princeton University Press. Luecht, R. M. (2013). Assessment engineering task model maps, task models and templates as a new way to develop and implement test specifications. Journal of Applied Testing Technology, 14(1). Lyons-Thomas, J. (2014). Using think aloud protocols in the validity investigation of an assessment of complex thinking [PhD Dissertation]. Vancouver: University of British Columbia. Mantel, N., & Haenszel, W. (1959). Statistical aspects of the analysis of data from retrospective studies of disease. Journal of the National Cancer Institute, 22, 719–748. Markus, K. A. (1998). Science, measurement, and validity: Is completion of Samuel Messick’s synthesis possible? Social Indicators Research, 45(1), 7–34. Mazor, K. M., Canavan, C., Farrell, M., Margolis, M. J., & Clauser, B. E. (2008). Collecting validity evidence for an
assessment of professionalism: Findings from think-aloud interviews. Academic Medicine, 83(10), S9–S12. doi:10.1097/ACM.0b013e318183e329 McDaniel, M. A., & Nguyen, N. T. (2001) Situational judgment tests: A review of practice and constructs assessed. International Journal of Selection and Assessment, 9(1/2), 103–113. McDonnell, L. M. (1995). Opportunity to learn as a research concept and a policy instrument. Educational Evaluation and Policy Analysis, 17(3), 305–322. doi:10.3102/01623737017003305 Mehrens, W. A. (1997). The consequences of consequential validity. Educational Measurement: Issues and Practice, 16(2), 16–18. Mehrens, W. A., & Cizek, G. J. (2001). Standard setting for decision making: Classifications, consequences, and the common good. In G. J. Cizek (Ed.), Setting performance standards: Concepts, methods, and perspectives (pp. 477–485). New York: Lawrence Erlbaum. Messick, S. (1975). The standard problem: Meaning and values in measurement and evaluation. American Psychologist, 30, 955–966. Messick, S. (1988). Assessing the meaning and consequences of measurement. In H. Wainer & H. I. Braun (Eds.), Test validity (pp. 33–46). Hillsdale, NJ: Erlbaum. Messick, S. (1989). Validity. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp. 13–103). New York: Macmillan. Messick, S. (1995). Validity of psychological assessment: Validation of inferences from persons’ responses and performances as scientific inquiry into score meaning. American Psychologist, 50, 741–749. Messick, S. (1998). Test validity: A matter of consequence. Social Indicators Research, 45(1), 35–44. Mishan, E. J., & Quah, E. (2007). Cost-benefit analysis. London: Routledge. doi:10.4324/9780203695678 Mislevy, R. J., Almond, R. G., & Lukas, J. F. (2004). A brief introduction to evidence-centered design (CSE Technical Report 632). Los Angeles, CA: University of California-Los Angeles, National Center for Research on Evaluation, Standards, and Student Testing (CRESST). Mislevy, R. J., & Haertel, G. D. (2006). Implications of evidence-centered design for educational testing. Educational Measurement: Issues and Practice, 25(4), 6–20. Mislevy, R. J., & Riconscente, M. (2005). Evidence-centered assessment design: Layers, structures, and terminology (PADI Technical Report 9). Menlo Park, CA: SRI International. Mislevy, R. J., Steinberg, L. S., & Almond, R. G. (2003). On the structure of educational assessments. Measurement: Interdisciplinary Research and Perspectives, 1, 3–62. Moss, P. A. (1998). The role of consequences in validity theory. Educational Measurement: Issues and Practice, 17(2), 6– 12. Nandakumar, R., Yu, F., Li, H., & Stout. W. (1998). Assessing unidimensionality of polytomous data. Applied Psychological Measurement, 22, 99–115. National Council on Measurement in Education. (2018, July 26). National Council on Measurement in Education (NCME) position statement on theories of action for testing programs. Philadelphia, PA: Author. National Research Council. (1999). Testing, teaching, and learning: A guide for states and school districts. Washington, DC: The National Academies Press. doi:10.17226/9609. Newton, P. E. & Shaw, S. D. (2012, April). We need to talk about validity. Paper presented at the annual meeting of the National Council on Measurement in Education, Vancouver, BC. Nickerson, R. S. (1998). Confirmation bias: A ubiquitous phenomenon in many guises. Review of General Psychology, 2(2), 175–220. No Child Left Behind Act. (2002). P. L. 107-110, 20 U.S.C. 6301. Novak, J. D. & Cañas, A. J. (2008). 
The theory underlying concept maps and how to construct and use them [Florida Institute for Human and Machine Cognition Technical Report 2006-01, Rev 01-2008]. Retrieved from http://cmap.ihmc.us/docs/pdf/TheoryUnderlyingConceptMaps.pdf O’Leary, T. M., Hattie, J. A. C., & Griffin, P. (2017). Actual interpretations and use of scores as an aspect of validity. Educational Measurement: Issues and Practice, 36(2), 16–23. Pace, C. R. (1972). Review of the Comparative Guidance and Placement Program. In O. K. Buros (Ed.), The seventh mental measurements yearbook (pp. 1026–1028). Highland Park, NJ: Gryphon. Padilla, J., & Benítez, I. (2014). Validity evidence based on response processes. Psicothema, 26(1), 136–144. doi:10.7334/psicothema2013.259 Padilla, J., & Leighton, J. A. (2017). Cognitive interviewing and think aloud methods. In B. Zumbo & A. Hubley (Eds.),
Understanding and investigating response processes in validation research (pp. 211–228). Cham, Switzerland: Springer. Patton, M. Q. (2008). Utilization-focused evaluation. Thousand Oaks, CA: Sage. Phillips, S. E., & Camara, W. J. (2006). Legal and ethical issues. In R. L. Brennan (Ed.), Educational measurement (4th ed., pp. 733–755). Westport, CT: Praeger. Plotnick, E. (1997). Concept mapping: A graphical system for understanding the relationship between concepts [Research Report No. ED407938]. Washington, DC: ERIC Document Reproduction Service. Popham, W. J. (1997). Consequential validity: Right concern, wrong concept. Educational Measurement: Issues and Practice, 16(2), 9–13. Popham, W. J. (2017). Classroom assessment: What teachers need to know (8th ed.). Boston, MA: Pearson Higher Education. Radloff, L. S. (1977). The CES-D scale: A self-report depression scale for research in the general population. Applied Psychological Measurement, 1, 385–401. Rahn, K. (2008, November 11). College Entrance Exam Today. The Korean Times. Retrieved from www.koreatimes.co.kr/www/news/nation/2008/11/117_34287.html Rayner, K. (1998). Eye movements in reading and information processing: 20 years of research. Psychological Bulletin, 124(3), 372–422. Reckase, M. D. (1998). Consequential validity from the test developer’s perspective. Educational Measurement: Issues and Practice, 17(2), 13–16. Reckase, M. D. (2009). Multidimensional item response theory. New York: Springer-Verlag. Robinson, R. (1993). Cost-utility analysis. British Medical Journal, 307, 859–862. doi:10.1136/bmj.307.6908.859 Sackett, P. R., & Yang, H. (2000). Correction for range restriction: An expanded typology. Journal of Applied Psychology, 85(1), 112–118. doi:10.1037/0021-9010.85.1.112 Salamanca, S. L. C. (2017). Consensus about the concept of validity: Final results report. Bogotà: Universidad Nacional de Colombia. Scriven, M. (1995). The logic of evaluation and evaluation practice. New Directions for Evaluation, 68, 49–70. doi:10.1002/ev.1019 Shadish, W. R., Cook, T. D., & Leviton, L. C. (1991). Foundations of program evaluation: Theories of practices. Newbury Park, CA: Sage. Shavelson, R. J., & Webb, N. M. (1991). Generalizability theory: A primer. Newbury Park, CA: Sage. Shepard, L. A. (1993). Evaluating test validity. Review of Research in Education, 19, 405–450. Shepard, L. A. (1997). The centrality of test use and consequences for test validity. Educational Measurement: Issues and Practice, 16(2), 5–8, 13, 24. Sinharay, S., Puhan, G., & Haberman, S. J. (2011). An NCME instructional module on subscores. Educational Measurement: Issues and Practice, 30(3), 29–40. doi:10.1111/j.1745-3992.2011.00208.x Sireci, S. G. (2015). A theory of action for validation. In H. Jiao & R. Lissitz (Eds.), The next generation of testing: Common core standards, Smarter-Balanced, PARCC, and the nationwide testing movement (pp. 251–269). Charlotte, NC: Information Age. Sireci, S. G., & Talento-Miller, E. (2006). Evaluating the predictive validity of Graduate Management Admission Test Scores. Educational and Psychological Measurement, 66, 305–317. Smithson, J. (2007). Using focus groups in social research. In P. Alasuurtari, L. Bickman & J. Brannen (Eds.), The Sage handbook of social research methods (pp. 356–371). Thousand Oaks, CA: Sage. Society for Industrial and Organizational Psychology [SIOP]. (2018). Principles for the validation and use of personnel selection procedures. Washington, DC: American Psychological Association. 
South Korean College Admission Test. (2008, November 14). Retrieved from https://yeinjee.com/south-korea-collegeadmission-test/ Sowa, J. F. (2000). Knowledge representation: logical, philosophical, and computational foundations. Pacific Grove, PA: Brooks/Cole. Spies, R. A., & Plake, B. S. (Eds.). (2005). The sixteenth mental measurements yearbook. Lincoln, NE: Buros Institute of Mental Measurements. Stake, R. E. (2004). Standards-based and responsive evaluation. Thousand Oaks, CA: Sage. Stout, W., Nandakumar, R., Junker, B., Chang, H., & Steidinger, D. (1992) DIMTEST: A Fortran program for assessing dimensionality of binary item responses. Applied Psychological Measurement, 16(3), 236–236. doi:10.1177/
014662169201600303
Stufflebeam, D. L., & Zhang, G. (2017). The CIPP evaluation model. New York: Guilford.
Sudman, S., Bradburn, N. M., & Schwartz, N. (1996). Thinking about answers: The application of cognitive processes to survey methodology. San Francisco, CA: Jossey-Bass.
Suvorov, R. (2015). The use of eye tracking in research on video-based second language (L2) listening assessment: A comparison of context videos and content videos. Language Testing, 32, 463–483. doi:10.1177/0265532214562099
Svetina, D., & Levy, R. (2012). An overview of software for conducting dimensionality assessment in multidimensional models. Applied Psychological Measurement, 36(8), 659–669. doi:10.1177/0146621612454593
Tabachnick, B. G., & Fidell, L. S. (2019). Using multivariate statistics. New York: Pearson Education.
Taylor, D. D., & Sireci, S. G. (2019, April). Evaluating social consequences in testing: The disconnect between theory and practice. Paper presented at the annual meeting of the National Council on Measurement in Education, Toronto, ON.
Tenopyr, M. L. (1996, April). Construct-consequences confusion. Paper presented at the annual meeting of the Society for Industrial and Organizational Psychology, San Diego, CA.
Thorndike, R. L. (1949). Personnel selection: Test and measurement techniques. New York: Wiley.
Traub, R. E. (1994). Reliability for the social sciences. Thousand Oaks, CA: Sage.
Uniform Guidelines on Employee Selection Procedures. (1985). 29 C.F.R. 1607.
U.S. Department of Education. (2007, December 21). Peer review guidance: Information and examples for meeting requirements of the No Child Left Behind Act of 2001. Washington, DC: Author.
U.S. Department of Education. (2018, September 24). A state’s guide to the U.S. Department of Education’s assessment peer review process. Washington, DC: Author.
U.S. Department of Justice. (2014). ADA requirements. Washington, DC: Author. Retrieved from www.ada.gov/regs2014/testing_accommodations.pdf
Wang, X., & Sireci, S. G. (2013, April). Evaluating the cognitive levels measured by test items using item response time. Paper presented at the annual meeting of the American Educational Research Association, San Francisco, CA.
Webb, N. L. (1997). Criteria for alignment of expectations and assessments in mathematics and science education [Research Monograph No. 8]. Washington, DC: Council of Chief State School Officers.
Wechsler, D. (2008). Wechsler Adult Intelligence Scale (4th ed.). Bloomington, MN: Pearson Assessments.
Wechsler, D. (2014). Wechsler Intelligence Scale for Children (5th ed.). Bloomington, MN: Pearson Assessments.
Wiberg, M., & Sundström, A. (2009). A comparison of two approaches to correction of restriction of range in correlation analysis. Practical Assessment, Research, & Evaluation, 14(5).
Wise, S. L. (2014). The utility of adaptive testing in addressing the problem of unmotivated test takers. Journal of Computerized Adaptive Testing, 2, 1–17. doi:10.7333/jcat.v2i0.30
Wise, S. L. (2017). Rapid-guessing behavior: Its identification, interpretation, and implications. Educational Measurement: Issues and Practice, 36(4), 52–61. doi:10.1111/emip.12165
Wollack, J. A., & Cizek, G. J. (2017). Security issues in professional certification/licensure testing. In S. Davis-Becker & C. W. Buckendahl (Eds.), Testing in the professions: Credentialing policies and practices (pp. 178–209). New York: Routledge.
Wollack, J. A., & Fremer, J. J. (2013). Introduction: The test security threat. In J. A. Wollack & J. J. Fremer (Eds.), Handbook of test security (pp. 1–13). New York: Routledge.
Wonderlic, Inc. (2019). Cognitive ability assessment. Retrieved December 12, 2019, from https://wonderlic.com/wonscore/cognitive-ability/
Wright, B. D. (1998). Introduction to the Rasch model [Video]. Chicago, IL: MESA Press.
Yarbrough, D. B., Shulha, L. M., Hopson, R. K., & Caruthers, F. A. (2011). The program evaluation standards: A guide for evaluators and evaluation users (3rd ed.). Thousand Oaks, CA: Sage.
Zenisky, A. L., & Baldwin, P. (2006). Using item response time data in test development and validation: Research with beginning computer users [Report No. 593]. Amherst, MA: Center for Educational Assessment, University of Massachusetts, School of Education.
Zumbo, B. D. (2005). Structural equation modeling and test validation. In B. Everitt & D. C. Howell (Eds.), Encyclopedia of statistics in behavioral science (pp. 1951–1958). Chichester: John Wiley & Sons.
Zumbo, B. D. (2007). Validity: Foundational issues and statistical methodology. In C. R. Rao & S. Sinharay (Eds.), Psychometrics (pp. 45–79). Amsterdam: Elsevier Science.
Zumbo, B. D. (2009). Validity as contextualized and pragmatic explanation and its implications for validation practice. In R. W. Lissitz (Ed.), The concept of validity: Revisions, new directions and applications (pp. 65–82). Charlotte, NC: Information Age.
Zumbo, B. D., & Hubley, A. M. (Eds.). (2017). Understanding and investigating response processes in validation research. Cham: Springer.
Zwick, R. (2012). A review of ETS differential item functioning assessment procedures: Flagging rules, minimum sample size requirements, and criterion refinement [Research Report ETS RR-12-08]. Princeton, NJ: Educational Testing Service.
Zwick, R. (2017). Who gets in? Strategies for fair and effective college admissions. Cambridge, MA: Harvard University Press.
Zwick, R., & Dorans, N. J. (2016). Philosophical perspectives on fairness in educational assessment. In N. J. Dorans & L. L. Cook (Eds.), Fairness in educational assessment and measurement (pp. 267–281). New York: Routledge.
INDEX
Note: References to figures are given in italics; those in bold refer to tables.
accommodations 4–6, 5, 168–169
ACT scores 55–56, 91, 167
admissions testing 55–56, 58, 59, 60, 91, 117
AERA (American Educational Research Association) 20, 82, 82
Algina, J. 10
alignment studies 30, 31
Almond, R. G. 33
Americans with Disabilities Act 168–169
Anxiety Disorders Interview Schedule 62
APA (American Psychological Association) 20, 82, 82
argument-based validation 31–32, 32
Assembly Model 33
assessment engineering 31, 31, 33
Assessment & Evaluation in Higher Education 167
assessments 6–7, 166–167, 169
Beck Depression Inventory-II (BDI-II) 61
Benítez, I. 47
Bennett, R. E. 169–170
bias 25, 32, 40, 87, 101, 106, 132, 150–151
Borsboom, D. 19, 79, 109
Bradburn, N. M. 49
Brennan, R. L. 80
Bridgeman, B. 19
bullying 2–3, 4, 34–35, 38
Camilli, G. 127–128, 132, 133
Campbell, D. T. 61, 62, 63, 64
causal inferences 76–77, 169, 170–172
CDC (Centers for Disease Control and Prevention) 112–113, 113
Centers for Epidemiologic Studies Depression Scale 42
certification 31, 44, 116, 117, 119, 125; see also credentialing; licensure
CFA (confirmatory factor analysis) 37, 41, 42–43
cheating 50, 106, 116, 117, 168
classical test theory 43, 94, 95–96
classification errors 117–118, 119, 139, 148, 160
classroom assessment 166–167
Clauser, B. E. 40
coefficient alpha 36, 173
cognitive interviewing 48–49, 48, 49
cognitive mapping (concept maps) 48, 51–52, 52, 53
Cohen, J. 145
College Scholastic Ability Test (CSAT) 72–73
comparative value 145, 153, 154
concurrent validity 23, 24, 56–58, 81
confirmation bias 25, 150–151
consequences of testing 23, 67, 68–89, 82, 106; and evidence for test score meaning 67, 106; and evidence for test score use 67, 114, 115–119, 115; intended 67, 70, 116, 119, 135, 155, 157, 159; unintended 67, 70, 76, 78, 116, 117, 119, 135, 137, 139, 142, 155, 157, 159, 161, 170–171
construct-irrelevant variation (CIV) 99–101, 100, 102, 142
construct misspecification 102–103, 142
construct-relevant variation (CRV) 99, 100
construct underrepresentation 69, 102–103
construct validity 14, 16, 23–24, 27, 30, 67, 68, 70, 135, 151
constructs 10–16
content, evidence based on test 27, 28–33, 29, 31, 35, 44, 61, 103, 105, 132, 149–150
content validity 23, 24, 28, 85
convergent evidence 61–65
Cook, D. A. 81
correlation 36, 38, 41, 42, 55, 56–57, 59, 61, 62–65
costs of testing 114, 115, 119–126, 122, 123, 124, 157; cost-benefit analysis 114, 115, 120–122, 122, 157; cost-effectiveness analysis 114, 115, 120, 122–123, 124, 157; cost-utility analysis 114, 115, 120, 123–124, 124, 157
covariation 15, 36, 77, 138
Cowles, M. 145
creativity 65–66
credentialing 21, 116, 117, 119, 122, 125, 132, 166, 168; see also certification; licensure
criterion variable 55–60, 60
Crocker, L. 10
Cronbach, L. J. 21, 23, 25, 47, 52, 63, 151
data forensics 168
Davis, C. 145
De Champlain, A. F. 34
decision making 84, 112, 121, 126, 134, 135, 137, 140, 145
defensible testing 1, 15, 68, 88, 90–107, 108–143, 111, 141, 144, 156, 164–173
Deng, N. 39
dental licensure examination 99–100, 100, 103
depression 11, 30, 42, 56, 61–65, 63, 91, 99, 100, 101, 102, 159
development of tests 22, 34–35, 40, 47–48, 54, 73, 75–76, 79, 92, 102, 104–105, 106, 108–109, 134–135, 137–138, 156; and construct-irrelevant variation 99, 100, 101; costs of 119–120, 125; and evidence based on test content 28, 30, 31, 32–33; and score meaning 13–14, 44, 75, 96, 127, 129, 134, 142, 156; and the Standards 20–21, 21, 105; and validity 4–6, 5, 50, 75, 81, 92, 104–105, 151
DIF (differential item functioning) 40, 42, 43–44, 48, 127
differential test functioning 101, 127
dimensionality 34–43, 37, 38, 39, 98
DIMTEST 39
disconfirming evidence 25, 26, 37–38, 65, 97, 150, 151–152, 156, 161
discriminant evidence 61–65
disparate impact 115, 132–133
driver’s license test 17, 57, 172
due notice 115, 130–132
Ebel, R. 1, 80
ECD (evidence-centered design) 31–33
Educational Measurement 79, 84, 127–128
educational testing 3, 6–7, 16–17, 21, 31, 44, 66, 70, 72–74, 85, 91–92, 91, 117–118, 119–120, 123–124, 129–131, 160, 167, 171–172; see also admissions testing; K–12 education; mathematics testing; Standards for Educational and Psychological Testing
EFA (exploratory factor analysis) 37, 39, 42–43
Ercikan, K. 44
Ericsson, K. A. 49
ESSA (Every Student Succeeds Act, 2015) 70, 167
Evans, J. St. B. T. 150
evidence for justifying test score use 112, 114–133, 141; based on alternatives to testing 114, 115, 126–127; based on fairness in testing 114, 115, 127–133, 143; based on test consequences 67, 114, 115–119, 115; based on test costs 114, 115, 119–126, 122, 123, 124, 157; see also justification of test score use; use of test scores
evidence for meaning of test scores 24, 27–67, 97, 103–106, 135, 141; based on analysis of hypothesized relationships among variables 104, 105, 138, 149; based on internal structure 27–28, 34–44, 37, 38, 39, 41, 103, 104; based on relationships to other variables 27–28, 54–66, 100, 103–104; based on response processes 27, 44–54, 48, 100, 103, 105; based on test consequences 67; based on test content 27, 28–33, 29, 31, 35, 44, 61, 103, 105, 132, 149–150; based on test development and administration 105, 138; convergent 61–65; disconfirming 25, 26, 37–38, 65, 97, 150, 151–152, 156, 161; discriminant 61–65; potentially disconfirming 25, 126, 151–152; weighing the 144–163, 154, 155; see also meaning of test scores
Evidence Model 33
eye tracking 48, 50–54
factor analytic methods 37–40, 37, 38, 39, 42, 104
factor loadings 38, 42
fairness 15, 21, 31, 105, 110, 112, 114, 168
fairness, evidence for the use of test scores 114, 115, 127–133, 143; disparate impact 115, 132–133; due notice 115, 130–132; opportunity to learn (OTL) 115, 129–130, 131, 143, 165; stakeholder input 78, 79–80, 113–114, 113, 115, 119, 120, 128–129, 155, 157–158, 162
false negative classifications 117–118, 119, 139, 148, 160
false positive classifications 117–118, 119, 139, 148, 160
Fidell, L. S. 40
Fisher, R. 145
Fiske, D. W. 61, 62, 63, 64
Fitzpatrick, A. R. 28
Florida 130–131
focus groups 3, 47, 48
format of tests 3–4, 12
Fremer, J. J. 168
Frisbie, D. A. 81
FYGPA (Freshman Year GPA) 55–56, 58
Geisinger, K. F. 19, 164
generalizability theory 43
Gessaroli, M. E. 34
Glass, G. V. 145, 153
golf ball packing test 86–87, 102–103
group validity 167–169
Guion, R. M. 23, 85–86, 102
Haertel, E. 144
Haertel, G. D. 33
Hambleton, R. K. 39
Hattie, J. 35
high inference leaps 8–9
Hubley, A. 44
IEP (Individualized Educational Planning) 7
individual validity 167–169
inference 1, 7–10, 12, 13, 14–15, 16, 21–23, 22, 25–28, 29, 71–72, 74, 75–76, 84–86, 92–96, 94, 95, 164; tentative nature of 9–10, 13; see also meaning of test scores
inferential errors 147, 150–151, 154
inferential leaps 8–9, 148
instructional quality 38–39, 39, 41–42, 41
intelligence tests 126, 167
intended consequences of test use 67, 70, 116, 119, 135, 155, 157, 159
internal consistency analysis 36, 104
interviews 4, 12, 48–49, 48, 49, 126
IRT (item response theory) 8, 40, 43
Jonson, J. L. 19
justification of test score use 108–143, 111, 115, 164–165, 166, 170; and theories of action 170, 171; and validation of test score meaning 91, 91, 107, 109, 110, 110, 111, 114, 133–140, 134, 141, 142; weighing the evidence 144–146, 155–163, 155
K–12 education 48, 73, 81–82, 98, 119–120, 167
Kane, M. 15, 16, 19, 25, 32, 78, 79–80, 81, 83, 88, 105, 108, 128, 129, 137, 144, 151
Koons, H. H. 81
Korea 72–73
KR-20 (Kuder Richardson Formula 20) 36, 173
leaps, inferential 8–9, 148
Levy, R. 39
licensure 2, 3, 4, 30–31, 34–35, 43, 44, 99–100, 100, 103, 116, 117, 118, 119, 132, 148, 172; see also certification; credentialing
Linn, R. L. 83
LISREL 42
Lissitz, R. W. 19
Loevinger, J. 24
low inference leaps 8–9
Luecht, R. M. 33
Lukas, J. F. 33
Mantel-Haenszel procedure 40
mathematics testing 9, 28, 35, 45–46, 46, 48, 52–54, 74, 98, 123–124, 124, 129–130, 149, 170, 172
Mazor, K. M. 40, 48
McDonnell, L. M. 129–130
meaning of test scores 13–14, 16–17, 111, 134, 135, 140, 141, 142, 143, 164; conflation with test score use 15, 85, 88, 91, 106–107, 109, 165, 170, 173; see also evidence for meaning of test scores; inference; validation
measurement error 42, 117, 142, 155
Mehrens, W. A. 74, 158
Mellenbergh, G. J. 79–80, 109
Messick, S. 14, 16, 21, 23, 24, 25, 26, 27, 68–70, 71, 74, 78, 81, 84, 85, 97, 126, 162, 171
method variance 64
Mislevy, R. J. 33
MMY (Mental Measurements Yearbook) 81
Moss, P. A. 78
MTMM (“multi-trait, multi-method”) 62–65, 63
multidimensionality 34, 35, 36, 37–38, 38, 39, 43, 98
multiple-choice questions 2, 3, 9, 12, 49, 119, 122
NCLB (No Child Left Behind Act, 2002) 70, 82, 167
NCME (National Council on Measurement in Education) 20, 82, 82, 170–171
Newton, P. E. 23
Nickerson, R. S. 150
opportunity costs 125
OTL (opportunity to learn) 115, 129–130, 131, 143, 165
Pace, C. R. 79
Padilla, J. 47
PARCC (Partnership for Assessment of Readiness for College and Careers) 120, 121
PCA (principal components analysis) 37, 40
Pellegrino, J. W. 44
Plake, B. S. 19
Poly-DIMTEST 39
Popham, W. J. 74
potentially disconfirming evidence 25, 126, 151–152
predictive validity 23, 24, 56, 58–61, 60, 75, 149
Presentation Model 33
Principles for the Validation and Use of Personnel Selection Procedures 83–84, 87
Profile of Mood States-II 62
program evaluation 21, 109, 112–114, 113, 120, 123, 128, 134, 171
Program Evaluation Standards 112–113, 114
psychological testing 19, 30–31, 31, 44, 51, 56, 65, 66, 78–79, 147–148, 150, 166; see also depression; Standards for Educational and Psychological Testing
Rayner, K. 50
Reckase, M. D. 76
reliability 20, 21, 32, 36, 43, 61, 63, 64, 67, 94–96, 145, 155–156, 172–173
response processes, evidence based on 27, 44–54, 48, 100, 103, 105
response time 48, 49–50
Riconscente, M. 33
Rosenberg, S. L. 81
rotated factor matrix 38–39, 39
Sackett, P. R. 61
samples, tests as 2–4, 7, 13
Samuelson, K. 19
SAT scores 55–56, 91, 167
Schwartz, N. 49
scoring of tests 3–4, 21, 31, 44, 62, 105, 106; see also meaning of test scores; use of test scores
scree plots 37–38, 37, 38, 42
selection paradox 59
SEM (structural equation modeling) 40, 41–42, 41
semantic network analysis 48, 51
SES (socioeconomic status) 65–66
Shaw, S. D. 23
Shepard, L. A. 23, 27, 70, 78
“show your work” 48, 52–54
Simon, H. A. 49
SIOP (Society for Industrial and Organizational Psychology) 83–84, 87, 90
Sireci, S. G. 59, 81, 169
Smarter Balanced 120, 121
software 35, 39–40, 42, 122, 123
Spearman-Brown formula 36
stakeholder input 78, 79–80, 113–114, 113, 115, 119, 120, 128–129, 155, 157–158, 162
Standards for Educational and Psychological Testing 4, 15, 100, 105, 110, 112, 136, 144, 165, 166, 167, 170; on consequences of test use 82, 116, 142, 161; on fairness 127, 168; on sources of evidence 27, 28, 44, 49, 61, 67, 69, 81, 82, 100, 103–104, 106, 173; on test development 20–21, 21, 105; on validity 1, 19–21, 21, 22, 23–24, 85, 88, 168
Standards for the Assessment of Reading and Writing 70
Steinberg, L. S. 33
Stout, W. 39
Student or Competency Model 32
subscore analysis 36
Sudman, S. 49
Sundström, A. 61
Svetina, D. 39
sympathetic resonance 12–13
Tabachnick, B. G. 40
Talento-Miller, E. 59
Task Model 33
Taylor, D. D. 81
Technical Recommendations for Psychological Tests and Diagnostic Techniques 20
temporal flaw in consequential validity 75–76
Tenopyr, M. L. 69
test administration 4–6, 5, 20, 21, 104–105, 106, 119, 121, 138, 165
test content see content, evidence based on test
test development see development of tests
test scoring see scoring of tests
test use see use of test scores
TESTFACT 39
tests: definition 2–7; format 3–4, 12; as samples 2–4, 7, 13; standardized 4–6, 5
theories of action 169–172
think-aloud protocols 48–49, 48, 100
Thorndike, R. L. 59–60
tuning forks 12, 14, 36
unidimensionality 35, 36, 37–38, 37, 39, 40, 42, 43, 98
unintended consequences of test use 67, 70, 76, 78, 116, 117, 119, 135, 137, 139, 142, 155, 157, 159, 161, 170–171
unintended uses of test scores 161
unitary concept, validity as 22, 23–24, 27, 30, 84
use of test scores 15, 21; conflation with test score meaning 15, 85, 88, 91, 106–107, 109, 165, 170, 173; and consequential validity 67, 68–89; unintended uses 161; and validation of test score meaning 91, 91, 107, 109, 110, 110, 111, 114, 133–140, 134, 141, 142; see also evidence for justifying test score use; justification of test score use
validation 25–27, 30–31, 32–33, 32, 55, 58–59, 65–66, 82, 82, 90–107, 91, 97, 111, 130, 141, 143; definition of 16, 22, 22, 26–27, 135, 151; and justification of test score use 91, 91, 107, 109, 110, 110, 111, 114, 133–140, 134, 141, 142
validity 16–17, 19–67, 82, 82; concurrent 23, 24, 56–58, 81; consequential 23, 67, 68–89, 82, 106; construct 14, 16, 23–24, 27, 30, 67, 68, 70, 135, 151; content 23, 24, 28, 85; definition of 1, 13–15, 17, 92; group vs. individual 167–169; predictive 23, 24, 56, 58–61, 60, 75, 149; as unitary concept 22, 23–24, 27, 30, 84; see also consequences of testing
validity arguments 31–32, 32, 139
validity coefficients 42, 59, 62, 98, 146, 173
value considerations 97–98, 110, 112, 117–118, 125–126, 139, 145–146
van Heerden, J. 79
variation 13, 14–15, 36, 92–96, 94, 95, 98, 138; construct-(ir)relevant 99–101, 100, 142; see also covariation
varimax rotated factor matrix 38, 39
weighing the evidence 144–163, 154, 155
Wiberg, M. 61
Wise, S. L. 169
Wollack, J. A. 168
Wright, B. 8, 22
Yang, H. 61
Zumbo, B. D. 24, 42, 44, 108
Zwick, R. 40