Tests & Measurement for People Who (Think They) Hate Tests & Measurement [4 ed.] 9781071817179, 9781071817193, 9781071817209, 9781071817186



English Pages [361] Year 2023



About SAGE Founded in 1965, SAGE is a leading independent academic and professional publisher of innovative, high-quality content. Known for our commitment to quality and innovation, SAGE has helped inform and educate a global community of scholars, practitioners, researchers, and students across a broad range of subject areas.

Learn more about SAGE teaching and learning solutions for your course at sagepub.com/collegepublishing.

We are here for you.

Teaching isn’t easy. | Learning never ends.

FOURTH EDITION

Cover image: iStock.com/TARIK KIZILKAYA

Tests & Measurement for People Who (Think They) Hate Tests & Measurement

Neil J. Salkind Bruce B. Frey


To Mrs. Hank Snow. Without you, I would have had no country music career!

Tests & Measurement for People Who (Think They) Hate Tests & Measurement Fourth Edition

Neil J. Salkind Bruce B. Frey University of Kansas

FOR INFORMATION:

SAGE Publications, Inc.
2455 Teller Road
Thousand Oaks, California 91320
E-mail: [email protected]

SAGE Publications Ltd.
1 Oliver’s Yard
55 City Road
London EC1Y 1SP
United Kingdom

SAGE Publications India Pvt. Ltd.
B 1/I 1 Mohan Cooperative Industrial Area
Mathura Road, New Delhi 110 044
India

SAGE Publications Asia-Pacific Pte. Ltd.
18 Cross Street #10-10/11/12
China Square Central
Singapore 048423

Copyright © 2023 by SAGE Publications, Inc.

All rights reserved. Except as permitted by U.S. copyright law, no part of this work may be reproduced or distributed in any form or by any means, or stored in a database or retrieval system, without permission in writing from the publisher.

All third-party trademarks referenced or depicted herein are included solely for the purpose of illustration and are the property of their respective owners. Reference to these trademarks in no way indicates any relationship with, or endorsement by, the trademark owner.

Printed in the United States of America

Library of Congress Cataloging-in-Publication Data

Names: Salkind, Neil J., author. | Frey, Bruce B., author.
Title: Tests & measurement for people who (think they) hate tests & measurement / Neil J. Salkind, Bruce B. Frey, University of Kansas.
Other titles: Tests and measurement for people who (think they) hate tests and measurement
Description: Fourth edition. | Thousand Oaks, California : SAGE Publishing, [2023] | Includes index.
Identifiers: LCCN 2022027687 | ISBN 9781071817179 (paperback) | ISBN 9781071817193 (epub) | ISBN 9781071817209 (epub) | ISBN 9781071817186 (pdf)
Subjects: LCSH: Educational tests and measurements.
Classification: LCC LB3051 .S243 2023 | DDC 371.26—dc23
LC record available at https://lccn.loc.gov/2022027687

This book is printed on acid-free paper.

Acquisitions Editor: Helen Salmon
Editorial Assistant: Yumna Samie
Production Editor: Astha Jaiswal
Copy Editor: Taryn Bigelow
Typesetter: C&M Digitals (P) Ltd.
Indexer: Integra
Cover Designer: Candice Harman
Marketing Manager: Victoria Velasquez

22 23 24 25 26 10 9 8 7 6 5 4 3 2 1

BRIEF CONTENTS

Preface
Acknowledgments
About the Authors

PART I • THE BASICS
Chapter 1 • Why Measurement? An Introduction
Chapter 2 • Levels of Measurement and Their Importance: One Potato, Two Potato
Chapter 3 • Reliability and Its Importance: Getting It Right Every Time
Chapter 4 • Validity and Its Importance: The Truth, the Whole Truth, and Nothing but the Truth
Chapter 5 • Scores, Stats, and Curves: Are You Hufflepuff or Ravenclaw?
Chapter 6 • Item Response Theory: The “New” Kid on the Block

PART II • TYPES OF TESTS
Chapter 7 • Achievement Tests: Is Life a Multiple-Choice Test?
Chapter 8 • Aptitude Tests: What’s in Store for Me?
Chapter 9 • Intelligence Tests: Am I Smarter Than My Smart Phone?
Chapter 10 • Personality and Neuropsychology Tests: It’s Not You, It’s Me
Chapter 11 • Career Choices: Have We Got a Job for You!

PART III • CLASSROOM ASSESSMENT
Chapter 12 • Picking the Right Answer: Choose Your Own Destiny
Chapter 13 • Building the Right Answer: Construction Work Ahead

PART IV • RESEARCHER-MADE INSTRUMENTS
Chapter 14 • Surveys and Scale Development: What Are They Thinking?

PART V • FAIR TESTING
Chapter 15 • Truth and Justice for All: Test Bias and Universal Design
Chapter 16 • Laws, Ethics, and Standards: The Professional Practice of Tests and Measurement

Appendix A: The Guide to Finding Out About (Almost) Every Test in the Universe
Appendix B: Answers to Practice Questions
Glossary
Index

DETAILED CONTENTS

Preface
Acknowledgments
About the Authors

PART I • THE BASICS

Chapter 1 • Why Measurement? An Introduction
• Learning Objectives
A Five-Minute History of Testing
So, Why Tests and Measurement?
What We Test
Why We Test
What Makes a Good Test Any Good?
Some Important Reminders
How Tests Are Created
So What’s New?
What Am I Doing in a Tests and Measurements Class?
Ten Ways Not to Hate This Book (and Learn About Tests and Measurement at the Same Time!)
The Famous Difficulty Index
Glossary
Summary
Time to Practice

Chapter 2 • Levels of Measurement and Their Importance: One Potato, Two Potato
• Learning Objectives
First Things First
The Four Horsemen (or Levels) of Measurement
The Nominal Level of Measurement
The Ordinal Level of Measurement
The Interval Level of Measurement
The Ratio Level of Measurement
A Summary: How Levels of Measurement Differ
Okay, So What’s the Lesson Here?
The Final Word(s) on Levels of Measurement
Summary
Time to Practice
Want to Know More?
Further Readings
And on Some Interesting Websites
And in the Real Testing World

Chapter 3 • Reliability and Its Importance: Getting It Right Every Time
• Learning Objectives
Test Scores: Truth or Dare
Getting Conceptual
Putting a Number on Reliability
Computing a Simple Correlation Coefficient
Different Flavors of Reliability
Test–Retest Reliability
Interrater Reliability
Parallel Forms Reliability
Internal Consistency Reliability
Cronbach’s Alpha (or α)
The Last One: Internal Consistency When You’re Right or Wrong, and Kuder-Richardson
How Big Is Big? Interpreting Reliability Coefficients
Things to Remember
And If You Can’t Establish Reliability . . . Then What?
Just One More Thing (and It’s a Big One)
Summary
Time to Practice
Want to Know More?
Further Readings
And on Some Interesting Websites
And in the Real Testing World

Chapter 4 • Validity and Its Importance: The Truth, the Whole Truth, and Nothing but the Truth
• Learning Objectives
A Bit More About the Truth
Reliability and Validity: Very Close Cousins
Different Types of Validity Arguments
Content-Based Validity
Criterion-Based Validity
Construct-Based Validity
And If You Can’t Establish Validity . . . Then What?
A Last Friendly Word
Summary
Time to Practice
Want to Know More?
Further Readings
And on Some Interesting Websites
And in the Real Testing World

Chapter 5 • Scores, Stats, and Curves: Are You Hufflepuff or Ravenclaw?
• Learning Objectives
The Basics: Raw (Scores) to the Bone!
Percentiles (or Percentile Ranks)
What’s to Love About Percentiles
What’s Not to Love About Percentiles
Looking at the World Through Norm-Referenced Glasses
Computing the Mean
Some Things to Remember About the Mean
Computing the Mode
Understanding Variability
Computing the Standard Deviation
Some Things to Remember About the Standard Deviation
The Normal Curve (or the Bell-Shaped Curve)
More Normal Curve 101
The Standard Stuff
Our Favorite Standard Score: The z Score
What’s to Love About z Scores
What’s Not to Love About z Scores
T Scores to the Rescue
Standing on Your Own: Criterion-Referenced Tests
The Standard Error of Measurement
What the SEM Means
Summary
Time to Practice
Want to Know More?
Further Readings
And on Some Interesting Websites
And in the Real Testing World

Chapter 6 • Item Response Theory: The “New” Kid on the Block
• Learning Objectives
The Beginnings of Item Response Theory
This Is No Regular Curve: The Item Characteristic Curve
Test Items We Like—and Test Items We Don’t
Understanding the Curve
Putting a, b, and c Together
Analyzing Test Data Using IRTPRO
Seeing Is Believing
Summary
Time to Practice
Want to Know More?
Further Readings
And on Some Interesting Websites
And in the Real Testing World

PART II • TYPES OF TESTS

Chapter 7 • Achievement Tests: Is Life a Multiple-Choice Test?
• Learning Objectives
How Achievement Tests Are Used
Teacher-Made (or Researcher-Made) Tests Versus Standardized Achievement Tests
Criterion- Versus Norm-Referenced Tests
How to Do It: The ABCs of Creating a Standardized Test
The Amazing Table of Specifications
What They Are: A Sampling of Achievement Tests and What They Do
Validity and Reliability of Achievement Tests
Summary
Time to Practice
Want to Know More?
Further Readings
And on Some Interesting Websites
And in the Real Testing World

Chapter 8 • Aptitude Tests: What’s in Store for Me?
• Learning Objectives
What Aptitude Tests Do
How to Do It: The ABCs of Creating an Aptitude Test
Types of Aptitude Tests
Mechanical Aptitude Tests
Artistic Aptitude Tests
Readiness Aptitude Tests
Clerical Aptitude Tests
What They Are: A Sampling of Aptitude Tests and What They Do
Validity and Reliability of Aptitude Tests
Summary
Time to Practice
Want to Know More?
Further Readings
And on Some Interesting Websites
And in the Real Testing World

Chapter 9 • Intelligence Tests: Am I Smarter Than My Smart Phone?
• Learning Objectives
The ABCs of Intelligence
The Big g
More Than Just the Big g: The Multiple-Factor Approach
“Book Smart” or “Street Smart”? The Three-Way Deal
Way More Than One Type of Intelligence: Howard Gardner’s Multiple Intelligences
Emotional Intelligence: An Idea That Feels Right
From the Beginning: (Almost) All About the Stanford–Binet Intelligence Scale
A Bit of History of the IQ
What’s the Score? Administering the Stanford–Binet (and Other Tests of Intelligence)
And the Fab Five Are . . .
Validity and Reliability of Intelligence Tests
Summary
Time to Practice
Want to Know More?
Further Readings
And on Some Interesting Websites
And in the Real Testing World

Chapter 10 • Personality and Neuropsychology Tests: It’s Not You, It’s Me
• Learning Objectives
What Personality Tests Are and How They Work
Objective or Projective: You Tell Me
Developing Personality Tests
Using Content and Theory
Using a Criterion Group
The Many Dimensions of Factor Analysis
What They Are: A Sampling of Personality Tests and What They Do
What Neuropsychological Tests Are and How They Are Used
Not Just One: The Focus of Neuropsychological Testing
Intelligence
Memory
Language
Executive Function
Visuospatial Ability
Forensic Assessment: The Truth, the Whole Truth, and Nothing but the Truth
What Forensic Assessment Does
Validity and Reliability of Personality Tests
Summary
Time to Practice
Want to Know More?
Further Readings
And on Some Interesting Websites
And in the Real Testing World

Chapter 11 • Career Choices: Have We Got a Job for You!
• Learning Objectives
What Career Development Tests Do
Let’s Get Started: The Strong Interest Inventory
John Holland and the Self-Directed Search
Some Major Caveats: Career Counseling 101
Five Career Tests
Validity and Reliability of Career Development Tests
Summary
Time to Practice
Want to Know More?
Further Readings
And on Some Interesting Websites
And in the Real Testing World

PART III • CLASSROOM ASSESSMENT

Chapter 12 • Picking the Right Answer: Choose Your Own Destiny
• Learning Objectives
Your Old Friends
Multiple-Choice Items
Multiple-Choice Anatomy 101
How to Write Multiple-Choice Items: The Rules
Matchmaker, Matchmaker, Make Me a Match
How to Write Matching Items: The Rules
Are You Lying Now, or Were You Lying Then?
How to Write ’Em: The Guidelines
Supply Items That Score Like Selection Items Are Fill in the ________?
Selected Rules for These Objectively Scored Supply Items
Validity and Reliability of Objectively Scored Items
Summary
Time to Practice
Want to Know More?
Further Readings
And on Some Interesting Websites
And in the Real Testing World

Chapter 13 • Building the Right Answer: Construction Work Ahead
• Learning Objectives
Performance Anxiety: Measuring Skill and Ability
Putting It Together: Constructed-Response Items
What Constructed-Response Items Look Like
Designing Good Constructed-Response Items
Doing Things the Write Way
How to Write Essay Items: The Guidelines
Quality Is Job One: Rubrics
What Does a Good Rubric Look Like?
What’s So Great About Rubrics?
More Than a Number: Portfolios
Validity and Reliability of Constructed-Response and Performance-Based Items
Summary
Time to Practice
Want to Know More?
Further Readings
And on Some Interesting Websites
And in the Real Testing World

PART IV • RESEARCHER-MADE INSTRUMENTS

Chapter 14 • Surveys and Scale Development: What Are They Thinking?
• Learning Objectives
Surveying the Landscape
Steps in Scale Development
How Do I Ask It in the Form of a Question (for $200, Alex)?
Questions About Facts: Where Were You on the Night of the 7th?!
Questions About Attitudes: You Feel Me?
Our Feelings About Attitude: Most Like Likert (but Some Like Thurstone More)
The Thurstone Method
Don’t Ignore Me! Strategies for Increasing Response Rate
Summary
Time to Practice
Want to Know More?
Further Readings
And on Some Interesting Websites
And in the Real Testing World

PART V • FAIR TESTING

Chapter 15 • Truth and Justice for All: Test Bias and Universal Design
• Learning Objectives
The $64,000 Question: What Is Test Bias?
Test Bias or Test Fairness?
Moving Toward Fair Tests: The FairTest Movement
Models of Test Bias
The Difference-Difference Bias
Item by Item
On the Face of Things Model
The Cleary Model
Playing Fair
Universal Design
Riddle Me This: When Are a Building and a Test the Same?
Designing the Best Tests in the Universe
Summary
Time to Practice
Want to Know More?
Further Readings
And on Some Interesting Websites
And in the Real Testing World

Chapter 16 • Laws, Ethics, and Standards: The Professional Practice of Tests and Measurement
• Learning Objectives
What the Government Says
Essa and Nickleby: The Every Student Succeeds Act (ESSA) and No Child Left Behind (NCLB)
Full Inclusion and Universal Access: The Education for All Handicapped Children Act and the Individuals With Disabilities Education Act
The Truth in Testing Law: High-Stakes Testing
Family Educational Rights and Privacy Act: What It Is and How It Works
The Right Way to Do Right
From Whence We Came
How the Pros Roll
What’s in the Standards?
And More Stuff to Be Concerned About (No, Really)
The Flynn Effect: Getting Smarter All the Time
Teacher Competency: So You Think You’re Ready for the Big Time?
School Admissions: Sorry, No Room This Year
Cyril Burt: Are We Born With It?
Summary
Time to Practice
Want to Know More?
Further Readings
And on Some Interesting Websites
And in the Real Testing World

Appendix A: The Guide to Finding Out About (Almost) Every Test in the Universe
Appendix B: Answers to Practice Questions
Glossary
Index

PREFACE

A NOTE TO THE STUDENTS

This is the fourth edition of this book. My friend, mentor, and colleague Neil Salkind was the author of the earlier editions, as well as other popular books in the For Those Who (Think They) Hate series. He created the concept of these friendly books that tackle topics that are famously frightening or, at the very least, have the reputation of being difficult, and present them in understandable, nonthreatening ways. He literally wrote books that “made learning fun.” I was honored that Neil and SAGE chose me to be the new coauthor of this book and others, to continue his legacy after he passed away much too young in 2017.

Thousands of students just like you have found previous editions of this book to be helpful and useful and even, sometimes, funny. As with Statistics for People Who (Think They) Hate Statistics, Neil and I received a great deal of satisfaction helping others understand the kinds of material contained in these pages. Let me share the magic advice we have given in earlier editions: take things slowly, listen in class, work hard, and you’ll do fine.

Tests and measurement courses like the one you are probably enrolled in tend to find students generally anxious, but not very well informed about what’s expected of them. Of course, like any worthwhile topic, learning about tests and measurement takes an investment of time and effort (and there is still the occasional monster for a teacher. But from what Bruce has heard, your teacher’s not so bad.).

Here’s the thing, though. Most of what students have heard (and where most of the anxiety comes from), that courses like this are unbearably difficult, is just not true. So many fear-struck students have succeeded where they thought they would fail. They did it by taking one thing at a time, pacing themselves, seeing illustrations of basic principles as they are applied to real-life settings, and even having some fun along the way. The result? A new set of tools and a more informed consumer and user of tests to evaluate all kinds of behaviors critical in all endeavors in the social and behavioral sciences from teaching to research to evaluation to diagnosis.

So, what’s in store for you in these revised pages is partly what was in earlier editions, but, this time, with a few pretty big changes; it’s the information you need to understand what the field and study of basic tests and measurement are about.


You’ll learn the fundamental ideas about testing and tests, and how different types of tests are created and used. There’s some theory, but most of what we do in these pages focuses on the most practical issues facing people who use tests, such as what kinds of tests are available, what kinds should be used and when, how tests are created and evaluated, and a whole lot about what test scores mean. There’s a bit of math required but very little. (It turns out you can’t talk about measurement without talking about scores!) Anxious about math? Don’t worry, there are a whole lot more words in the pages to come than there are numbers.

The more advanced tests and measurement material is very important, but you won’t find it here. Why? Because at this point in your studies, we want to offer you material at a level we think you can understand and learn with some reasonable amount of effort, without scaring you off from taking future courses. You won’t learn how to calculate the parameters of an item characteristic curve (whatever that is) or use diagnostic classification models (whatever those are) in this book, but if you want to learn why and how tests and measurement can work, and then to understand the material you read in journal articles and what it means to you as a test taker and a test user, this is exactly the place.

Check out the helpful eflashcards of key terms from the glossary at https://edge.sagepub.com/salkindtm4e.

Good luck, and let Bruce know how he can improve this book to even better meet the needs of people just like you. Send him a note at [email protected].

A NOTE TO THE INSTRUCTOR

Thank you for choosing this book for your students (or, if you are looking through this preface and only considering choosing this book, then hurry up and buy it— this ain’t a lending library, you know!). I (that is, Bruce) am so proud to have the opportunity to continue the legacy of Neil’s SAGE books that have helped so many terrified students survive courses that they thought would be boring or hard or both. Of course, most of their success will be due to you, and this book is designed to help you help them.

If you are familiar with earlier editions of this book or our other books in the series, you know that even though we are all about approaching these topics in an . . . um . . . approachable way, Tests & Measurement for People Who (Think They) Hate Tests & Measurement is not meant to be a dumbed-down book similar to others you may have seen. Nor is the title meant to convey anything other than the fact that many students new to the subject are often very anxious about what’s to come. This is not an academic version of a book for dummies or anything of its kind. We have made every effort to address students with the respect they deserve, to not patronize them, and to ensure that the material is approachable. How well we’ve done in these regards is up to you to decide, but allow us to convey our very clear intent and feeling that this book contains the information needed in an introductory course, and even though there is some humor and informality in our approach, nothing about the intent is anything other than serious.

AND NOW, ABOUT THE FOURTH EDITION

Any book is always a work in progress, and this latest edition of Tests & Measurement for People Who (Think They) Hate Tests & Measurement is no exception. The third edition was published several years ago, and many people told Neil and the publisher how helpful this book has been; others told us how they would like it to change and why. In revising this book, Bruce is trying to meet the needs of all audiences. Some things remain the same, but much has indeed changed. When a textbook is revised, we look for new topics that should be covered and even old ones that have become more familiar and newly popular. In any case, there’s always much more to learn and Bruce has tried to select topics that fit.

The biggest changes in the new edition are as follows.

• An entirely new chapter on how social science researchers can develop their own instruments for their own research—Surveys and Scale Development.
• A mostly brand-new chapter on performance-based assessment in the classroom, Building the Right Answer.
• A mostly brand-new chapter on test bias and other issues of equity, Truth and Justice for All.
• A heavily revised and expanded chapter on multiple-choice questions and other selection items in classroom assessment, Picking the Right Answer.
• A new organizational plan to let you more easily focus on what you want. The 16 chapters are grouped into five sections—The Basics, Types of Tests, Classroom Assessment, Researcher-Made Instruments, and Fair Testing— and two appendices feature a giant list of popular commercially produced tests and assessments, and the answers to all the end-of-chapter questions and exercises.
• Key statistical concepts and methods are no longer stuck in the back in an appendix but are now included in a basic statistics chapter and throughout the text where needed.
• The foundational ideas of validity and reliability are in almost every chapter in their own focused sections.
• An expanded discussion in the Item Response Theory chapter, comparing Item Response Theory concepts with Classical Test Theory thinking.
• As the new coauthor, I’ve added my own voice and ideas throughout this new edition.
• We’ve freshened up everything—new examples, updated information, more relevant references, and triple-checked that what we say is still accurate.
• And, of course, any errors we (and you) discovered in the last edition have been fixed!

Any typos and such that appear in this edition of the book are entirely my fault, and I apologize to you and your students who are inconvenienced by their appearance. As we discover any mistakes, you can see them at https://study.sagepub.com/salkindtm4e.

When all is said and done, this edition is more than 50% new, and I truly hope the changes are improvements and increase the value of the book for you and your teaching!

DIGITAL RESOURCES

A companion website for the book at https://edge.sagepub.com/salkindtm4e includes a variety of instructor resources:

• Test bank
• Editable PowerPoint slides
• Tables and figures from the book

Good luck!

Bruce B. Frey
University of Kansas
[email protected]

ACKNOWLEDGMENTS

C. Deborah Laughton was the first editor on the Who Think They Hate books, and Lisa Cuevas Shaw very competently took over when C. Deborah left SAGE. Neil owes more than words can express to them both. When Vicki Knight became editor, the professional treatment of Neil as an author, and the book as a product of great interest, was evident. Very fortunately for all of us, this continued with the new editor, Helen Salmon. Bruce wants to express a great deal of thanks to Helen for shepherding him into this amazing Who Think They Hate series and for her excellent guidance. Successful books are of course about good content and good production and good marketing, but they are mostly about good relationships between authors and editors. We have been very fortunate.

Thanks also to Astha Jaiswal, who directed the production of this book. It looks beautiful. Bruce also is happy to have worked with copy editor Taryn Bigelow. She is super smart and was very patient dealing with Bruce’s somewhat eccentric sense of humor. 😊 Thanks also to others at SAGE, including Katie Ancheta and Chelsea Neve.

SAGE and I gratefully acknowledge the following reviewers for their contributions: Keith F. Donohue, North Dakota State University; Roseanne L. Flores, Hunter College of the City University of New York; Stacy Hughey Surman, University of Alabama; Thomas G. Kinsey, Northcentral University; Steven Pulos, University of Northern Colorado; Edward Schultz, Midwestern State University; Cheryl Stenmark, Angelo State University; and Warren J. White, Kansas State University.

ABOUT THE AUTHORS

Neil J. Salkind received his PhD in human development from the University of Maryland, and after teaching for 35 years at the University of Kansas, he was Professor Emeritus in the Department of Psychology and Research in Education, where he collaborated with colleagues and worked with students. His early interests were in the area of children’s cognitive development, and after research in the areas of cognitive style and (what was then known as) hyperactivity, he was a postdoctoral fellow at the University of North Carolina’s Bush Center for Child and Family Policy. His work then changed direction to focus on child and family policy, specifically the impact of alternative forms of public support on various child and family outcomes. He delivered more than 150 professional papers and presentations, wrote more than 100 trade books and textbooks, and is the author of Statistics for People Who (Think They) Hate Statistics, Theories of Human Development, and Exploring Research. He edited several encyclopedias, including the Encyclopedia of Human Development, the Encyclopedia of Measurement and Statistics, and the Encyclopedia of Research Design. He was editor of Child Development Abstracts and Bibliography for 13 years. He lived in Lawrence, Kansas, where he liked to read, swim with the River City Sharks, work as the proprietor and sole employee of big boy press, bake brownies, and poke around old Volvos and old houses. He died in 2017 at the age of 70.

Bruce B. Frey, PhD, is an award-winning researcher, teacher, and professor of educational psychology at the University of Kansas. He is the editor of The SAGE Encyclopedia of Educational Research, Measurement, and Evaluation and author of There’s a Stat for That!, Modern Classroom Assessment, and 100 Questions (and Answers) About Tests and Measurement. He is the associate editor of SAGE’s Encyclopedia of Research Design. He also wrote Statistics Hacks for O’Reilly Media. His primary research interests include classroom assessment, instrument development, and program evaluation. In his spare time, Bruce leads a secret life as Professor Bubblegum, host of Echo Valley, a podcast that celebrates bubblegum pop music of the late 1960s. The show is wildly popular with the young people.

PART I

THE BASICS

“And in the beginning, there was . . . a test.” Not really, but, in a little bit, you will be surprised to learn just how long ago this whole measurement thing started. That’s what the first part of Tests & Measurement for People Who (Think They) Hate Tests & Measurement is all about—a little history, an introduction to what kinds of tests there are and what they are used for, and then something about how to use this book.

You’re probably new to this, and we’re sure you couldn’t wait for this course to begin. ☺ Well, it’s here now, and believe it or not, there’s a lot to learn that can be instructive and even fun—and immeasurably valuable. Let’s get to it.

1 WHY MEASUREMENT?
An Introduction

Difficulty Index ☺ ☺ ☺ ☺ ☺ (the easiest chapter in the book)

LEARNING OBJECTIVES

After reading this chapter, you should be able to

• List the important events and milestones in the history of tests and measurement.
• Describe the role of testing and social science measurement in our lives and in academia.
• Identify important caveats and issues when studying tests and measurement.
• Give the steps in the developmental process for a good test.
• Defend the need for a stand-alone course in tests and measurement.
• Explain how to use this book and what all those cute smiley faces mean.

It’s been happening to you, and you’ve been doing it, since you were very young. Your whole life, you’ve been tested.


And it started even before you were born. While she was pregnant, your mom probably had the doctor assess you using ultrasound technology to measure your organs and evaluate whether your development was “normal.” (From your very first test-taking experience, measurement experts have insisted on comparing you to other people!) When you were born, the doctor administered the APGAR to assess your Appearance (or color), Pulse (or heart rate), Grimace (or response to stimulation), Activity (or muscle tone), and Respiration (or breathing). You were also screened (and it’s the law in almost every state) for certain types of metabolic disorders (such as PKU, or phenylketonuria). Then there may have been personality tests (see Chapter 10), spelling tests (see Chapter 12), statewide tests of educational progress (see Chapter 7), the ACT (American College Test) or the SAT (which actually is not an acronym—see Chapter 8 for more on this), and maybe even the GRE (Graduate Record Exam). Along the way, you might have received some career counseling using the SVIB (Strong Vocational Interest Blank) and perhaps a personality test or two, such as the MMPI (Minnesota Multiphasic Personality Inventory) or the MBTI (MyersBriggs Type Inventory). My, that’s a lot of testing, and you’re nowhere near done. You’ve still probably got a test or two to complete once you graduate from school, perhaps as part of a job application, for additional studies, or for screening for a highly sensitive job as a secret agent. And we haven’t even begun to list all the medical tests you have and will have been subject to in your lifetime! Testing is ubiquitous in our society, and you can’t pick up a copy of the New York Times, Chicago Tribune, or Los Angeles Times without finding an article about testing and some associated controversy. 
The purpose of Tests & Measurement for People Who (Think They) Hate Tests & Measurement is to provide an overview of the many different facets of testing, including a definition of what tests and measurement is as a discipline and why it is important to study; the design of tests; the use of tests; and some of the basic social, political, and legal issues that the process of testing involves. And when we use the word test, we are referring to any type of assessment tool, assessing a multitude of behaviors or outcomes.

This first part of Tests & Measurement for People Who (Think They) Hate Tests & Measurement will familiarize you with a basic history of testing and the major topics that we as teachers, nurses, social workers, psychologists, parents, and human resource managers need to understand to best negotiate our way through the maze of assessment that is a personal and professional part of our lives. Let’s start at the beginning and take a brief look at what we know about the practice of testing and how we got to where we are.

A FIVE-MINUTE HISTORY OF TESTING

First, you can follow all this history stuff by using the cool time line for what happened when, beginning at the bottom of this page and appearing throughout the chapter. Here’s a summary.

Imagine this. It’s about 2200 years BCE (Before the Common Era), and you’re a young citizen living in a large city in China looking for work. You get up, have some breakfast, walk over to the local “testing bureau,” and sit down and take a test for what we now know as a civil service position (such as a mail carrier). And at that time, you had to be proficient in such things as writing, arithmetic, horsemanship, and even archery to be considered for such a position. Must have been an interesting mail route.

Yep—testing in one form or another started that long ago, and for almost 3,000 years in China, this open (anyone could participate), competitive (only the best got the job) system proved to be the model for later systems of evaluating and placing individuals (such as the American and British civil service systems that started around 1889 and 1830, respectively). Interestingly, this system of selection was abandoned in China around the turn of the 20th century, but we know from our own experience that the use of testing for all different purposes has grown rapidly.

Testing is on the increase by leaps and bounds and it’s not getting any cheaper. The Brookings Institution think tank has estimated that states spend $1.7 billion yearly on just those federally mandated state tests for the K–12 crowd. That’s a ton of money, and the entire endeavor is expected to get even more expensive as tests and their use in school accountability continue to grow in popularity among our leaders.
In terms of important doings related to tests, not much of a formal or recorded nature occurred before the middle of the 19th century, but by about the end of the 19th century, along comes our friend Charles Darwin, whom you may know from some of your other classes as the author of On the Origin of Species (available in the first edition, very good shape, for only about $440,000; but the shipping is free). This book (of which only 11 copies of the first edition have survived) is a groundbreaking work that stressed the importance of what Darwin called “descent with modification” (which we now call evolution). His thesis was that through the process of variation, certain traits and attributes are selected (that is, they survive while others die out), and these traits or attributes are passed on from generation to generation as organisms adapt. So why are we talking about Charles Darwin and biology in a tests and measurement book? Two reasons.

First, Darwin’s work led to an increased interest in and emphasis on individual differences—and that’s what most tests examine. And second, Darwin’s cousin (how’s that for a transition?) Francis Galton was the first person to devise a set of tools for assessing individual differences in his anthropometric (measurements of the human body) lab where one could have all kinds of variables measured, such as height, weight, strength, and even how steadily you can hold your hands. His motto was, “Wherever you can, count.” (And, by the way, Sherlock Holmes’s motto was “Data! Data! Data!” They must have been very busy guys.)

Once physical measurements were being made regularly, it was not long before such noted psychologists as James Cattell were working on the first “mental test.” Cattell was a founder of the Psychological Corporation in the early 1920s, now known as one of the leading publishers of tests throughout the world.

When we get to the 20th century, testing and measurement activity really picks up. There was a huge increase in interest devoted to mental tests, which became known as intelligence tests (and, less accurately, IQ tests) and also included the testing of cognitive abilities such as memory and comprehension. More about this in Chapter 9.

A major event in the history of testing occurred around 1905, when Alfred Binet (who was then the minister of public instruction in Paris) started applying some of these new tools to the assessment of Parisian schoolchildren who were not performing as well as expected. Along with his partner, Theodore Simon, Binet used tests of intelligence in a variety of settings—and for different purposes—beyond just evaluating schoolchildren’s abilities. Their work came to the United States in about 1916 and was extended by Lewis Terman at Stanford University, which is why this still popular intelligence test is called the Stanford-Binet.
(As we write these words, by the way, Bruce is listening to Alvin and the Chipmunks, those adorable hit record makers from the 1950s and 1960s and animated movie stars from the 2000s. It occurs to us that the three chipmunks are named Alvin [kind of like Alfred], Theodore, and Simon. Wonder if their creator, Ross Bagdasarian, was a historian of intelligence tests.) As always, necessity is the mother and father of invention, and come World War II, there was a huge increase in the need to test and classify accurately those thousands of (primarily) men who were to join the armed services. (This occurred around World War I as well but with nowhere near the same amount of scientific deliberation.) Intelligence was one of the primary traits of interest on these tests and a strong correlation between scores on the military’s intelligence test and the eventual rank of test takers was early evidence used to argue for the accuracy of intelligence tests. And, as always, intense efforts at development within the government usually spill over to civilian life, and after World War II, hundreds of different types of tests
were available for use in the civilian sector and made their way into hospitals, schools, and businesses. Indeed, we have come a long way from spelling tests. While all these mental and ability tests were being developed, increased attention was also being paid to other dimensions of psychological functioning, such as personality development (see Chapter 10). People might be smart (or not smart), but psychologists also wanted to know how well adjusted they were and whether they were emotionally mature enough to assume certain important responsibilities. Hence, the field of personality testing got started in earnest (around World War I) and certainly is now a major component of the whole field of tests and measurement. But our brief history of testing does not stop with intelligence or personality testing. As education became more important, so did evaluating achievement (see Chapter 7). For example, in 1937, the then-called Stanford Achievement Tests (or SATs) became required for admission to Ivy League schools (places such as Brown, Yale, and Princeton)—with more than 2,000 high school seniors taking the exam. Another example? In 1948, the Educational Testing Service (known as ETS) opened, almost solely to emphasize the assessment of areas other than intelligence. These are the folks that bring you today’s SAT, GRE, and the always popular and lovable Test of English as a Foreign Language (or TOEFL)—all taken by millions of students each year. It’s no wonder that services offering (and sometimes guaranteeing) testing success began to proliferate around 1945 with Stanley Kaplan. A very smart New Yorker (who was denied admission to medical school), Kaplan started tutoring students in the basement of his home for $0.25 per hour. 
His success (and it’s still a hotly debated issue whether raising test scores after instruction really indicates an increase in actual ability or knowledge) led him to create an empire of test centers (sold off for a bunch of millions to a big test company) that is still successful today. During the COVID-19 pandemic, it became difficult to take college admissions tests like the SAT, ACT, and GRE. As a result, many college programs and whole universities stopped requiring these tests when applying for admission. This changed the way that admissions officers and faculty thought about criteria for admission and predictors of success in college. Many had, for years, questioned the usefulness and fairness of these tests to decide who gets into college, especially the top programs, so there have always been reasons beyond the barriers imposed by COVID-19 to not place a lot of weight on standardized test scores. The question is whether these colleges will ever return to requiring test scores for admission decisions. It looks as if many will not and the role of these tests may have been permanently reduced. What will be the effect of that? The most common assessment that goes on in education, though, is not through standardized tests to predict college performance or demonstrate school effectiveness, it is the use of good old-fashioned classroom tests made by classroom teachers
for their own students. It turns out that there are research-based best practices for making a good test and Chapters 12 and 13 talk about how to write a good quiz or create a scoring rubric that will help to make the grades students get for giving a speech in class less subjective. Today, thousands and thousands of tests (and hundreds of test publishers—see Appendix A) measure everything from Advanced Placement Examination in Studio Art, which is designed to measure college-level achievements in studio arts, to the Health Problems Checklist, which is used to assess the health status and potential health problems of clients in psychotherapy settings. And a new emphasis on the study of neuroscience has led to new evaluative efforts that explore and assess the impact of brain behavior on performance and enable an intense look at the role and function of testing—not without a great deal of controversy about topics such as online testing, fair testing using a common core as the basis for educational valuation, high-stakes testing, test bias, and more.

SO, WHY TESTS AND MEASUREMENT?

This question has a pretty simple answer, but simple does not mean lacking in complexity or significant implications. No matter what profession we enter, be it teaching, social work, nursing, or any one of thousands more, we are required to make judgments every day, every hour, and in some cases, every few minutes about our work. We do it so often that it becomes second nature. We even do it automatically.

In the most straightforward of terms, we use a test (be it formal or informal) to measure an outcome and make sense of that judgment. And because we are smart, we want to be able to communicate that information to others. So if we find that Deion got 100% on a spelling test or a 34 on his ACTs, we want everyone who looks at that score to know exactly what it means.

For example, consider the teacher who records a child’s poor grade in math and sends home some remedial work that same evening, the nurse who sees a patient shivering and takes their temperature, or the licensed clinical social worker who recognizes a client has significant difficulties concentrating and administers a test to evaluate their ability to stay on task and designs an intervention based on that evaluation. These people all recognize a symptom of something that has to be looked into further, and they take appropriate action.

What all these professionals have in common is that in order for them to take action to help the people with whom they work, they need to first assess a particular behavior or set of behaviors. And to make that assessment, they use some kind of test (such as a standardized test in the case of the nurse or a homemade test, as in the teacher’s case) to gather information. Then, based on their training and experience, they use that information to make a decision as to what course of action to take.

For our purposes here, we are going to define a test as a (pick any of the following) tool, procedure, device, examination, investigation, assessment, or measure of an outcome (which is usually some kind of behavior, even if the “behavior” is getting a certain score). A test can take the form of a 50-question multiple-choice history exam or a 30-minute interview of parents on their relationships with their children. It can be a set of tasks that examine how good someone is at fitting together blocks into particular designs or an attitude survey about whether they prefer multigrain Cheerios to plain Cheerios. (The right answer should be plain.) We use tests that come in many different forms to measure many different things.

What We Test

We test many, many different things, and the thousands of tests that are available today cover a wide range of areas. Here’s a quick review of some of the content areas that tests cover. We’ll go into greater detail on each of these in Part II of Tests & Measurement for People Who (Think They) Hate Tests & Measurement. We’ll define these different general areas here, and in Table 1.1 you can see a summary along with some real-world examples.

Achievement tests (covered in Chapter 7) assess an individual’s level of knowledge in a particular domain, like in school. For example, your midterm in history was an achievement test. As Bruce’s grandfather would say, achievement tests measure “book learnin’.”

Personality tests (covered in Chapter 10) assess an individual’s unique and stable set of characteristics, traits, or attitudes. You may have taken an inventory that determined your level of introversion or extraversion. In naming Chapter 10, we use the term psychological tests, which is broader and allows us to include other measures that aren’t technically about personality.

Aptitude tests (covered in Chapter 8) measure an individual’s potential to succeed in an activity requiring a particular skill or set of skills. For example, you may take an aptitude test that assesses your potential for being a successful salesperson. Aptitude tests predict the future.

Ability or intelligence tests (covered in Chapter 9) assess one’s level of skill or competence in a wide variety of areas. For example, intelligence tests are viewed as measures of ability (but don’t be fooled by the name of a test—there are plenty of intelligence tests that are also seen as being aptitude tests—see the upcoming box!).

Neuropsychological tests (covered in Chapter 10) assess the functioning of the brain as it relates to everyday behaviors, including emotions and thinking.

TABLE 1.1  An Overview of What We Test and Some Examples of Such Tests

Type of Test | What It Measures | Some Examples
Achievement | Level of knowledge in a particular domain | Closed High School Placement Test; Early Childhood Assessment; Norris Educational Achievement Test; Test of Adult Basic Education
Personality | Unique and stable set of characteristics, traits, or attitudes | Achievement Motivation Profile; Aggression Questionnaire; Basic Living Skills Scale; Dissociative Features Profile; Inventory of Positive Thinking Traits
Aptitude | Potential to succeed | Differential aptitude tests; Scholastic Aptitude Scale; Aptitude Interest Category; Evaluation Aptitude Test; Wilson Driver Selection Test
Ability or intelligence | Skill or competence | Wechsler Intelligence Scale for Children; Stanford-Binet Intelligence Scales; Cognitive Abilities Test; General clerical ability tests; School Readiness Test
Neuropsychological | How your brain works | Boston Naming Test; Cognitive Symptoms Checklist; d2 Test of Attention; Kaplan Baycrest Neurocognitive Assessment; Ruff Figural Fluency Test
Vocational or career | Job-related interests | Adaptive Functioning Index; Career Interest Inventory; Prevocational Assessment Screen; Rothwell Miller Interest Blank; Vocational Adaptation Rating Scales

Note: You can find out more about many of these tests by going to the Buros Center for Testing at buros.org.

Finally, vocational or career tests (covered in Chapter 11) assess an individual’s interests and help classify those interests as they relate to particular jobs and careers. For example, you may have taken a vocational test that evaluates your level of interest in the health care professions or the culinary arts (or both, which is maybe how we got the name Dr. Pepper?).

There is always a great deal of overlap in the way people categorize particular types of tests and what they assess. For example, some people consider intelligence to be an ability (and would place it under ability tests), whereas others think of it as an achievement test because one aspect of intelligence is the ability to learn or retain information. Or aptitude tests can end up as ability tests as well as personality tests, or they can stand all on their own. And think of college admissions tests, like the ACT or SAT. They ask questions that measure learned information (which sounds like achievement), but they are used to predict success in college (which sounds like aptitude). So what’s right? They are all right. The way we classify tests is strictly a matter of organization and convenience and even a matter of how they are used. The definitions and examples given here reflect the current thinking about tests and measurement. Others feel differently. Welcome to the real world.

Why We Test

Now you know that there are different forms of tests and that there are many different areas of human performance and behavior that are tested regularly. But for what purpose? Here’s a summary of the five main reasons we measure people (and there are surely more).

Tests are used for selection. Not everyone can be a jet pilot, so only those humans (and some smart monkeys) who score at a certain level of performance on physical and psychological assessments will be selected for training.

Tests are used for placement. Upon entering college, not everyone should be in the most advanced math class or in the most basic. A placement test can determine where the individual belongs.

Tests are used for diagnosis. An adult might seek out psychological counseling, and the psychologist may administer a test or group of tests that helps diagnose any one of many different mental disorders. Diagnostic tests are also used to identify individual strengths.

Tests are used to classify. Want to know what profession might suit you best? One of several different tests can provide you with an idea of your aptitude (or future potential) for a career in the culinary arts, auto mechanics, medicine, or child care.

Finally, tests are used in research. To find relationships among variables in the social sciences, tests are often used to assign scores—meaningful quantities—to those variables and then statistical analyses are conducted on those scores.

So, what are tests used for? Tests are used widely for a variety of purposes, among them selection, placement, diagnosis, classification, and measuring research variables.

What Makes a Good Test Good?

Regardless of the purpose of a test, there are two characteristics of quality. A good test is valid and a good test is reliable. We will talk about these concepts throughout this book, but, for now, think of validity as the characteristic of a test that measures what it is supposed to; it works as intended. Reliability refers to a test that produces a score that does not vary randomly; it produces scores that represent typical performance for each person who takes it.

SOME IMPORTANT REMINDERS

You’ll learn many different things throughout Tests & Measurement for People Who (Think They) Hate Tests & Measurement (at least we sure hope you will). And with any vibrant and changing discipline, there are always discussions both pro and con about different aspects of the subject. But there are some constants as well, as presented in the following:

• Some behaviors can be observed more closely and more precisely than others. It’s pretty easy to measure one’s ability to add single digits (such as 6 + 5 = ?), but to understand how one solves (not if one can solve) a simple equation is a different story. The less obvious behaviors take a bit more ingenuity to measure, but that’s part of the challenge (and delight) of doing this.

• Our understanding of behavior is only as good as the tools we use to measure it. There are all kinds of ways we try to measure outcomes, and sometimes we use the very best instruments available—and at other times, we may just use what’s convenient. The development and use of the best tools takes more time, work, and money, but it gives us more accurate and reliable results. Anything short of the best forces us to compromise, and what you see may, indeed, not be what you get.

• Tests and measurement tools can take many different forms. A test can be paper and pencil, computer administered, self-report, observation, performance based, and so on, but often different forms give us very similar information on some outcome in which we are interested. And often, the format a test uses is determined by what it is measuring. For example, most classroom achievement tests are paper and pencil, and most tests that look at performance of motor skills are performance based. The lesson here is to select the form of test that best fits the question you are asking.

• The results of any test should always be interpreted within the context in which they were collected. In many communities, selected middle school students take a practice SAT test. Although some of these students do very, very well, others perform far below what you would expect a high school junior or senior to do; this makes sense because these younger children simply have not yet had the chance to learn the material. To interpret the results of the younger children using the same metric and scoring standards as for the older children would surely not do either group any justice. The point is to keep test scores in perspective—and of course, to understand them within the initial purpose for the testing.

• Test results often can be misused. It doesn’t take a rocket scientist to know that there have been significant controversies over how tests are used. You’ll learn more about this in Part V of Tests & Measurement for People Who (Think They) Hate Tests & Measurement. Did you know, for example, that many non-English-speaking immigrants who tried to get sanctuary in the United States were turned away in the 1930s as having low intellectual abilities based on test scores? Tests written in English! To use tests fairly and effectively, you need to know the purpose of the test, the quality of the test, how it is administered and used, and how the results are interpreted. We’ll explore these issues in Tests & Measurement for People Who (Think They) Hate Tests & Measurement.

Remember, no matter how interesting your theory or approach to a problem, what you learn about behavior is only as accurate and worthwhile as the integrity and usefulness of the tools you use to measure that behavior.
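The reminder about interpreting scores in context can be made concrete with a small sketch. The Python below is only an illustration under stated assumptions: the function name percentile_rank and both lists of scores are invented for this example and are not real SAT data. It shows how the same raw score can look strong against one comparison group and weak against another.

```python
def percentile_rank(score, norm_scores):
    """Percent of the comparison group scoring at or below the given score."""
    at_or_below = sum(1 for s in norm_scores if s <= score)
    return 100.0 * at_or_below / len(norm_scores)


# Invented scores for illustration only; these are NOT real SAT results.
middle_schoolers = [380, 400, 420, 450, 470, 500, 520, 540]
high_school_seniors = [450, 480, 510, 540, 570, 600, 640, 700]

score = 500
print(percentile_rank(score, middle_schoolers))     # 75.0
print(percentile_rank(score, high_school_seniors))  # 25.0
```

In this made-up case, a 500 is well above average among the younger students but below average among the older ones, which is why the comparison group has to match the purpose of the testing.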

HOW TESTS ARE CREATED

We can suggest several books that are all about the theory and mechanics of test construction, and this is not one of them. So instead, we humbly offer this brief summary of how, in general, a test is designed and the steps in the process. Keep in mind that the process shown here is for standardized tests used for important decision making and, in some cases, for tests used for social science research. A smart teacher using a rubric to grade a book report likely skipped a couple of these steps.

The entirety of the process shown in Table 1.2 is linear; that is, step 2 almost always follows step 1, but within each step there is some evaluation of whether it is time to move on to the next step or repeat a previous step (or even just start all over from scratch). Let’s take a look.

TABLE 1.2  A Broad Description of the Steps in the Development of a Standardized Test

Step 1: Choose and define the idea, trait, or characteristic to be measured. Test designers call these abstract ideas constructs (pronounced CON-structs).
Step 2: Determine the best format or method to use (e.g., paper and pencil, performance based, survey, interview, and so on) and what should be covered on the test.
Step 3: Develop a large pool of possible questions or items.
Step 4: Pilot test the pool of possible questions and items. Gather data about the validity and reliability of items.
Step 5: Use data collected during initial pilot testing to revise or improve the items or write new items.
Step 6: Pilot test the revised item pool.
Step 7: Use data collected during the second pilot testing of items to make final choices about which items to use and, if needed, how to group them into scales (a group of items that all measure the same construct).
Step 8: Develop directions and guidelines for administration.
Step 9: Conduct final validity and reliability studies.
Step 10: Use data from the validity and reliability studies to make any final revisions.
Step 11: If needed for this test, conduct norming studies to find out what typical scores or levels of performance are in a population.
Step 12: Develop a test manual to guide people on how to administer the test and interpret test scores.

So What’s New?

Up until the last few decades, the development of almost all tests fell within something called Classical Test Theory (or CTT). The CTT model (and most of this book discusses stuff in the context of that model) primarily looks to increase the accuracy of measuring a test taker’s typical score or true score, which is a theoretical average score a person would get if they took the test an infinite number of times. Because there will be some randomness in a person’s performance unrelated to the actual level of the trait or construct being measured, the score a person gets on a test (what measurement folks call the observed score) is unlikely to be the true score, but might be close to it. The closer that observed score is to the person’s true score, the more reliable the test is.

One alternative to Classical Test Theory is Item Response Theory (IRT), which places the emphasis not on the individual’s performance and the various sources of random error in the testing situation but on the level of reliability in the items themselves. It recognizes that the functioning of an item depends partly on the item and partly on the characteristics of the test taker. For example, on an achievement test, a question will vary in difficulty (and, therefore, in reliability) depending on how much the test taker knows. What’s hard for me might not be hard for you. We’ll distinguish between CTT and IRT (as well as some other new approaches) in Chapter 6.

All you need to know for now is that, as in almost all disciplines, new ideas and techniques are always being developed, almost always interesting, and surely always ripe for discussion and friendly differences among experts, colleagues, and students as to what’s best.
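If it helps to see the true score idea in action, here is a minimal Python sketch. It is an illustration under assumptions, not a method from this book or from any testing package: the function names, the normally distributed error, and all the numbers are made up. The reliability function uses the standard CTT ratio of true-score variance to observed-score variance, and p_correct uses a one-parameter logistic curve as one common way (an assumption here, since this chapter names no specific model) to express the IRT idea that an item's difficulty depends on who is answering it.

```python
import math
import random
import statistics


def observed_score(true_score, error_sd, rng):
    """One test administration under CTT: observed = true score + random error."""
    return true_score + rng.gauss(0.0, error_sd)


def mean_observed(true_score, error_sd, n_administrations, seed=0):
    """Average observed score over many hypothetical retakes of the same test."""
    rng = random.Random(seed)
    scores = [observed_score(true_score, error_sd, rng)
              for _ in range(n_administrations)]
    return statistics.fmean(scores)


def ctt_reliability(true_sd, error_sd):
    """Reliability as the share of observed-score variance due to true scores."""
    true_var = true_sd ** 2
    return true_var / (true_var + error_sd ** 2)


def p_correct(ability, difficulty):
    """One-parameter logistic chance of a correct answer (assumed IRT model)."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


if __name__ == "__main__":
    # A single sitting can miss the true score of 80; many retakes average near it.
    for n in (1, 10, 10_000):
        print(n, round(mean_observed(80, 5, n), 2))
    # Less random error means a more reliable test.
    print(ctt_reliability(true_sd=10, error_sd=5))   # 0.8
    print(ctt_reliability(true_sd=10, error_sd=10))  # 0.5
    # The same item is easier for a more able test taker.
    print(p_correct(1.0, 0.0) > p_correct(-1.0, 0.0))  # True
```

Nobody can actually take a test an infinite number of times, of course, which is why the true score stays theoretical and why measurement folks estimate reliability from real data instead.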

WHAT AM I DOING IN A TESTS AND MEASUREMENT CLASS?

There are probably many reasons why you find yourself using this book. You might be enrolled in an introductory tests and measurement class. You might be reviewing for your comprehensive exams. Or you might even be reading this on summer vacation (horrors!) in preparation and review for a more advanced class. In any case, you’re a tests and measurement student whether you have to take a final exam at the end of a formal course or whether you’re just in it of your own accord. But there are plenty of good reasons to be studying this material—some fun, some serious, and some both. Here’s a list of some of the things our students hear when we teach measurement courses. And your instructor might say similar things.

• Tests and Measurement 101 or Introduction to Testing or whatever it’s called at your school looks great listed on your transcript. Kidding aside, this may be a required course for you to complete your major. But even if it is not, having these skills is definitely a big plus when it comes time to apply for a job or for further schooling. And with more advanced courses, your résumé will be even more impressive.

• If this is not a required course, taking a basic tests and measurement course sets you apart from those who do not. It shows that you are willing to undertake a course that is (traditionally) above average in regard to difficulty and commitment.

• Basic information about tests and measurement is an intellectual challenge of a kind that you might not be used to. A good deal of thinking is required, as well as some integration of ideas and application. The bottom line is that all this activity adds up to what can be an invigorating intellectual experience, because you learn about a whole new area or discipline. Imagine trying to see something that is invisible. That’s what people who use tests are trying to do!

• There’s no question that having some background in tests and measurement makes you a better student in the social, behavioral, and health sciences. Once you have mastered this material, you will have a better understanding of what you read in journals and also what your professors and colleagues may be discussing and doing in and out of class. You will be amazed the first time you say to yourself, “Wow, I actually understand what they’re talking about.” And it will happen over and over again, because you will have the basic tools necessary to understand exactly how scientists reach the conclusions they do.

• If you plan to pursue a graduate degree in education, anthropology, economics, nursing, medicine, sociology, or any one of many social, behavioral, and health sciences fields, this course will give you the foundation you need to move further.

• Finally, you can brag that you completed a course that everyone thinks is the equivalent of building and running a nuclear reactor. (In which Bruce proudly got a C–. He would have done better, but he always pronounced nuclear “new-KYUH-ler.”)

TEN WAYS NOT TO HATE THIS BOOK (AND LEARN ABOUT TESTS AND MEASUREMENT AT THE SAME TIME!)

Yep. Just what the world needs—another tests and measurement book. But this one is different (we think, and others have told us so). It is written for you, it is not condescending, it is informative, and it is as simple as possible in its presentation. It assumes you have only the most basic information about testing and the math of measurement—remember mean, median, and mode? Well, that’s where the math begins.

There has always been a general aura surrounding the study of tests and measurement that it’s a difficult subject to master. And we don’t say otherwise, because parts of it are challenging. On the other hand, millions and millions of students just like you have mastered this topic, and you can, too. Here are a few hints to close this introductory chapter before we move on to our first topic.

1. You’re not dumb. That’s true. If you were, you would not have gotten this far in school. So treat tests and measurement like any other new course. Attend the lectures, study the material, and do the exercises in the book and from class, and you’ll do fine. Rocket scientists know how to use this stuff, but you don’t have to be a rocket scientist to succeed.

2. How do you know tests and measurement is hard? Is this topic difficult? Yes and no. If you listen to friends who have taken the course and didn’t work hard and didn’t do well, they’ll surely volunteer to tell you how hard it was and how much of a disaster it made of their entire semester, if not their lives. And let’s not forget—we always tend to hear from complainers. So I suggest that you start this course with the attitude that you’ll wait and see how it is and judge the experience for yourself. Better yet, talk to several people who have had the class and get a good general idea of what they think. Just don’t base your opinion on one spoilsport’s experience.

3. Form a study group. This is one of the most basic ways to ensure some success in this course. Early in the semester, arrange to study with friends. If you don’t have any who are in the same class as you, then make some new ones or offer to study with someone who looks to be as happy about being there as you are. Studying with others allows you to help them if you know the material better or to benefit from others who know the material better than you do. Set a specific time each week to get together for an hour and go over the exercises at the end of the chapter or ask questions of one another. Take as much time as you need. Find a coffee shop and go there with your study buddy. Studying with others is an invaluable way to help you understand and master the material in this course.

4. Stay on task and take one thing at a time. Material about testing and measurement can be tough to understand, especially if you have never heard any of these terms before or thought about any of these ideas. Follow the guidelines mentioned here and talk with your teacher as soon as you find yourself not understanding something or falling behind.

5. Ask your teacher questions, and then ask a friend. If you do not understand what you are being taught in class, ask your professor to clarify it. Have no doubt—if you don’t understand the material, then you can be sure that others do not as well. More often than not, instructors welcome questions. And especially because you’ve read the material before class, your questions should be well informed and help everyone in class better understand the material.

6. Do the exercises at the end of a chapter. The exercises are based on the material and the examples in the chapter they follow. They are there to help you apply the concepts that were taught in the chapter and build your confidence at the same time. If you can answer these end-of-chapter exercises, then you are well on your way to mastering the content of the chapter.

7. Yes, it’s a very old joke: Q. How do you get to Carnegie Hall? A. Practice. Well, it’s no different with basic statistics. You have to use what you learn and use it frequently to master the different ideas and techniques. This means doing the exercises in the back of the chapter as well as taking advantage of any other opportunities you have to understand what you have learned.

8. Look for applications to make it more real. In your other classes, you probably have occasion to read journal articles, talk about the results of research, and generally discuss the importance of the scientific method in your own area of study. These are all opportunities to look and see how your study of tests and measurement can help you better understand the topics under class discussion as well as the area of beginning statistics. The more you apply these new ideas, the better and more full your understanding will be.

9. Have fun. This indeed might seem like a strange thing for you to read, but it all boils down to you mastering this topic rather than letting the course and its demands master you. Set up a study schedule and follow it, ask questions in class, and consider this intellectual exercise to be one of growth. Mastering new material is always exciting and satisfying; it’s part of the human spirit. You can experience the same satisfaction here. Just keep your eye on the ball and make the necessary commitment to stay current with the assignments and work hard.

10. Finally, be easy on yourself. This is not material that any introductory student masters in a matter of hours or days. It takes some thinking and some hard work, and your expectations should be realistic. Expect to succeed in the course, and you will.

Every now and then, but not often, you’ll find steps like the ones you see here. This indicates that there is a set of steps coming up that will direct you through a particular process. These steps have been tested and approved by the federal agency that oversees staircases, escalators, and small step stools.

The Famous Difficulty Index

For want of a better way to give you some up-front idea about the difficulty of the chapter you are about to read, we have developed a highly secret difficulty index using smileys. (Harvey Ross Ball claimed to have invented smiley faces in 1963, by the way. Please don’t tell him we are using them. We cannot handle a lawsuit right now.) This secret code lets you know what to expect as you begin reading. Remember, the MORE smiling faces the EASIER the material!

How Hard Is This Chapter?              Look at Mr. Smiley!
Very hard                              ☺
Hard                                   ☺☺
Not too hard, but not easy either      ☺☺☺
Easy                                   ☺☺☺☺
Very easy                              ☺☺☺☺☺

GLOSSARY
Bolded terms in the text are included in the glossary at the back of the book.

Summary
Now you have some idea about what a test is and what it does, what areas of human behavior are tested, and even the names of a few tests you can throw around at tonight’s dinner table. But most of all, we introduced you to a few of the major content areas we will be focusing on throughout Tests & Measurement for People Who (Think They) Hate Tests & Measurement.

Time to Practice

1. What are some of your memories of being tested? Be sure to include (if you can) the nature of the test itself, the settings under which the test took place, how prepared or unprepared you felt, and your response upon finding out your score.

2. Go to the library (virtually or in person) and identify five journal articles in your area of specialization, such as teaching math or nursing or social work. Now create a chart like this for each set of five.

   Journal Name    Title of Article    What Was Tested    What Test Was Used to Test It?
   1.
   2.
   3.
   4.
   5.

   a. Were most of the tests used developed commercially, or were they developed just for this study?
   b. Which test do you think is the most interesting, and why?
   c. Which test do you think got the closest to the behavior that the authors wanted to measure?

3. Ask your parent, child, professor, colleague, or classmate what they believe are the most important reasons for testing and what types of tests they can identify.

4. One of the things we did in this opening chapter was identify five different purposes of tests (see the section, “Why We Test”). Think of at least two other ways that tests might be used, and give a real-world example of each.

5. Interview someone who uses tests in their work, as either an assessment or a research tool, and try to get an idea of the importance they place on being knowledgeable about testing and what role it plays in their research and everyday professional career. Are they convinced that tests assess behavior fairly? Do they use alternatives to traditional testing? Do they find the results of tests useful for helping students?

6. Extra credit and extra imagination: Use your favorite search engine and search on five different topics related to testing in general, such as fairness in testing, use of computerized testing, how tests are developed, and so on. Use your imagination and search as broadly as possible. Summarize the results of these searches and propose some directions you think testing might be taking in future activities.

2 LEVELS OF MEASUREMENT AND THEIR IMPORTANCE
One Potato, Two Potato

Difficulty Index ☺ ☺ ☺ (a bit harder than Chapter 1 but easily understood)

LEARNING OBJECTIVES
After reading this chapter, you should be able to
• Define the basic testing terms of variable and measurement.
• Explain the four levels of measurement.
• Compare and contrast the four levels of measurement.
• Give examples of how researchers choose to measure variables using different levels of measurement.

How things are measured is very important to our study of tests and measurement. And since measuring things basically means slapping a score on them, the nuts and bolts of the numbers that are used to score tests and how they come to be are especially important.

What test developers and social science researchers call levels of measurement refers to the way that numbers are assigned to represent variables. There are low levels of measurement, which provide very little information about how people differ from each other on some variable, and there are high levels of measurement, which tell a lot about how people differ from each other on some variable. A single variable can be measured using four different levels of measurement and the higher the level of measurement, the more powerful the types of analysis one can use. For example, in one study, researchers from the University of Kansas examined how the very interesting concepts of self-determination and self-concept have an effect on academic achievement for adolescents with learning disabilities. Using trusted and field-tested assessment tools, they found significant relationships among the three variables of self-determination, self-concept, and academic achievement, with self-determination being a useful predictor of academic achievement for these students. Because the variables were quantified using a high level of measurement, cool statistics like correlation coefficients and a sophisticated approach called multiple regression could be used. Otherwise, these relationships might have been completely missed. Want to know more? Go to Google Scholar and look up Zheng, C., Erickson, A., Kingston, N., & Noonan, P. (2014). The relationship among self-determination, self-concept, and academic achievement for students with learning disabilities. Journal of Learning Disabilities, 47, 462–474.

FIRST THINGS FIRST

Before we start talking about levels of measurement, let’s spend a moment defining a few important terms—specifically, what a variable is and what the term measurement means. A variable is anything (such as a test score) that can take on more than one value. For example, a score on the SAT can take on more than one value (such as 750 or 420), as can what category of color your hair falls into (such as red or black). Age is another variable (someone can be 2 months old or 87 years old), as is favorite flavor of ice cream (such as rocky road or mint chocolate chip). Notice that the labels we apply to outcomes can be quantitative (such as 87 years) or qualitative (such as rocky road or mint chocolate chip). Variables can be represented by quantities or by different categories (which is what qualitative means).

Good, that’s out of the way. Now, the term measurement means the assignment of labels to (you guessed it) a variable or an outcome. So when we apply the label “blue” to a particular outcome, we are measuring that outcome. We can measure the number of windows in a house, the color of a car, the score on a test of memory, and how fast someone can run 100 yards. In every case, we are assigning a label to an outcome. Sometimes that label is precise (such as 10.7 seconds for the time it takes to run 100 yards), and sometimes it is less precise (such as “like” for how someone feels about a presidential candidate).

There is a great deal of discussion (and of course controversy) over what some levels mean. For example, the biological sex of a child at birth is designated (measured) as male or female. But the gender label of that individual may differ from the biological designation, because a gender label is often socially or psychologically determined (male and female might still be used as categories but not be based on body parts, or perhaps gender is on a continuum somewhere between male and female). These are interesting and provocative options for measurement folks that represent different theoretical constructs, yes, but also different levels of measurement.

As the world turns, about 75 years ago, in 1946, S. S. Stevens, a famous experimental psychologist (not a steamship), started to wonder about how different types of variables were measured and whether the level of precision at which those variables were measured made a difference. He wanted to know whether a set of rules could be developed to categorize variables based on how their scores were assigned. And coming in at about 10 pounds 11 ounces—the idea of levels of measurement was born. Or to be more precise, 171 ounces. Or to be less precise, about 11 pounds.

Variables can be measured in different ways, and the way a variable is being measured determines the level of measurement being used. For example, we can measure the height of an individual in several different ways. If we say only that Bruce is taller than Neil, then we have chosen to measure this variable at a level that distinguishes one person from another only in a rank order sort of way—one person is less tall than another, but we don’t know by how much. On the other hand, if we chose to measure this variable in a way that differentiates people by a certain quantity of inches, that is much more precise and much more informative. It’s possible that Bruce is only barely taller than Neil (maybe just an inch), not a lot taller. Those are different interpretations of the same variable because two different levels of measurement were used.
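The height example can be sketched in a few lines of Python. This is only an illustration: the names come from the text, but the heights in inches are invented, and the variable names are our own.

```python
# Hypothetical heights in inches; only the names come from the text.
heights_in = {"Bruce": 71, "Neil": 70}

# Rank-order measurement: keep only who is taller (1 = tallest).
ordered = sorted(heights_in, key=heights_in.get, reverse=True)
ranks = {name: position for position, name in enumerate(ordered, start=1)}

# Measurement in inches: keep the quantities, so the size of the gap is known.
difference_in = heights_in["Bruce"] - heights_in["Neil"]

print(ranks)          # tells us who is taller, but not by how much
print(difference_in)  # tells us how much taller, in inches
```

The ranks would look exactly the same if Bruce were a foot taller than Neil, which is the information that the rank-order approach throws away.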

THE FOUR HORSEMEN (OR LEVELS) OF MEASUREMENT

A level of measurement represents how much information is being provided by the outcome measure. There are four levels of measurement—nominal, ordinal, interval, and ratio—and here’s more about each.

The Nominal Level of Measurement

The nominal level of measurement describes a measurement system where numbers are used to identify different categories but not used as quantities. These are variables that are categorical or discrete in nature. These “labels” (the scores) are qualitative in nature, and people (or objects or whatever) can be placed in one and only one category, which is why such scores are mutually exclusive. You can’t be in one reading group (the Guppies) and another reading group (the Sharks) at the same time.

For example, Max is in the Guppies and Aya is in the Sharks. Imagine we use 1 to mean Guppies and 2 to mean Sharks. Knowing that Max’s score is a 1 and Aya’s score is a 2 tells us only that Max and Aya are in two different reading groups—not how the groups differ, or who is better, or how well Max or Aya can read—just that they are in two different groups. It counts as measurement because it differentiates people on a variable, but, gosh, it sure doesn’t tell us much by itself. It is barely better than no information at all.

We call this level the nominal level of measurement (after the Latin word nomen, meaning name), because the only distinction we can make is that variables differ in the category in which they are placed. Measuring a variable such as reading ability by group assignment (and nothing more) is similar to measuring political attitudes using the labels Republicans and Democrats; measuring income by seeing if people are Honda Civic or Tesla drivers; and measuring preferences for fresh fruit by asking if people shop at Kroger’s or the Piggly-Wiggly. They are “measured” only by the nature of the group to which they belong. (When writing this paragraph, Bruce paused to celebrate the first time in his life he had ever typed the phrase “piggly-wiggly.” Huzzah.)

Want to know more about how these variables differ? Well, you can’t just by knowing in which category a person belongs, and that’s the big significant limitation with the nominal level of measurement. If you want more information, then you have to dig more or define (and measure) the variable in a more precise way, which we will do shortly.
Take a second to really think about the limitation of using nominal levels of measurement in social science research. It’s a major weakness with this approach, right? But (and you might want to sit down for this) think about how often in science we compare two groups and look for differences! Like comparing an experimental group that received a new drug with a control group that did not on some health outcome (like blood pressure). That variable from which we create the two groups is measured at the nominal level! We might be missing a lot of potential discoveries that we would make if we measured all of our research variables at higher levels of measurement than the very imprecise nominal level. (In this example, imagine if everyone in the study got some slightly different amount of the new drug. Not all or nothing. There’s a lot more information in that variable.)

Of course, sometimes variables should only be measured at the nominal level because that makes the most sense. An example of a study that used a nominal-level variable is one conducted by Rik Verhaeghe and his colleagues, detailed in a 2003 article that appeared in Stress and Health: Journal of the International Society for the Investigation of Stress (you know, that magazine we all subscribe to). They examined job stress among middle-aged health care workers and its relation to absences due to sickness. One assessment that had to be made was based on the assignment of people to one of two groups—nurses or non-nurses—and that variable was measured at the nominal level. As you can see, a participant can be in only one group at a time (they are mutually exclusive), and for this variable (occupation) you can assign only a label (nurse or non-nurse). People don’t vary in the amount of nurse they are.

Want to know more? Take a look at Verhaeghe, R., Mak, R., VanMaele, G., Kornitzer, M., & De Backer, G. (2003). Job stress among middle-aged health care workers and its relation to sickness absence. Stress and Health: Journal of the International Society for the Investigation of Stress, 19(5), 265–274.

There’s always a ton of discussion about measurement levels and their utility, starting with the definition of variables and how those variables are measured. And in some cases, there’s doubt that the nominal level of measurement is a level of measurement at all rather than only a qualitative description of some outcome. So, if it doesn’t make sense to you to treat categorization as measurement just because numbers are used as the names of categories (e.g., 1 = nurse, 0 = not a nurse), that’s okay. But here are two things to remember about nominal-level measurement.

1. The categories in which measures can be placed on the nominal scale are always mutually exclusive. You can’t be in the red preschool room and the blue preschool room at the same time.

2. Nominal-level measures are always qualitative (the values are categories and not quantities). Being in the red room is neither here nor there, as is being in the blue room. You’re a preschooler in either room, and that’s it; the room assignment says nothing about anything other than that—just which room you are in. If, in your data, a 2 means blue and a 1 means red, blue isn’t twice as much as red just because a 2 is twice as much as 1 mathematically. (Note to editor: Please check our math here.) The numbers are only names.
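To see why the numbers are only names, here is a minimal Python sketch using the reading-group coding from earlier in this section (1 = Guppies, 2 = Sharks). The roster beyond the two students named in the text, and the alternative codes 20 and 10, are invented for illustration.

```python
from collections import Counter

# Coding scheme from the text: 1 = Guppies, 2 = Sharks.
GROUP_NAMES = {1: "Guppies", 2: "Sharks"}

# Hypothetical roster; everyone except the first two students is made up.
codes = {"Max": 1, "Aya": 2, "Student C": 2, "Student D": 1, "Student E": 2}

# Meaningful at the nominal level: same or different, and how many per category.
same_group = codes["Max"] == codes["Aya"]
counts = Counter(GROUP_NAMES[code] for code in codes.values())

# Swapping in arbitrary new codes changes nothing that matters,
# which is what "the numbers are only names" means in practice.
relabel = {1: 20, 2: 10}
recoded = {name: relabel[code] for name, code in codes.items()}
same_group_recoded = recoded["Max"] == recoded["Aya"]

# Not meaningful: an "average group" of 1.6 describes no reading group at all.
meaningless_mean = sum(codes.values()) / len(codes)

print(same_group, counts, same_group_recoded, meaningless_mean)
```

Counting and comparing categories survive any relabeling; arithmetic on the codes does not.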

The Ordinal Level of Measurement

Next, we have the ordinal level of measurement, which describes how variables can be ordered along some type of continuum. (Get it? Ordinal, as in ordering a set of things.) So outcomes are placed in categories (like the nominal scale), but they also have an order or rank to them, like stronger and weaker, taller and shorter, faster and slower, and so on.

For example, let’s take Max and Aya again. As it turns out, Max is a better reader than Aya. Right there is the one and only necessary criterion for a measure to be at the ordinal level of measurement. It’s the “better than” or “worse than” thing—some expression of the relationship between categories. However, from better or worse, we cannot tell anything about how good a reader either Max or Aya is, because ordinal levels of measurement do not include this information. If we are measuring reading ability, and scores are assigned based on rank ordering all students in a classroom from best reader to worst reader (which is an ordinal-level approach), we can know if Max scores higher than Aya or lower than some other student, but we don’t know how good a reader he is or Aya is or any student is, just how they compare to each other. Max, Aya, and, heck, the whole class might be great readers (or might be not-so-great), but we don’t know by looking at scores using the ordinal level of measurement.

Our real-world example of ordinal-level measurement is a study by Kathe Burkhardt and her colleagues that appeared in the journal Behaviour Change. They examined common childhood fears in 9- to 13-year-old South African children, and one of the ways they assessed fears was by having children rank them. In fact, the researchers found that the children’s rankings of fears differed from rankings derived using a scale that attached an actual value to the fear.

Want to know more? Take a look at Burkhardt, K., Loxton, H., & Muris, P. (2003). Fears and fearfulness in South African children. Behaviour Change, 20(2), 94–102.
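The point that a whole class might be great readers (or not-so-great) without the ranks showing it can be sketched in Python. The two classrooms, their scores, and the third student are hypothetical, and the helper function is our own; it assumes there are no tied scores.

```python
def rank_best_to_worst(scores):
    """Return {name: rank}, where rank 1 is the highest score (no ties assumed)."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {name: rank for rank, name in enumerate(ordered, start=1)}

# Two hypothetical classrooms with invented reading scores.
strong_class = {"Max": 98, "Aya": 95, "Sam": 91}
weak_class = {"Max": 40, "Aya": 22, "Sam": 5}

strong_ranks = rank_best_to_worst(strong_class)
weak_ranks = rank_best_to_worst(weak_class)

print(strong_ranks)
print(weak_ranks)
```

Both classrooms produce the same ranks even though the underlying scores are very different, so the ordinal scores alone cannot tell us how well anyone reads.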

The Interval Level of Measurement

That’s two levels of measurement down and two to go. The interval level of measurement gives us a nice jump in the amount of information we obtain from a new level of measurement. (And, SPOILER ALERT, in real-world research the interval level of measurement is the most popular approach.) You already know that we can assign names (nominal level) and rank (ordinal level), but it is with the interval level of measurement that we can assign a value to an outcome that is based on some underlying continuum that has equal “intervals.” And if there is an underlying continuum, then we can make very definite statements about someone’s position (their score) along that continuum and their position relative to another person’s position, including statements about such things as differences. Wow, that’s a lot more complex than the earlier two levels of measurement and provides a lot more information as well.

Scores at this level tell us that each score is different from others in some way (like nominal level does) and that the scores represent more and less of some quantity (like ordinal level does), but now the quantities are precise enough that the distance between any two adjacent scores on the scale is equal. For instance, on a Celsius thermometer, the difference between 48 degrees and 47 degrees is one degree of heat (whatever that means) and the difference between 34 degrees and 33 degrees is also one degree of heat. Everywhere along the scale, there is an equal amount of difference in the variable (heat) between any two scores that are side by side. In other words, a mathematical difference between two scores is the same as the quantitative difference between units of the variable.

To understand more about interval-level measurement, we look at the architectural origins of the word “interval.” Interval, or inter val, means between walls. It describes the top of towers and such on old castles. Notice those little stone protective barriers at the top of the towers in Figure 2.1? A well-designed fortress like a castle needed protection for the guards at the gate shooting arrows (or throwing rocks or whatever; what are we, medievalists?), but there also had to be openings to shoot those arrows at the attackers. And the best design had equal spacing between those barriers all the way around because you never know where they will be coming from. So those intervals have equal spacing, like an interval-level scale!

Using our Max and Aya reading-ability example, not only do we know that Max and Aya fall into two different categories of readers and that Max is a better reader than Aya, but we can also now know how much better Max actually is. Let’s assume that reading ability was measured using a reading comprehension test. Imagine Max got 82 points out of 100 possible on the test and Aya got 42 points. Because one of the assumptions of this level of measurement is that it is based on a scale that has equal-appearing intervals, we can say not only that Max got 40 points more than Aya (which is true mathematically) but also that Max has 40 points more of reading ability. Now, what that means in a theoretical sense is determined by the test developer or the researcher.

Although an interval-level scale provides much more information than an ordinal- or nominal-level scale, you have to be careful in how you interpret these scores. For example, scoring 50% higher on a history test does not mean that score represents 50% more knowledge (unless the test is a perfect, perfect, perfect representative of all the questions that could be asked). Rather, it means only that 50% more of the questions were answered correctly. We can conclude that the more questions correct, the better one is in history, but don’t carry it too far and overgeneralize from a test score to an entire area of knowledge or a construct such as intelligence.
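The arithmetic behind these statements is simple enough to write out. The sketch below uses the two reading scores (82 and 42) and the Celsius readings from this section; the variable names are our own.

```python
# Reading comprehension scores from the example in this section (out of 100).
higher_score, lower_score = 82, 42
gap = higher_score - lower_score

# Equal intervals: a one-degree step is the same size anywhere on the Celsius scale.
step_near_top = 48 - 47
step_lower_down = 34 - 33

print(gap)                              # the difference in points
print(step_near_top == step_lower_down) # the steps are the same size
```

Differences like these are interpretable at the interval level. Whether a 40-point gap means "40 points more reading ability" in any deeper sense still depends on how well the test represents the construct.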

FIGURE 2.1  “Intervals” on a Castle Gate

What’s the big advantage of the interval level of measurement over the nominal and ordinal besides increased precision? In one word, information—there’s much more of it when we know what a score actually is and what it means. Remember, Max could be ranked number 1 in his class but get only 50% of the questions correct on the reading comprehension test. On the other hand, knowing what his exact score is relative to some type of underlying continuum provides us with an abundance of information when it comes time to make a judgment about his performance. And, as we will see later in Chapter 5, once you are at the interval level, you can start to use all sorts of fancy statistics to understand the scores, like means, percentile ranks, and such.

Shintaro Sato at Montclair State University and colleagues were interested in tourists who visit sports venues and the relationship between the perceived value of the experience and their loyalty. Measuring variables at the interval level allowed them to use advanced statistical analyses involving correlations. (Unless you are at the interval level of measurement, correlational analyses are much less powerful.) The researchers found that for experienced travelers, the emotional response related to the experience was the best predictor of planning to return again, but for novice travelers, it was perceptions of quality that were most important.

Want to know more? Sato, S., Gipson, C., Todd, S., & Harada, M. (2018). The relationship between sport tourists’ perceived value and destination loyalty: An experience-use history segmentation approach. Journal of Sport & Tourism, 22(2), 173–186.

The Ratio Level of Measurement

This is by far the most informative level of measurement, yet the one that is least likely to be seen in the social or behavioral sciences. Why? Because the ratio level of measurement has all the characteristics of the other scales we have already talked about, but it also includes a very important assumption—an absolute zero corresponding to an absence of the trait or characteristic being measured. Physical measurements such as amount of rainfall, weight, and height fall under the ratio level of measurement. Or any time you count things—like the number of dogs on a porch.

The scale and its use become interesting when we begin to look at nonphysical attributes or behaviors. For example, it is possible to receive a score of zero on a reading comprehension test, right? But here’s the big question: Does getting such a score indicate that one has no reading ability? They cannot read at all, even with a little accuracy? Of course not. It means only that on this test, they scored really, really low. (Maybe you are the sort of student who is about to raise your hand and say something like, “But what about a one-day-old baby?! They have zero reading ability!!” Yes, but the test isn’t designed for one-day-old babies. Nice try.)

That’s the challenge, then. Is there any trait or characteristic in the behavioral or social sciences that an individual can have a complete absence of? If there is not, then a ratio level of measurement is impossible. In fact, this is one reason why, when this category of tests and measurement is taught, the interval and ratio levels often are combined into one. We’re not doing that here, because we think they are important enough to keep separated.

Now, in the physical and biological sciences, it’s not as much of an issue or challenge. Consider temperature as measured on the Kelvin scale. This scale starts at 0, which means no heat at all (somewhere in deep space maybe) and goes up, up, up. The surface of the sun is almost 6,000 degrees on the Kelvin scale. There are no negative values among those possible scores. Compare that to Celsius or Fahrenheit scoring systems for temperature. It can be negative degrees on those scales, like –20 degrees Fahrenheit at the North Pole or somewhere else. (Notice with the variable of heat, one can choose to measure it at the interval level, as with the Fahrenheit system, or choose to measure it at the ratio level. Researchers and test developers have to decide which makes the most sense theoretically.)

How about finger tapping as a measure of responsiveness? It is entirely possible to have no finger taps. Both the number of finger taps and the Kelvin scale have a true zero. But even if someone doesn’t get anything correct on an intelligence test (perhaps one taken in a language from Mars), does that mean they have no intelligence? Of course not.

By the way, we call it ratio level, because proportions and fractions have meaning. You can tap your fingers twice as much as your classmate when you are nervous, so comparing your finger tapping score of 120 (taps per minute) to their score of 60 by saying you tap twice as often is reasonable and makes sense. But if it is 30 degrees Fahrenheit today and it was 15 degrees yesterday, we don’t say it is twice as hot today. Because zero on the Fahrenheit scale does not mean an absence of heat (scores can even go below zero), we aren’t allowed mathematically to make ratio comparisons like that. But on the Kelvin scale we can!

For social and behavioral scientists (like us, and like you, probably), we will rarely (if ever) see a ratio-level scale in the journal articles we review and read. The scale of measurement simply depends on how the variable is being defined and measured.
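A short Python sketch makes the Fahrenheit-versus-Kelvin point concrete. The temperatures and tapping rates are the ones used in this section; the function name is our own, and the conversion is the standard formula K = (F − 32) × 5/9 + 273.15.

```python
def fahrenheit_to_kelvin(deg_f):
    """Convert a Fahrenheit temperature to the Kelvin scale."""
    return (deg_f - 32) * 5 / 9 + 273.15

today_f, yesterday_f = 30, 15

# The tempting (but misleading) ratio computed directly on Fahrenheit scores.
naive_ratio = today_f / yesterday_f

# The same two days on a scale with a true zero.
kelvin_ratio = fahrenheit_to_kelvin(today_f) / fahrenheit_to_kelvin(yesterday_f)

# Finger taps per minute already have a true zero, so this ratio is meaningful.
taps_ratio = 120 / 60

print(naive_ratio, round(kelvin_ratio, 3), taps_ratio)
```

On the Kelvin scale, today comes out only about 3% warmer than yesterday, not twice as hot, while the finger-tapping ratio of 2 can be taken at face value.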

A SUMMARY: HOW LEVELS OF MEASUREMENT DIFFER

We just discussed four different levels of measurement and what some of their characteristics are. You also know by now that a more precise level of measurement has all the characteristics of an earlier level and provides more information as well. In Table 2.1, you can see a summary that addresses the following questions:

1. Are you measuring most of the available information?
2. Can you assign a name to the variable being measured?

TABLE 2.1  A Summary of Levels of Measurement and the Characteristics That Define Them (☺ = yes, ☹ = no)

Level of Measurement | How Much Information About the Variable Do You Get? | Can You Assign Names to the Different Scores or Outcomes? | Can You Assign an Order to the Different Scores or Outcomes? | Can You Assign an Underlying Quantitative Scale to the Different Scores or Outcomes? | Can You Assign an Absolute Zero to the Scale You Are Using?
Ratio | A great amount | ☺ | ☺ | ☺ | ☺
Interval | A good amount | ☺ | ☺ | ☺ | ☹
Ordinal | Some | ☺ | ☺ | ☹ | ☹
Nominal | A little | ☺ | ☹ | ☹ | ☹

3. Can you assign an order to the variable being measured?
4. Can you assign an underlying quantitative scale to the variable being measured?

Remember, the more precise your level of measurement, the more information is conveyed. This table shows us that the ratio level of measurement allows us to answer yes (☺) to these four questions, whereas the nominal level of measurement allows us to answer yes to only one.

OKAY, SO WHAT’S THE LESSON HERE?

The lesson here is that, when you can, try to select a technique for measuring a variable that allows you to use the highest level of measurement possible (most often the interval level). We want to access the most information available with the most precision while understanding that as the scale of measurement changes in precision, the way the variables are being measured will probably change in level of complexity as well, as you can see in Figure 2.2. As variables are measured in more sophisticated ways and become more complex in their definition and nature, they lend themselves better to higher scales of measurement (such as interval or ratio) than do variables that are less complex.

FIGURE 2.2  Complexity and Precision and Scales of Measurement
[Figure: the four scales arranged from low to high on both complexity and precision, in the order nominal scale, ordinal scale, interval scale, ratio scale.]

For example, when testing the effectiveness of strength training in senior citizens, don’t classify them as weak or strong after the intervention is over and after they have been tested. Rather, try to get a ranking of how strong they are, and even better, try to get an actual number associated with strength, like the amount of weight they can lift. That provides much more information and makes your entire quest for knowledge a more powerful one. But the real world sometimes demands that certain outcomes be measured in certain ways, and that limits us as to the amount of information available. For example, what if you wanted to study prejudice? You may not be able to ascertain anything more than placing participants into ordinal levels called very prejudiced, somewhat prejudiced, and not at all prejudiced. Not as much information as we might like but not bad either. It is what it is.
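The strength-training example can be sketched in a few lines of Python. Everything in the sketch is made up for illustration: the five names, the weights, and the 50-pound cutoff between "weak" and "strong" are our assumptions, not data from any study. The point is only that the actual number can always be turned into a ranking or a category, but never the other way around.

```python
# Hypothetical data: maximum weight lifted (in pounds) by five participants.
weight_lifted = {"Ana": 95, "Ben": 40, "Cal": 60, "Dee": 42, "Eli": 120}

# Highest level available here: the actual number each person can lift.
# From it we can always derive the coarser levels, but never the reverse.

# Ordinal: rank from strongest (1) to weakest (5); the size of the gaps is lost.
ordered = sorted(weight_lifted, key=weight_lifted.get, reverse=True)
rank = {name: position for position, name in enumerate(ordered, start=1)}

# Two categories, "strong" vs. "weak", using an arbitrary, made-up cutoff.
CUTOFF = 50
category = {name: ("strong" if pounds >= CUTOFF else "weak")
            for name, pounds in weight_lifted.items()}

for name in ordered:
    print(f"{name}: {weight_lifted[name]} lb, rank {rank[name]}, {category[name]}")
```

Notice what gets lost along the way. Dee and Ben differ by 2 pounds and Eli and Ana differ by 25, yet each pair is exactly one rank apart, and the weak/strong labels hide even that.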

THE FINAL WORD(S) ON LEVELS OF MEASUREMENT

Okay, so we have the four levels of measurement (three of which are very commonly used). What can we say about all of them? Here are at least five things:

1. What we measure, be it a score on a test of intelligence, the number correct on a chemistry final exam, or your feelings about peanut M&Ms, belongs to one of these four levels of measurement. The key, of course, is how finely and precisely the variable is being measured.
2. The qualities of one level of measurement (such as nominal) are characteristic of the next level up as well. In other words, variables measured at the ordinal level also contain the qualities of variables measured at the nominal level. Likewise, variables measured at the interval level contain the qualities of variables measured at both the nominal and ordinal levels. For example, if you know that Mateo swims faster than Laurie, that’s great information (and it’s ordinal in nature). But if you know that Mateo swims 7.6 seconds faster than Laurie, that’s interval-level information, which is even better.
3. The more precise (and higher) the level of measurement (with ratio being highest), the closer you’ll get to measuring the true outcome of interest.
4. How you choose to measure an outcome defines the outcome’s level of measurement. That decision might be based on theory, the realities of how the variable manifests itself in the world, or just based on a choice you make as the researcher or test developer.
5. Many researchers take some liberty in treating ordinal variables (such as scores on a personality test) as interval-level variables, and that might be fine, especially if the scores distribute themselves in ways similar to how interval-level scores do. See Chapter 5 for a discussion of the statistical issues involved.

To make matters even more complicated, even if scales (such as those of intelligence) are interval-level measures, does one assume that the five-point difference between a score of 100 and a score of 105 is equivalent to a five-point difference between a score of 125 and 130? An interval-level scale would lead us to believe that, but nope, that’s not always the case. Moving from a score of 100 to 105 (around average) is not anywhere near the change that is represented by going from a score of 125 to 130 (very high scores).

These four levels of measurement are not carved in stone. We might contend that most measures in the social and behavioral sciences fall at the ordinal or interval level; in practice, however, we surely act as though many (if not all) occur at the interval level when they actually probably occur at the ordinal level. There are moderately complex statistical issues when treating ordinal variables as interval level (which are beyond the scope of this book), but it turns out that in most research situations it is probably okay, especially when a total score from a test is used. (Our statistician colleagues cringe a little when we say this sort of thing is “okay,” but just a little.) No matter what position one takes, though, all agree that one should try to measure at the highest level possible.

Summary

We just did oodles of good work. (By the way, we measure oodles on an ordinal scale.) Now that we understand what levels of measurement are and how they work, we will turn our attention to the first of two very important topics in the study of tests and measurement: reliability. And that discussion comes in the next chapter.


Time to Practice

1. Why are levels of measurement useful?
2. Provide an example of how a variable can be measured at more than one level of measurement.
3. For the following variables, define at what level of measurement each one occurs and tell why.
   a. Hair color (for example, red, brown, or blonde).
   b. IQ score (for example, 110 or 143).
   c. Average number of Volvos owned by each family in Kansas City, Missouri (for example, two or three).
   d. The number correct on a third-grade math test out of 20 possible correct (for example, 17, 19, or 20).
   e. Time running 100 yards (for example, 15 seconds or 12.5 seconds).
4. Access the library database and select three journal articles that include empirical studies. Be sure that you select these articles from your own discipline. Now, for each one, answer the following questions:
   a. What is the primary variable of interest, or what variable is being studied?
   b. How is it being measured?
   c. At what level of measurement is it being measured?
   d. How precise do you think the measurement is?
5. Why does the interval level of measurement provide more information than the nominal level of measurement, and why would you want to use the interval level of measurement if you have a choice?
6. Select five variables that are important to measure in your area of expertise, and identify how they can be measured at the nominal, ordinal, or interval level of measurement.
7. Describe how you could measure temperature with each of the four levels of measurement.
8. Here’s a mind bender: Which level of measurement are we using when we say that the nominal level of measurement is the least useful, followed by the ordinal level, the interval level, and then the ratio level as the most useful level of measurement?
9. Come up with a research question of interest to you. What is the highest level of measurement you could use to measure each variable, and how would you do so?

Want to Know More?

Further Readings

• Laird, S. P., Snyder, C. R., Rapoff, M. A., & Green, S. (2004). Measuring private prayer: Development, validation, and clinical application of the Multidimensional Prayer Inventory. International Journal for the Psychology of Religion, 14(4), 251–272.

How would you measure prayer? Quantitative and qualitative aspects of prayer were assessed, and results revealed five distinct types of prayer. This study was a very interesting use of measurement.

• McHugh, M. L., & Hudson-Barr, D. (Eds.). (2003). Descriptive statistics, Part II: Most commonly used descriptive statistics. Journal for Specialists in Pediatric Nursing, 8, 111–116.

These authors support what you have already read here: that data from different levels of measurement require different statistical measures. And to use descriptive statistics in the best manner, it is important to know what measurement levels should be used with the statistics and what information the statistics can provide.

And on Some Interesting Websites

• Learn all about counting systems (also a kind of measurement level) at http://galileoandeinstein.physics.virginia.edu/lectures/babylon.html.

• And of course, YouTube to the rescue. The short video from Rice University at https://www.youtube.com/watch?v=B0ABvLa_u88 provides a nice overview of the four levels of measurement and their usefulness.

• Not the entire article (you can easily get that online through your own institution), but very cool nonetheless: read the first page of S. S. Stevens’s cornerstone article on scales of measurement at http://www.jstor.org/stable/1671815?origin=JSTOR-pdf&seq=1#page_scan_tab_contents, and read more to be extra informed on this very critical topic.

And in the Real Testing World

Real World 1
Professor Granberg-Rademacker from the University of Minnesota in Mankato designed a technique that helps make the ordinal level of measurement behave more like an interval- or ratio-level measure, increasing the scale’s precision and possibly its accuracy.
Want to know more? Granberg-Rademacker, J. S. (2010). An algorithm for converting ordinal scale measurement data to interval/ratio scale. Educational and Psychological Measurement, 70, 74–90.

Real World 2
This study of news and how it is covered and reported examined the coverage of the pre–Iraq War debate in the New York Times and Washington Post. Professor Groshek cleverly looked at how coverage declined across all source types and levels of measurement after the congressional resolution to go to war.
Want to know more? Groshek, J. (2008). Coverage of the pre–Iraq War debate as a case study of frame indexing. Media, War & Conflict, 1(3), 315–338.

Real World 3
Everywhere you look today, there’s something about health and activity levels and even more about collecting data on our own activities. These researchers examined the relationship between self-reported physical activity and pedometer steps with health-related quality of life in 1,296 older adults. They found that members of the high-step group had significantly higher scores on mental health, physical health, and global health than did the low-step group. Also, older adults had significantly higher health-related quality-of-life indices. What is so interesting about this study for our purposes is that much of the information relied on the reporting of the participants as far as steps walked (they reported their steps each day).
Want to know more? Vallance, J., Eurich, D., Gardiner, P., Taylor, L., & Johnson, S. (2016). Associations of daily pedometer steps and self-reported physical activity with health-related quality of life. Journal of Aging and Health, 28, 661–674.


3 RELIABILITY AND ITS IMPORTANCE
Getting It Right Every Time

Difficulty Index ☺ ☺ (tougher than most and pretty long)

LEARNING OBJECTIVES

After reading this chapter, you should be able to

• Explain that the observed score one gets on a test is made up of a true score and an error score.
• Define reliability in the context of Classical Test Theory.
• Compute and interpret correlation coefficients.
• Differentiate between four types of reliability.
• Interpret the size of reliability coefficients.
• Apply strategies for producing reliable tests when you can’t calculate a reliability coefficient.

A good test is reliable and valid, and this chapter and Chapter 4 talk about what that means. In this chapter, we start with reliability. Reliability is pretty easy to figure out, even though we have given this chapter a tougher than most difficulty designation. Students, and even instructors, sometimes have trouble being
comfortable with what reliability really means because it is often defined in a way that sounds a lot like validity. We have cleverly decided to cover reliability first so we can avoid that confusion. (And for your own good, we won’t even tell you what validity means just now!) Reliability simply refers to how much randomness there is in a bunch of scores. Imagine a single person taking a test and then immediately taking that same test again. Would they get the exact same score the second time? Intuitively, you might say, no, but the second score would be pretty close to the first score. Right, because we know in real life there are all sorts of random variables bouncing against us when we take a quiz or respond to an attitude survey that affect how we respond. You know, sometimes you have to guess an answer on a multiple-choice test and sometimes you guess right and sometimes you guess wrong. That’s what we mean by random. If you are asked your feelings about tofu by a marketing researcher, you might have had a good day and love everything in the world including tofu. But if you tripped over your cat on the way to grabbing your phone, you might feel negatively toward tofu and everything else! That’s more randomness. Good test developers want their scores not to be random. They want their measures to be reliable. Remember your intuitive thinking about reliability, that two scores from the same measure taken by the same person will be different, but they should be close. To understand reliability, we focus on how close those two scores are. Reliability is all about the consistency of scores if people took the same test twice. As an example, personality traits are believed to be fairly stable over a long period of time. In other words, if you are an introvert today, you’re still likely to be an introvert six months from now. If you administer a test of introversion to a group of people, their scores will vary. There will be a bunch of different scores. 
That is what we would expect with a good test because people have different levels of introversion. So, some of that variability is true (measurement folks think of that part of the variability as systematic). Some of that variability, though, is only due to randomness. We talk about people taking a test twice and comparing their pairs of scores because that is a way of seeing whether there is randomness in the variability of scores. High consistency means low randomness.

To actually quantify reliability, to “put a number on it” (to paraphrase Beyoncé), researchers and test developers find ways to get a pair of scores for a bunch of individuals from the same test. It turns out, there are several ways to get two scores from the same test. The obvious way is to do a test–retest reliability study just like what we have been using as an example. Get a bunch of people. Give them your test. And then give it to them again. Compare the scores.

A second way, for measures where there is subjective human judgment involved in assigning scores, is to have two people or “raters” score the same person. Then, compare their two scores.

Cindy Woods, in her dissertation work, demonstrated the importance of interrater reliability. She performed an analysis of 18 studies where she examined parent–child coercive interactions that involved parenting practices and punishment and the impact of those interactions on the development of boys’ aggression. Aggression in children, the primary variable of interest, is often measured through observation and sometimes two different observers score the same child. Before she could reach conclusions about what leads to aggression, Woods first had to verify that there was good interrater reliability for the subjective nature of the observational measures that were used. It turned out that there was high interrater reliability, so she could analyze her data. She found that unskilled parenting had the most significant effect on the later development of aggression in boys.

Want to know more? Take a look at Woods, C. (2004). Unskilled parenting practices and their effect on the development of aggression in boys: A meta-analysis of coercion theory based studies. Dissertation Abstracts International: Section B: The Sciences and Engineering, 64(11B), 5808.

We’ll get to the other forms of reliability and how they are computed, but first a bit about how reliability works and the beautiful simplicity of the measurement theory behind it.

TEST SCORES: TRUTH OR DARE What really is so cool about the whole notion of reliability is that it is based on the separation of the different components of what makes up any test score. When you take a test in this class, you may get a score such as 89 (good for you) or 65 (back to the books!). That test score consists of several different elements: the observed score (the score you actually get on the test, such as 89 or 65), a true score (what your typical score would be on the test), and an error score (how big a gap there is between the observed score and true score). We can see your observed score. There it is posted online or at the top of your test paper. But we can’t see your true score. The true score is your typical score, which means it is the average score you would get if you took the same test an infinite number of times. Well, test developers and researchers don’t really have the time or money to give a group of people the same test an infinite number of times, right? Or even to do it a hundred times. Or five times. But they can maybe give the same test two times. . . . See where this is headed? We are starting to get to the logic of using test–retest designs to examine reliability. They are a way of estimating true scores! And by guessing at the true score and knowing for sure what the observed score is, we can calculate the error score. The smaller that is, the more reliable a measure is. Why aren’t true and observed scores equal to each other? Well, they would be if the test produced scores without any randomness at all. Humans, though, aren’t perfect. We aren’t consistent. It’s almost impossible to predict what we are going to do. That’s one reason why the sciences of psychology and education are not as advanced as, say, the sciences of physics or astronomy. We can tell you exactly when a comet will pass overhead, but we can’t tell you the exact score you will get on a test, even if it’s the second time you take it!


Randomness in measurement is a fact of life. The Yankees don’t always win, the bread only mostly falls on the buttered side, and Murphy’s Law tells us that the world is not yet perfect. So what you see as an observed score may come close to the true score, but rarely (almost never) are they the same. The difference, as you will see here, is in the amount of error that is introduced. Error? Yes—in all its glory. For example, let’s suppose for a moment that someone gets an 89 on their tests and measurement final, but the true score (which we never really know) is 80. That means that the 9-point difference (the amount of error) is due to randomness. Error is the difference between one’s observed score and one’s theoretical true score. What might be the source of such error? Well, perhaps the room in which the test is taken is so warm that it’s easy for the student to fall asleep. That would certainly have an impact on a test score. Or perhaps the student didn’t study for the test as much as they sometimes do. Ditto. Or maybe the student just isn’t in the mood that day to undertake a four-hour final. These are all sources of error that can contribute to the unreliability of an instrument, because these sources mask the true performance or true score, which would be measured if these sources of random error were not present. Nothing about this tests and measurement stuff is clear-cut, and this true score stuff can sometimes make your head hurt. Here’s why: We just defined true score as the average score you would get if you took the same test an infinite number of times. Notice that true score has nothing to do with whether the construct of interest is really being reflected. The true score only represents the theoretical typical level of performance on a given test. Of course, one would hope that the typical level of performance would also indicate the level of whatever the test is supposed to measure, but that’s another question (spoiler alert: see Chapter 4). 
The distinction here is that a test is reliable if it consistently produces whatever score a person would get on average, regardless of what the test is measuring. In fact, a perfectly reliable test might not produce a score that has anything to do with the construct of interest, such as how introverted you are, how much you know, whether you can drive, or how much you weigh. Another way you might want to think of reliability is in terms of precision or how accurate the testing process is in its intent to hit the target. The more precise (the more accurate in assessing a behavior from time to time, for example), the more reliable.
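The idea that a true score is the average you would get over an endless run of retakes is easy to play with in a small simulation. The Python sketch below is only an illustration under assumptions we made up: a true score of 80 (as in the example above), random error that follows a bell curve with a spread of 5 points, and 10,000 retakes standing in for "infinite."

```python
import random
import statistics

random.seed(1)  # fixed seed so the sketch is repeatable

TRUE_SCORE = 80   # made-up value; in real life we never get to see it
ERROR_SD = 5      # made-up amount of random error, in test-score points

def observed_score() -> float:
    """One test administration: observed score = true score + random error."""
    return TRUE_SCORE + random.gauss(0, ERROR_SD)

two_tries = [observed_score() for _ in range(2)]
many_tries = [observed_score() for _ in range(10_000)]

print("Two administrations:", [round(s, 1) for s in two_tries])
print("Mean of 10,000 administrations:", round(statistics.mean(many_tries), 2))
```

Any single observed score can miss the true score by several points, but the average over many administrations settles very close to 80. Nothing in the simulation says what the 80 actually measures, which is exactly the distinction between reliability and validity.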

GETTING CONCEPTUAL The less error (of any kind from any source), the more reliability—it’s that simple.


So what we know up to this point is that the score we observe (the result of, let’s say, a spelling test) is composed of an individual’s true score (the score they would normally get theoretically) and something we call an error score. The formula shown here gives you an idea as to how these two relate to each other:

Observed score = True score + Error score

There is a whole theory based on this simple equation, Classical Test Theory.

Now let’s take a moment and go one step further. The error part of this simple equation consists of two types of error, one called trait error and one called method error, both of which (once again) contribute to differences between the true and observed score, right? Trait errors are those sources of error that reside within the individual taking the test (such as “I didn’t study enough this time,” “I’ve just fallen in love,” or “I forgot to set the alarm”). Method errors are those sources of error that reside in the testing situation (such as computer glitches, too-warm room, or missing pages). If we expand the earlier simple equation and show you what you just read, we get this equation:

Observed score = True score + (Trait error score + Method error score)

This concept of reliability, as defined by this equation, is really both simple and profound. Look what happens as different values change in this formula. As the error component (error score) gets smaller, what happens to the reliability value? It gets larger. And in the most perfect of all worlds, what happens if there is no error score at all? Voila! Reliability is perfect, because the observed score is equal to the individual’s true score. Our job is to reduce those sources of error as much as possible by, for example, having good test-taking conditions and making sure you are encouraged to get enough sleep. Reduce the error and you increase the reliability, because the observed score more closely matches the true score.

In more technical terms, reliability goes something like this: Scores on repeated testings tend to vary. What the concept of reliability allows us to do is understand which proportion of the variation in test scores is due to actual changes in performance or behavior and which is due to error variance. Reducing that error variance is what makes a test more reliable. To move a bit more deeply into theory, reliability really refers to the proportion of the variance in a set of observed scores that is true score variance. So psychometricians (measurement experts) write the equation like this:

Reliability = True score variance / Observed score variance
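The variance version of reliability can be tried out with simulated scores. In the Python sketch below, all of the numbers are assumptions chosen for illustration: 5,000 imaginary test takers, true scores centered on 75 with a spread of 10 points, and a single random error term that lumps trait and method error together. The function name is ours as well.

```python
import random
import statistics

random.seed(2)  # fixed seed so the sketch is repeatable

def simulated_reliability(error_sd: float, n_people: int = 5_000) -> float:
    """Return true-score variance divided by observed-score variance."""
    # Made-up population: true scores centered on 75 with a spread of 10 points.
    true_scores = [random.gauss(75, 10) for _ in range(n_people)]
    # Observed score = true score + random error (trait and method error lumped together).
    observed = [t + random.gauss(0, error_sd) for t in true_scores]
    return statistics.pvariance(true_scores) / statistics.pvariance(observed)

for sd in (10, 5, 1, 0):
    print(f"error SD = {sd:>2}: reliability is about {simulated_reliability(sd):.2f}")
```

With these made-up settings, the ratio should land roughly at .50, .80, and .99 as the error shrinks, and at exactly 1.00 when there is no error at all. In real testing we never see the true scores, which is why reliability has to be estimated in other ways, such as the correlation between two sets of scores.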


We just gave you (more or less) the classical approach or model of reliability, but there’s another one you should at least know about that has to do with Item Response Theory, or IRT. IRT is based on the idea that the likelihood of a correct response is a (complex) function of the person taking the test and the characteristics of the items (and all this can get very mathematically intriguing ☺). You’ll get this material in greater depth in Chapter 6.

PUTTING A NUMBER ON RELIABILITY

Think about it: If reliability reflects the consistency of test scores when the test is taken twice (or scored twice by two different people), then there should be a correlation between the two sets of scores. A very common statistic to reflect the correlation between two sets of scores is the correlation coefficient, and, in practice, reliability is often quantified through the computation of a correlation coefficient. (And if you already know all about correlations, you can skip this whole section!)

A correlation coefficient is a numerical index that tells us how closely two variables (such as two administrations of the same test) are related to each other or, more accurately, how much the variables share or have in common. For reliability purposes, correlation coefficients tend to range between .00 and +1.00. The higher the (positive) number, the more reliable the test.

Computing a Simple Correlation Coefficient

The computational formula for what is technically called the Pearson product-moment correlation coefficient between a variable labeled X and a variable labeled Y is shown here:

\[
r_{xy} = \frac{n\sum XY - \sum X \sum Y}{\sqrt{\left[n\sum X^{2} - \left(\sum X\right)^{2}\right]\left[n\sum Y^{2} - \left(\sum Y\right)^{2}\right]}}
\]

where

r_xy = the correlation coefficient between X and Y
n = the size of the sample
X = the individual’s score on the X variable
Y = the individual’s score on the Y variable
XY = the product of each X score times its corresponding Y score
X² = the individual X score, squared
Y² = the individual Y score, squared


Though this chapter focuses on reliability, in the social science research world, you are more likely to see correlations that show two different variables and how strongly they are related, so our first example involves that sort of situation. Here are some test scores (a screening test for language skills in young children is the X variable, and a screening test for physical skills in young children is the Y variable) we will use in this example:

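 | Language Skills (X) | Physical Skills (Y) | X² | Y² | XY
 | 2 | 3 | 4 | 9 | 6
 | 4 | 2 | 16 | 4 | 8
 | 5 | 6 | 25 | 36 | 30
 | 6 | 5 | 36 | 25 | 30
 | 4 | 3 | 16 | 9 | 12
 | 7 | 6 | 49 | 36 | 42
 | 8 | 5 | 64 | 25 | 40
 | 5 | 4 | 25 | 16 | 20
 | 6 | 4 | 36 | 16 | 24
 | 7 | 5 | 49 | 25 | 35
Sum or ∑ | 54 | 43 | 320 | 201 | 247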

Before we plug in the numbers, let’s make sure you understand what each one represents.

• ∑X, or the sum of all the X values, is 54.
• ∑Y, or the sum of all the Y values, is 43.
• ∑X², or the sum of each X value squared, is 320.
• ∑Y², or the sum of each Y value squared, is 201.
• ∑XY, or the sum of the products of X and Y, is 247.

Here are the steps in computing the correlation coefficient:

1. List the two values for each participant. You should do this in a column format so as not to get confused.
2. Compute the sum of all the X values, and compute the sum of all the Y values.
3. Square each of the X values, and square each of the Y values.
4. Find the sum of the XY products.

You can see the answer here:

\[
r_{xy} = \frac{(10 \times 247) - (54 \times 43)}{\sqrt{\left[(10 \times 320) - 54^{2}\right]\left[(10 \times 201) - 43^{2}\right]}} = \frac{148}{213.83} = .692
\]
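If you would rather let a computer do the arithmetic, the computational formula translates almost line for line into code. The Python sketch below is one way to write it, not the only way. The function and variable names are ours, and the data are the ten pairs of language and physical skills scores from the example.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson product-moment correlation via the computational formula."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_x2 = sum(v * v for v in x)               # sum of each X score squared
    sum_y2 = sum(v * v for v in y)               # sum of each Y score squared
    sum_xy = sum(a * b for a, b in zip(x, y))    # sum of the XY products
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator

language = [2, 4, 5, 6, 4, 7, 8, 5, 6, 7]   # X
physical = [3, 2, 6, 5, 3, 6, 5, 4, 4, 5]   # Y

print(round(pearson_r(language, physical), 3))  # 0.692
```

The intermediate sums the function builds (54, 43, 320, 201, and 247) are the same ones that appear in the hand calculation, so the result matches.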

The correlation between scores on the language screening test and scores on the physical skills screening test is .692.

The Pluses and Minuses of Correlations. You may notice we just said that reliability coefficients range from .00 to +1.00, but know that correlation coefficients (and of course a reliability coefficient is just another form of a correlation coefficient) can range from –1.00 to +1.00. What’s up, and why the discrepancy? Easy. No test can have a modicum of reliability with a coefficient less than .00, so we just dispense with that idea and determine reliability coefficients to be worth considering only when they are positive. In real life, if you compute reliability on software like Excel or SPSS or R and get a negative coefficient, you have probably entered the data wrong. 😊

For an example using reliability, let’s look at the following set of two scores on a 10-item achievement test that is given to 15 adults in September and given again 30 days later in October to the same 15 adults. We have two scores for each adult. We always must have two scores per individual to compute correlations. If this test is reliable, we expect that these two sets of scores will be very similar and that there will be a high correlation between them. Well, it turns out that, using our formula, the correlation between the score from testing in September and the score from testing in October is .90, certainly high enough for us to conclude that this test is reliable.

ID | September Testing | October Testing
1 | 78 | 79
2 | 65 | 78
3 | 65 | 66
4 | 78 | 80
5 | 89 | 78
6 | 99 | 94
7 | 93 | 95
8 | 75 | 78
9 | 69 | 72
10 | 87 | 82
11 | 45 | 49
12 | 66 | 68
13 | 87 | 81
14 | 85 | 87
15 | 78 | 69
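The .90 reported for the September and October scores can be checked with a few lines of Python. This is a sketch rather than a required method. The function name is ours, and it uses the deviation-from-the-mean form of the Pearson formula, which gives the same answer as the computational formula.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation, computed from deviations around each mean."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cross / sqrt(ss_x * ss_y)

# Scores for the same 15 adults, listed in ID order.
september = [78, 65, 65, 78, 89, 99, 93, 75, 69, 87, 45, 66, 87, 85, 78]
october   = [79, 78, 66, 80, 78, 94, 95, 78, 72, 82, 49, 68, 81, 87, 69]

r = pearson_r(september, october)
print(f"Test-retest reliability: {r:.2f}")
```

The order of the two lists matters. Each September score has to line up with the October score from the same person, which is why we always need two scores per individual.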

Types of Correlation Coefficients

A correlation reflects the dynamic quality of the relationship between variables. In doing so, it allows us to understand whether variables tend to move in the same or opposite directions when they change. If variables change in the same direction, the correlation is called a direct correlation or a positive correlation. If variables change in opposite directions, the correlation is called an indirect correlation or a negative correlation. Table 3.1 shows a summary of these relationships.

TABLE 3.1  Types of Correlations and the Corresponding Relationship Between Variables

What Happens to Variable X | What Happens to Variable Y | Type of Correlation | Value | Example
X increases in value | Y increases in value | Direct or positive | Positive, ranging from .00 to +1.00 | The more time you spend studying, the higher your test score will be.
X decreases in value | Y decreases in value | Direct or positive | Positive, ranging from .00 to +1.00 | The less money you put in the bank, the less interest you will earn.
X increases in value | Y decreases in value | Indirect or negative | Negative, ranging from −1.00 to .00 | The more you exercise, the less you will weigh.
X decreases in value | Y increases in value | Indirect or negative | Negative, ranging from −1.00 to .00 | The less time you take to complete a test, the more questions you’ll get wrong.

One way to interpret the strength of correlations is to use the following rules of thumb.

| Size of the Correlation Coefficient | General Interpretation |
|---|---|
| .8 to 1.0 | Very strong relationship |
| .6 to .8 | Strong relationship |
| .4 to .6 | Moderate relationship |
| .2 to .4 | Weak relationship |
| 0 to .2 | Weak or no relationship |

So if the correlation between two variables is .5, you can safely conclude that the relationship is a moderate one: not strong but certainly not weak enough to say that the variables in question don’t share anything in common.

The interpretations shown here about whether correlations are big or small are based on the correlations seen by measurement people every day, with their large reliability coefficients and pretty big correlations among similar tests. Psychologists and other social scientists, though, are used to dealing with correlations among variables, like personality traits, and they tend to be much smaller in the real world. These scientists use a different scale when putting words around the size of correlations, treating .10 as small or weak, .30 as medium or moderate, and .50 as large or strong.

Things to Remember About Correlations

• A correlation can range in value from –1 to +1.
• The absolute value of the coefficient reflects the strength of the correlation. So a correlation of −.70 is stronger than a correlation of +.50. A frequent mistake is to assume that a direct or positive correlation is always stronger (i.e., “better”) than an indirect or negative correlation because of the sign and nothing else.
• A correlation always reflects the situation where there are at least two data points (or variables) per case.
• Another easy mistake is to assign a value judgment to the sign of the correlation. Many students assume that a negative relationship is not good and a positive one is good. That’s why, instead of using the terms negative and positive, the terms indirect and direct communicate meaning more clearly.
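The measurement-oriented rules of thumb, together with the point that strength depends on the absolute value and not the sign, can be sketched as a small Python function. The name `describe_strength` is hypothetical, and because the bands share their endpoints (.8 appears in two of them), assigning a boundary value to the stronger band is our assumption, not a rule from the text.

```python
def describe_strength(r):
    """Label a correlation using the measurement-oriented rules of thumb."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("a correlation must lie between -1 and +1")
    size = abs(r)  # strength ignores the sign of the coefficient
    # Assumption: a boundary value (e.g., exactly .8) goes to the stronger band
    if size >= 0.8:
        return "very strong"
    if size >= 0.6:
        return "strong"
    if size >= 0.4:
        return "moderate"
    if size >= 0.2:
        return "weak"
    return "weak or none"

for value in (0.5, -0.70, 0.90, 0.1):
    print(f"{value:+.2f}: {describe_strength(value)}")
```

Note that −.70 is labeled as stronger than +.50, which is the point of the second bullet above.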

DIFFERENT FLAVORS OF RELIABILITY

Reliability can be computed in many different ways, and we’ll cover the four most important and most often used in this section. They are all summarized in Table 3.2.

Test–Retest Reliability

The first kind of reliability we’ll talk about, test–retest reliability, is used when you want to examine whether a test is reliable over time. This type of reliability is sometimes called stability.

TABLE 3.2 Different Types of Reliability, When They Are Used, How They Are Computed, and What They Mean

| Type of Reliability | When You Use It | How Do You Do It? | What Can You Say When You’re Done? |
|---|---|---|---|
| Test–retest reliability | When you want to know whether a test is consistent over time | Correlate the scores from a test given at Time 1 with the same test given at Time 2. | The Bonzo test of identity formation for adolescents is reliable over time. |
| Interrater reliability | When you want to know whether there is consistency between different human scorers in the rating of some outcome | Examine the percentage of agreement between raters. Sometimes interrater reliability is estimated using a correlation coefficient between the two scorers. | The interrater reliability for the best-dressed Foosball player judging was 91% agreement, indicating a high degree of consistency between judges. |
| Parallel forms reliability | When you want to know if different forms of a test give equivalent scores | Correlate the scores from one form of the test with scores from a second form of the test. | The two forms of the SAT provided similar scores in our study. They show parallel forms reliability. |
| Internal consistency reliability | When you want to know if performance on one part of the test is similar to performance on another part of the test | The old school way is to correlate scores on the first half of a test with scores on the second half. This is called split-half reliability. Ever since computer use became widespread, a more precise method is used. Cronbach’s alpha essentially correlates every item with every other item on a test and gives an average correlation among all parts of a test. | All the items on the SMART Test of Creativity correlate well with each other, producing a large Cronbach’s alpha of .88, suggesting high internal consistency. |

There are always different names used for the same topic or procedure in science, and it’s not any different with tests and measurement. You may see test–retest reliability called time sampling, because the samples of scores are taken at more than one point in time.

For example, let’s say that you want to evaluate the reliability of a test that will examine preferences for different types of vocational programs. You may administer the test once and then readminister the same test (and it’s important that it be the same test) again a week later. Then, the two sets of scores (remember, the same people took it twice) are correlated, and you have a measure of reliability. Test–retest reliability is often calculated and reported when researchers are examining differences or changes over time or after an intervention. You must be very confident that what you are measuring is being measured in a reliable way such that the results you are getting come as close as possible to the individual’s true score each and every time.

Here are some scores from tests at Time 1 and Time 2 for the MVET (Mastering Vocational Education Test) under development.

| ID | Score From Test 1 (Time 1) | Score From Test 2 (Time 2) |
|---|---|---|
| 1 | 54 | 56 |
| 2 | 67 | 77 |
| 3 | 67 | 87 |
| 4 | 83 | 89 |
| 5 | 87 | 89 |
| 6 | 89 | 90 |
| 7 | 84 | 87 |
| 8 | 90 | 92 |
| 9 | 98 | 99 |
| 10 | 65 | 76 |

To estimate reliability, we compute a correlation coefficient and get .89, and you know enough by now to interpret that.

Oops! The Problems with Test–Retest Reliability

You might have thought about these shortcomings already. The biggest criticism of test–retest reliability is that when you administer the same test in succession, you run the risk of practice effects (also called carryover effects). This occurs when the first testing influences the second. In other words, after the first testing, the test takers may remember the questions, ideas, concepts, and so on, and that may have an impact on the second testing and their scores.

Another problem might be with the interaction between the amount of time between tests and the nature of the sample being tested. For example, suppose you are working with an instrument that assesses some aspect of growth and development in young children. Because individual differences at young ages are so profound, waiting 3 or 6 months to retest motor skills might result in an inaccurate correlation, not because the test is unreliable but because dramatic changes in behavior occur at that age over that period of time and they may not occur in a systematic fashion in the entire group of children. It’s like trying to hit a moving target, and indeed, if the change is that rapid (and if there is that much variability among those being tested), there may be no way to establish test–retest reliability.

Interrater Reliability Most types of reliability have to do with the reliability of the scores due to the nature of the instrument, but earlier we saw an example of a type of reliability that is based on the nature of the scoring rules and how humans follow those rules when assigning scores. When the scoring is objective (concrete decision rules assign scores, like with a multiple-choice test; a computer could do it), randomness doesn’t play a role in the scoring, but when scores are subjective (human judgment matters), we are concerned about interrater reliability, where we want to know how much two different scorers might agree on their judgments of some outcome. For example, let’s say you are interested in a particular type of social interaction during a transaction between a banker and a potential checking account customer. You observe both people in real time (you’re hiding behind a potted plant and wearing a fake mustache) to see if the new and improved customer relations course that the banker took resulted in increased smiling and pleasant types of behavior toward the potential customer. The degree of randomness in your scoring of the observations is important for when you do analysis for your research, so you build in an interrater reliability study as part of the project. Your job is to note every 10 seconds if the banker is demonstrating one of the three different behaviors she has been taught—smiling, leaning forward in her chair, or using her hands to make a point. Each time you see any one of those behaviors, you mark it on your scoring sheet as a slash (/). If you observe nothing, you record a dash, like this: —. As part of this process, and to be sure that what you are recording is reliable, you will want to find out what the level of agreement is between different observers as to the occurrence of these behaviors. The more similar the ratings are, the higher the level of interrater agreement and interrater reliability. 
So, you ask a friend to hide with you and observe independently, without knowing how you scored. In this example, the really important variable is whether or not any one of the three customer-friendly acts occurred within a set of 10-second time frames across 2 minutes (or twelve 10-second periods). So what we are looking at is the rating consistency across a 2-minute time period broken down into twelve 10-second periods. A slash on the scoring sheet means that the behavior occurred, and a dash means it did not.

| Time Period → | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Rater 1 (You) | / | — | / | / | / | — | / | / | — | — | / | / |
| Rater 2 (Your friend) | / | — | / | / | / | — | / | / | — | / | — | / |

For a total of 12 periods (with 12 opportunities to agree or disagree), there are 7 where you and your friend agreed that the banker did do the customer-friendly thing (Periods 1, 3, 4, 5, 7, 8, and 12), and 3 where you agreed she did not (Periods 2, 6, and 9), for a total of 10 agreements and 2 disagreements. Interrater reliability is computed using the following simple formula.

$$\text{Interrater reliability} = \frac{\text{Number of agreements}}{\text{Number of possible agreements}}$$

And when we plug in the numbers as you see here,

$$\text{Interrater reliability} = \frac{10}{12} = .833$$

the resulting interrater reliability coefficient is .83. This value would be described as “the two raters agreed 83% of the time.” Notice in the case of interrater reliability, the coefficient we use is an actual proportion. This is not the case with the correlations used to describe other types of reliability. Even though they may look like proportions, they are not.
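The agreement count can be checked with a short Python sketch. The two lists below transcribe the twelve observation periods for the two raters (True for a slash, False for a dash); the function name `percent_agreement` and the variable names are ours, chosen only for illustration.

```python
# True = behavior recorded (slash), False = nothing recorded (dash)
you = [True, False, True, True, True, False,
       True, True, False, False, True, True]
your_friend = [True, False, True, True, True, False,
               True, True, False, True, False, True]

def percent_agreement(rater_a, rater_b):
    """Proportion of observation periods on which two raters agree."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must score the same, nonzero number of periods")
    # An agreement is any period where both raters made the same mark
    agreements = sum(a == b for a, b in zip(rater_a, rater_b))
    return agreements / len(rater_a)

agreement = percent_agreement(you, your_friend)
print(f"Interrater agreement: {agreement:.3f}")
```

Agreements include the periods where both raters recorded nothing, which is why Periods 2, 6, and 9 count toward the total of 10.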

Parallel Forms Reliability

Parallel forms reliability is used when you want to examine the equivalence or similarity between two different forms of the same test. You already have seen how test–retest reliability can appear under a different name (stability or time sampling); well, it’s the same with parallel forms reliability. You may see parallel forms reliability called item sampling, because the samples of scores are taken using different sets of items.

For example, let’s say you are doing a study on memory, and part of the Remember Everything Test (RET) is to look at 10 different words, memorize them as best you can, and then recite them back after 20 seconds of study and 10 seconds of rest. As would any good scientist, you want to be sure the reliability of the RET is tested and reported as part of your research. Because this study takes place over a two-day period and involves some training of memory skills, you want to have another set of items that is equivalent in task demands, but it obviously cannot be the same as far as content (too easy to remember, right?). So you create another list of words that is similar to the first. In this example, you want the consistency to be high across forms; the same ideas are being tested, just using a different form.

Here are some scores from the RET in both Form A and Form B. Our goal is to compute a correlation coefficient as an estimate of the parallel forms reliability of the instrument.

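| ID | Scores From Form A of the RET | Scores From Form B of the RET |
|---|---|---|
| 1 | 4 | 5 |
| 2 | 5 | 6 |
| 3 | 3 | 5 |
| 4 | 6 | 6 |
| 5 | 7 | 7 |
| 6 | 5 | 6 |
| 7 | 6 | 7 |
| 8 | 4 | 8 |
| 9 | 3 | 7 |
| 10 | 3 | 7 |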

The first and last step in this process is to compute the correlation (which, by the way, is technically the Pearson product-moment correlation). In this example, it is equal to

$$r_{\text{Form A} \cdot \text{Form B}} = .12$$

The subscript to the r in the above (Form A⋅Form B) indicates that we are looking at the reliability of a test using different forms. We’ll get to the interpretation of this value shortly, but it seems pretty small.

Large commercial test development companies that make standardized tests, like the ACT and the SAT, use different forms all the time. If you are taking the ACT in New York, they don’t want you calling your buddy in California who will take the test on the same day and telling him what to study, so it is likely that he will get a different form than the one you are seeing. Rest assured that the ACT company (named ACT, Incorporated) has established parallel forms reliability for all its different test forms.

Internal Consistency Reliability

Internal consistency reliability is quite a bit different from the three previous types of reliability we have explored. Internal consistency is used when you want to know whether the items on a test are consistent with one another and likely measure the same thing. If all the items on a test don’t measure the same thing, why would you total all the responses together into a total score? Evidence of internal consistency reliability is necessary when you want to measure a variable or construct by asking a bunch of questions and combining the answers together.

Let’s say that you are developing the Attitude Toward Health Care Test (the ATHCT), and you want to make sure the set of 20 items (with individuals responding on a scale from 1 = strongly agree to 5 = strongly disagree) tap into the same construct (which you hope is attitude toward health care). You would look at the score for each item (for a group of test takers) and see if individual scores correlate with the total score. You would expect that people who scored high on certain items (e.g., “I like my HMO.”) would have also scored high on similar items (e.g., “I get good customer service from my HMO.”) and that this would be consistent across all the items on the test.

Split-Half Reliability

The first and classic way to establish internal consistency of a test is by “splitting” the test into two halves and computing what is affectionately called the split-half reliability coefficient. Here, the scores on one half of the test are compared with scores on the second half of the test to see if there is a strong relationship between the two. If so, then we can conclude that the test has internal consistency.

An easy way to estimate internal consistency of a test is through the use of split-half reliability. But remember to apply the Spearman–Brown correction (sometimes called the Spearman–Brown prophecy formula).

But like King Solomon, we have a decision to make here. How do we split the test? If it’s a 20-item test, do we take the first 10 items and correlate them with the last 10 items in the group? Or do we take every other item to form an odd group (such as Items 1, 3, 5, and 7) and an even group (such as Items 2, 4, 6, and 8)? It’s easier to do the first half–second half method but dangerous. Why? Because if items tend to be grouped (inadvertently) by subject matter or by difficulty, it is less likely that the groups of items will be deemed equal to each other. Potential trouble in paradise. And if it is a test that is tiring to take, like a long final in a college class, some people might not even get to the later items and score zeroes for those items regardless of their level of knowledge.

So for our purposes here (and maybe for your purposes there), it’s best to select all the odd items for one grouping and all the evens for another, and then turn to computing the correlation coefficient. Fifty such scores appear here. We can see this information:

• ID for each participant
• Total score on each test
• Total score on only the odd items
• Total score on only the even items

To compute the split-half reliability coefficient as an indicator of how well integrated or how internally consistent the test is, we simply correlate the score of each person on the odd half of their test with the score on the even half. The result? $r_{\text{odd} \cdot \text{even}} = .24$. Internally consistent? We’ll get there soon.

King Solomon Might Have Been Off by Half, or Correct Me If I Am Wrong

The big “Oops!” of computing split-half reliabilities is that, in effect, you cut the test in half, and because shorter tests are less reliable, the real degree of reliability is constrained. Spearman–Brown to the rescue! The Spearman–Brown formula makes that correction. It’s simple and straightforward:

$$r_t = \frac{2r_h}{1 + r_h}$$

where

$r_t$ = the corrected reliability estimate for the full-length test
$r_h$ = the half correlation (the simple Pearson product-moment correlation between the two halves)

| ID | Total | Score on Odd Items | Score on Even Items | ID | Total | Score on Odd Items | Score on Even Items |
|---|---|---|---|---|---|---|---|
| 1 | 43 | 21 | 22 | 26 | 14 | 6 | 8 |
| 2 | 45 | 18 | 27 | 27 | 36 | 18 | 18 |
| 3 | 43 | 21 | 22 | 28 | 43 | 25 | 18 |
| 4 | 46 | 23 | 23 | 29 | 23 | 12 | 11 |
| 5 | 32 | 15 | 17 | 30 | 44 | 22 | 22 |
| 6 | 34 | 15 | 19 | 31 | 47 | 23 | 24 |
| 7 | 21 | 10 | 11 | 32 | 46 | 21 | 25 |
| 8 | 27 | 13 | 14 | 33 | 37 | 19 | 18 |
| 9 | 43 | 23 | 20 | 34 | 32 | 12 | 20 |
| 10 | 36 | 18 | 18 | 35 | 38 | 14 | 24 |
| 11 | 48 | 25 | 23 | 36 | 48 | 25 | 23 |
| 12 | 42 | 27 | 15 | 37 | 41 | 33 | 8 |
| 13 | 31 | 17 | 14 | 38 | 21 | 13 | 8 |
| 14 | 33 | 16 | 17 | 39 | 46 | 21 | 25 |
| 15 | 31 | 15 | 16 | 40 | 43 | 15 | 28 |
| 16 | 45 | 23 | 22 | 41 | 44 | 22 | 22 |
| 17 | 41 | 22 | 19 | 42 | 41 | 23 | 18 |
| 18 | 43 | 31 | 12 | 43 | 23 | 11 | 12 |
| 19 | 46 | 15 | 31 | 44 | 26 | 11 | 15 |
| 20 | 31 | 16 | 15 | 45 | 28 | 11 | 17 |
| 21 | 43 | 18 | 25 | 46 | 32 | 14 | 18 |
| 22 | 42 | 15 | 27 | 47 | 31 | 15 | 16 |
| 23 | 31 | 15 | 16 | 48 | 45 | 17 | 28 |
| 24 | 44 | 15 | 29 | 49 | 50 | 29 | 21 |
| 25 | 50 | 27 | 23 | 50 | 12 | 7 | 5 |
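To make the odd–even split concrete, here is a minimal Python sketch that starts from item-level responses, builds the two half scores, and correlates them. The response matrix is hypothetical (six test takers, six right/wrong items) and is not the 50-person data set in the table; `pearson_r` is our own helper using the standard Pearson formula.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sum_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sum_xx = sum((a - mean_x) ** 2 for a in x)
    sum_yy = sum((b - mean_y) ** 2 for b in y)
    return sum_xy / sqrt(sum_xx * sum_yy)

# Hypothetical data: one row per test taker, one column per item (1 = correct)
responses = [
    [1, 1, 1, 1, 1, 1],
    [1, 1, 1, 0, 1, 1],
    [1, 0, 1, 1, 0, 1],
    [0, 1, 1, 0, 1, 0],
    [1, 0, 0, 1, 0, 0],
    [0, 0, 1, 0, 0, 0],
]

# Python counts from 0, so columns 0, 2, 4 are Items 1, 3, 5 (the odd items)
odd_scores = [sum(row[0::2]) for row in responses]
even_scores = [sum(row[1::2]) for row in responses]

r_half = pearson_r(odd_scores, even_scores)
print("Odd-item scores: ", odd_scores)
print("Even-item scores:", even_scores)
print(f"Uncorrected split-half correlation: {r_half:.2f}")
```

The value printed here is the uncorrected half correlation; it still needs the Spearman–Brown correction before it is reported as a reliability estimate.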

Here are the steps we take to compute the corrected split-half reliability estimate:

1. Compute the split-half correlation by either selecting every other item and calculating a score for each half or selecting the first and second halves of the test and calculating a score for each half.
2. Enter the values in the equation you see above and compute $r_t$, the corrected correlation coefficient.

For example, if you computed the split-half reliability coefficient as r = .73, then the corrected split-half coefficient would be

$$r_t = \frac{2(.73)}{1 + .73} = .84$$

That’s a pretty substantial increase. Take a look at this simple chart of split-half correlations before and after they are corrected.

| Original Split-Half Reliability | Corrected Split-Half Reliability | Difference in Reliability Coefficients |
|---|---|---|
| .10 | .18 | .08 (80%) |
| .20 | .33 | .13 (65%) |
| .30 | .46 | .16 (53%) |
| .40 | .57 | .17 (43%) |
| .50 | .67 | .17 (34%) |
| .60 | .75 | .15 (25%) |
| .70 | .82 | .12 (17%) |
| .80 | .89 | .09 (11%) |
| .90 | .95 | .05 (5%) |
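The corrected values in the chart can be reproduced with a few lines of Python. This is a sketch of the Spearman–Brown correction for a test split into two halves; the function name `spearman_brown` is our own.

```python
def spearman_brown(r_half):
    """Correct a split-half correlation to estimate full-length reliability."""
    if not -1.0 < r_half <= 1.0:
        raise ValueError("the half correlation must be greater than -1 and at most +1")
    return 2 * r_half / (1 + r_half)

# Reproduce the chart: original value, corrected value, and the gain
for r_half in (0.10, 0.20, 0.30, 0.40, 0.50, 0.60, 0.70, 0.80, 0.90):
    corrected = spearman_brown(r_half)
    gain = corrected - r_half
    print(f"{r_half:.2f} -> {corrected:.2f} (gain of {gain:.2f})")

# The worked example from the text: a half correlation of .73
print(f"{spearman_brown(0.73):.2f}")
```

Because the code works with unrounded values, a gain may occasionally differ from the chart in the last digit.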

Let’s first understand what we have here (look at the row for an original coefficient of .50). To begin with, if you have a split-half reliability coefficient of .50, after correction it is .67. So the coefficient increases from .50 to .67, an increase of .17. And that’s an increase in reliability of 34% (.17/.50 = .34). Got all that?

Now, here’s what’s really, really interesting. If you look at the table, you can see that the percentage increase for corrected split-half reliability coefficients shrinks as the original split-half reliability coefficient increases, right? Why? Simple: the more reliable the original estimate (the first column), the less room for improvement when corrected. In other words, as a test becomes more reliable, it has less room for change.

King Solomon was wise in his decision, but splitting a test in half means half as long a test, and that can create a problem. That’s because shorter tests are less reliable than longer ones in general. That’s true for a couple of reasons, but the main one is that having more chances to respond decreases the amount of randomness in the total score. More observations increase precision. For example, if you are preparing a history achievement test on the American Civil War, 20 items would surely cover some information, but 100 would cover much more: a much more representative sample of what could be tested, which greatly increases the chances that the test is consistent.

It also makes sense that the randomness would cancel itself out across many observations. Think of the example of an attitude survey. You might score a little higher than you typically would on one item and then a little lower on another item. Or guess a multiple-choice question correctly and then later on the same test make a guess that is wrong. The more items, the longer the test, the more likely that randomness will play less of a role in your score.


Cronbach’s Alpha (or α)

Now here’s our second way of computing internal consistency estimates for a test: Cronbach’s alpha (also referred to as coefficient alpha), symbolized by the cool little Greek letter alpha, or letter a, which looks like this: α. The following table lists some sample data for 10 people on a five-item attitude test (the I ♥ HMO Test), where scores are between 1 (strongly disagree) and 5 (strongly agree) on each item. (This format, with the strongly disagree to strongly agree range of responses, is called a Likert-type scale, by the way.) Cronbach’s alpha is especially useful when you are looking at the reliability of a test that doesn’t have right or wrong answers, such as a personality or attitude test, but can also be used to evaluate the reliability of tests with right/wrong answers as well.

| ID | Item 1 | Item 2 | Item 3 | Item 4 | Item 5 |
|---|---|---|---|---|---|
| 1 | 3 | 5 | 1 | 4 | 1 |
| 2 | 4 | 4 | 3 | 5 | 3 |
| 3 | 3 | 4 | 4 | 4 | 4 |
| 4 | 3 | 3 | 5 | 2 | 1 |
| 5 | 3 | 4 | 5 | 4 | 3 |
| 6 | 4 | 5 | 5 | 3 | 2 |
| 7 | 2 | 5 | 5 | 3 | 4 |
| 8 | 3 | 4 | 4 | 2 | 4 |
| 9 | 3 | 5 | 4 | 4 | 3 |
| 10 | 3 | 3 | 2 | 3 | 2 |

When you compute Cronbach’s alpha (named after educational researcher Lee Cronbach), you are actually correlating the score for each item with the total score for each individual and then comparing that with the variability present for all individual item scores. The logic is that any individual test taker with a high(er) total test score should have a high(er) score on each item (such as 5, 5, 3, 5, 3, 4, 4, 2, 4, 5) for a total score of 40, and that any individual test taker with a low(er) total test score should have a low(er) score on each individual item (such as 3, 1, 4, 2, 4, 3, 1, 5, 5, 1, 5, 1) for a total score of 35.

We will show you the formula to compute Cronbach’s alpha, if you promise not to scream in terror. It’s pretty daunting looking:

$$\alpha = \left(\frac{k}{k-1}\right)\left(\frac{s_y^2 - \sum s_i^2}{s_y^2}\right)$$

where

$k$ = the number of items
$s_y^2$ = the variance associated with the observed score
$\sum s_i^2$ = the sum of all the variances for each item

The next table shows the same set of data with the values needed to complete the previous equation: the variance associated with the observed score, or $s_y^2$, and the sum of all the variances for each item, or $\sum s_i^2$. When you plug all these figures in and get the following equation,

$$\alpha = \left(\frac{5}{5-1}\right)\left(\frac{6.40 - 5.18}{6.40}\right) = .24$$

you find that coefficient alpha is .24 and you’re done (except for the interpretation that comes later!).

| ID | Item 1 | Item 2 | Item 3 | Item 4 | Item 5 | Total Score |
|---|---|---|---|---|---|---|
| 1 | 3 | 5 | 1 | 4 | 1 | 14 |
| 2 | 4 | 4 | 3 | 5 | 3 | 19 |
| 3 | 3 | 4 | 4 | 4 | 4 | 19 |
| 4 | 3 | 3 | 5 | 2 | 1 | 14 |
| 5 | 3 | 4 | 5 | 4 | 3 | 19 |
| 6 | 4 | 5 | 5 | 3 | 2 | 19 |
| 7 | 2 | 5 | 5 | 3 | 4 | 19 |
| 8 | 3 | 4 | 4 | 2 | 4 | 17 |
| 9 | 3 | 5 | 4 | 4 | 3 | 19 |
| 10 | 3 | 3 | 2 | 3 | 2 | 13 |
| Item Variance | 0.32 | 0.62 | 1.96 | 0.93 | 1.34 | $s_y^2$ = 6.4 |

$\sum s_i^2$ = 5.18
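The alpha calculation for the I ♥ HMO data can be sketched in Python as follows. The function name `cronbach_alpha` is ours; the code assumes sample variances (an n − 1 denominator), which is what reproduces the item variances and the total-score variance shown in the table.

```python
from statistics import variance  # sample variance (n - 1 denominator)

# One row per person, one column per item (Items 1 to 5)
scores = [
    [3, 5, 1, 4, 1],
    [4, 4, 3, 5, 3],
    [3, 4, 4, 4, 4],
    [3, 3, 5, 2, 1],
    [3, 4, 5, 4, 3],
    [4, 5, 5, 3, 2],
    [2, 5, 5, 3, 4],
    [3, 4, 4, 2, 4],
    [3, 5, 4, 4, 3],
    [3, 3, 2, 3, 2],
]

def cronbach_alpha(rows):
    """Coefficient alpha from a people-by-items table of scores."""
    k = len(rows[0])                                        # number of items
    item_variances = [variance(col) for col in zip(*rows)]  # one variance per item
    total_variance = variance([sum(row) for row in rows])   # variance of total scores
    return (k / (k - 1)) * (total_variance - sum(item_variances)) / total_variance

alpha = cronbach_alpha(scores)
print(f"Cronbach's alpha: {alpha:.2f}")
```

Here `zip(*rows)` turns the table on its side so that each item's column of scores can be passed to `variance` in turn.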

The Last One: Internal Consistency When You’re Right or Wrong, and Kuder-Richardson

We’ve gone through several different ways of estimating internal consistency, and this is the last one. The Kuder-Richardson formulas (there’s one called 20 and one called 21) are used when answers are right or wrong, such as in multiple-choice tests. Variables with only two possible scores, like right or wrong answers to questions, are dichotomous. Kuder-Richardson estimates are designed especially for dichotomous items.

Here are some data for us to work with on another test, this one a five-item test taken by 10 people and containing questions that could be answered correctly (a score of 1) or incorrectly (a score of 0).

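| ID | Item 1 | Item 2 | Item 3 | Item 4 | Item 5 | Number Correct |
|---|---|---|---|---|---|---|
| 1 | 1 | 1 | 1 | 1 | 1 | 5 |
| 2 | 1 | 1 | 1 | 1 | 1 | 5 |
| 3 | 1 | 1 | 1 | 1 | 1 | 5 |
| 4 | 1 | 1 | 1 | 0 | 1 | 4 |
| 5 | 1 | 1 | 1 | 1 | 1 | 5 |
| 6 | 1 | 1 | 1 | 0 | 0 | 3 |
| 7 | 1 | 1 | 1 | 1 | 0 | 4 |
| 8 | 1 | 1 | 0 | 1 | 0 | 3 |
| 9 | 1 | 1 | 1 | 0 | 1 | 4 |
| 10 | 0 | 0 | 0 | 1 | 1 | 2 |
| % Correct (P) | 0.90 | 0.90 | 0.80 | 0.70 | 0.70 | |
| % Incorrect (Q) | 0.10 | 0.10 | 0.20 | 0.30 | 0.30 | |
| P*Q | 0.09 | 0.09 | 0.16 | 0.21 | 0.21 | |

| Summary Value | |
|---|---|
| Sum of P*Q | 0.76 |
| Variance of Number Correct | 1.11 |
| Number of Items | 5 |
| Number of Test Takers | 10 |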

where

ID = the test taker’s ID number
Item 1, Item 2, etc. = whether the item was answered correctly (1) or not (0)
Number correct = the total number of correct items
P = the proportion of individuals who got an item correct
Q = the proportion of individuals who got an item incorrect
P*Q = the product of P and Q
Variance = the variance of the number correct on the test across individuals

And the magic formula is

$$KR_{20} = \left(\frac{n}{n-1}\right)\left(\frac{s^2 - \sum PQ}{s^2}\right)$$

where

$n$ = the number of items on the test
$s^2$ = the variance of total test scores
$\sum PQ$ = the sum of the product of the proportion correct and the proportion incorrect on each item

This formula, as scary looking as Cronbach’s alpha, is actually mathematically equivalent to Cronbach’s alpha. In test manuals and research papers, it is usually Cronbach’s alpha that is reported for every type of measure, whether it is made up of “dichotomously” scored (1/0) items or questions with a range of possible scores (such as Likert-type “1 to 5, strongly disagree to strongly agree” items). When we plug these data into the KR-20 formula, the grand total (drumroll, please) is

$$KR_{20} = \left(\frac{5}{5-1}\right)\left(\frac{1.11 - .76}{1.11}\right) = .40$$

A KR-20 of .40: good, bad, or indifferent? Hang on for more soon.
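The KR-20 arithmetic can be checked with a short Python sketch using the right/wrong data from the five-item test. The function name `kr20` is ours. To match the figures in the text, the code assumes the variance of the number-correct scores is a sample variance (n − 1 denominator), while P and Q are simple proportions.

```python
from statistics import variance  # sample variance (n - 1 denominator)

# One row per test taker, one column per item (1 = correct, 0 = incorrect)
answers = [
    [1, 1, 1, 1, 1],
    [1, 1, 1, 1, 1],
    [1, 1, 1, 1, 1],
    [1, 1, 1, 0, 1],
    [1, 1, 1, 1, 1],
    [1, 1, 1, 0, 0],
    [1, 1, 1, 1, 0],
    [1, 1, 0, 1, 0],
    [1, 1, 1, 0, 1],
    [0, 0, 0, 1, 1],
]

def kr20(rows):
    """Kuder-Richardson 20 for a people-by-items table of 1/0 scores."""
    n_items = len(rows[0])
    n_people = len(rows)
    p = [sum(col) / n_people for col in zip(*rows)]        # proportion correct per item
    sum_pq = sum(p_i * (1 - p_i) for p_i in p)             # sum of P*Q across items
    total_variance = variance([sum(row) for row in rows])  # variance of number correct
    return (n_items / (n_items - 1)) * (total_variance - sum_pq) / total_variance

result = kr20(answers)
print(f"KR-20: {result:.3f}")
```

With unrounded intermediate values the estimate comes out at about .395, which the text reports as .40.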

HOW BIG IS BIG? INTERPRETING RELIABILITY COEFFICIENTS

Okay, now we get down to the business of better understanding just how big a reliability coefficient, regardless of its flavor (test–retest, etc.), has to be to be considered “acceptable.” We want only two things here:

• We want reliability coefficients to be positive.
• We want reliability coefficients that are as large as possible (close to +1.00).

For example, let’s look at the reliability coefficients we computed for the four types of reliability discussed in this chapter and make some judgments.

| Type of Reliability | Sample Value | Interpretation | What’s Next? |
|---|---|---|---|
| Test–retest reliability | .89 | The test is reasonably consistent over time. A reasonable goal is for the coefficient to be above .70, but better to be in the .80s or .90s. | Not much. This is a pretty reliable test, and you can move forward using it with confidence. |
| Interrater reliability | 83% agreement | There was adequate agreement between raters. | For subjectively scored measures, it is important to establish that it doesn’t matter who does the scoring. You should consider trying to improve the scoring rules, so that there is better agreement. But, in practice, this percentage of agreement isn’t awful. |
| Parallel forms reliability | .12 | The test does not seem to be very consistent over different forms. The value .12 is a very low reliability coefficient. | Work on the development of a new and better alternative form of the test. |
| Internal consistency reliability | .24 | The test does not seem to be one-dimensional in that these items are not consistently measuring the same thing. | Be sure that the items on the test measure what they are supposed to (which, by the way, is a validity issue; stay tuned for the next chapter). |

In general, an acceptable reliability coefficient is .70 or above, but much more acceptable is .80 or above. It’s rare to see values above .90, except for very long standardized tests with lots of questions, like intelligence tests or college admission tests. However, when it comes to interrater reliability, we should really expect nothing less than 85% agreement. The level is so easily raised (just have the judges do more training or improve the scoring instructions) that there is no reason why this higher level should not be reached.

THINGS TO REMEMBER

Okay, here’s the big warning. If you’re reading along in a journal article and realize, “Hey, there’s nothing here about the reliability of the instruments they used,” then a little red flag should go up. There are usually two possible reasons for this. The first is that the test being used is so well known and so popular that it is common knowledge in the field that this test is reliable. That would be true for such tests as the Wechsler Intelligence Scale for Children, the Minnesota Multiphasic Personality Inventory, or the SAT. The second possible reason is that the original designers of the test never collected the kind of data they needed to make a judgment about the reliability of the test, a very dangerous and questionable situation. If someone is going to go to the effort of establishing the reliability of a test and not use it unless it is reliable, they are surely going to brag about it a bit. If the information isn’t there, and it is not because of the first reason earlier, look for trouble in River City beginning not with a T, but with a U (for unreliable). (Bruce wants to explain to the young people that “River City” refers to a song in the ancient movie and stage musical, The Music Man. Neil would want to add that you should see this movie first chance you get.)

Even if a test is long established and trusted, researchers should still compute the reliability for their own sample, though. There are many statistical procedures that are affected by low reliability and can be adjusted and improved if one knows the reliability coefficient.

And there’s another conceptual way to understand and assess reliability, and that’s through the use of the standard error of measurement (or SEM). This is a measure of how much randomness or error one might expect in an individual’s observed score. Basically, the SEM is the amount of variability one might expect in an individual’s observed score if that test is taken an infinite number of times. And as you might expect, the higher the reliability of a test (and the more precise it is and the more accurately it reflects one’s true score), the lower the SEM. In fact, if you’ve had a stats class, you probably remember how one standard deviation encompasses about 68% of all the scores under a normal distribution. Well, here’s the non-surprise: For any test, about 68% of the observed test scores fall within +/–1 SEM of the true score (and this is true for 95% and +/–2 SEMs, and 99% and +/–3 SEMs). Pretty cool.
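The chapter describes the SEM only in words. As a hedged illustration, the sketch below uses the standard classical test theory formula, SEM = SD × √(1 − reliability), which is not stated in the text above. The standard deviation of 15 and the reliability values are hypothetical, and the function name is ours.

```python
from math import sqrt

def standard_error_of_measurement(sd, reliability):
    """SEM under classical test theory: SD * sqrt(1 - reliability)."""
    if sd < 0:
        raise ValueError("the standard deviation cannot be negative")
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must lie between 0 and 1")
    return sd * sqrt(1 - reliability)

sd = 15  # hypothetical standard deviation of observed scores
for reliability in (0.70, 0.80, 0.90):
    sem = standard_error_of_measurement(sd, reliability)
    # About 68% of observed scores are expected within +/- 1 SEM of the true score
    print(f"reliability {reliability:.2f}: SEM = {sem:.1f}")
```

Running it shows the pattern described above: as reliability goes up, the SEM shrinks.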

AND IF YOU CAN’T ESTABLISH RELIABILITY . . . THEN WHAT?
The road to establishing the reliability of a test is not a smooth one at all, and it takes a good deal of work. What if the test is not reliable? Here are a few things to keep in mind. Remember that reliability is a function of how much error contributes to the observed score. Lower that error and you increase the reliability.

• Make sure the directions on the test itself are standardized across all settings when the test is administered.
• Increase the number of items or observations, because the larger the sample from the universe of behaviors you are investigating, the more likely the sample is representative and reliable. This is especially true for achievement tests.
• Delete unclear items, because people will respond in one way on one occasion and in another way on a different occasion, regardless of their knowledge, ability level, or individual traits.
• For achievement tests especially (such as spelling or history tests), don’t make questions really hard or really easy, because any test that is too difficult or too easy does not reflect an accurate picture of one’s performance. You want scores to be able to vary to reflect the variability of true scores in your sample.
• Minimize the effects of external events and standardize all the testing procedures. If a particularly important event (such as Mardi Gras or graduation or a pandemic) occurs near the time of testing, try to reschedule the assessment.
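The advice above to increase the number of items can be illustrated with the Spearman–Brown formula, which this chapter's practice questions also mention. What follows is a hedged sketch rather than the book's own procedure: it assumes the added items are comparable in quality to the existing ones, and the function name and the starting reliability of .60 are hypothetical.

```python
def spearman_brown(reliability: float, length_factor: float) -> float:
    """Predict reliability after changing test length by `length_factor`.

    Spearman-Brown: r_new = (k * r) / (1 + (k - 1) * r), where k is the
    ratio of new test length to old test length (k = 2 means doubling).
    Assumes the new items behave like the old ones.
    """
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must be between 0 and 1")
    if length_factor <= 0:
        raise ValueError("length_factor must be positive")
    k = length_factor
    return (k * reliability) / (1.0 + (k - 1.0) * reliability)


# Hypothetical test with a reliability of .60 at its current length.
for factor in (1, 2, 3):
    predicted = spearman_brown(0.60, factor)
    print(f"length x{factor}: predicted reliability = {predicted:.2f}")
```

Under these assumptions, doubling the test raises the predicted reliability from .60 to .75, and each further increase in length buys a smaller gain.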

JUST ONE MORE THING (AND IT’S A BIG ONE) The first step in creating or using an instrument that has sound psychometric (“measuring the mind”) properties is to establish its reliability (and we just spent some good time on that). Why? Well, if a test or measurement instrument is not reliable, is not consistent, and does not do the same thing time after time after time, it can’t possibly be measuring what it is supposed to. Random scores do not provide information. Let’s say you are looking at the effects of X on Y and you create some test to measure Y. If the test you create is not reliable, how can you ever know that X actually caused any change you see in Y? Perhaps the change was just due to random variation and error and nothing related to X. And if there is no change in Y, how do you know it’s not due to a poorly constructed and developed test rather than the fact that X has no effect? This is not easy stuff and takes thoughtfulness on the part of the practitioner as well as the consumer. Know whether or not your test is unreliable, what kind of reliability is important given the purpose of the test, and how to increase reliability if necessary.

Summary Reliability of test instruments is essential to good science no matter what you are studying. You’ve learned about several ways reliability can be established. Now it’s time to move on to the other indicator of measurement quality, validity, and discuss why validity is essential and how it is established.


Time to Practice

1. Go to the library and find five articles from journals in your field or discipline that do empirical research where data are collected and hypotheses are stated. Then answer these questions:
   a. What types of reliability coefficients should be reported for instruments used in each of the five articles?
   b. How many of these articles discuss the reliability of the measures that are being used?
   c. If information about the reliability of the measures is not discussed, why do you think this is the case?
2. Dr. Stu has already created an ability personality test that he finds to be highly unreliable, and he knows that unreliability is usually due to method or trait error. Name three potential sources of each of these kinds of error and speculate on how they might be eliminated or decreased.
3. Why does reducing error increase reliability?
4. Why are reliability coefficients of 1.00 unrealistic to expect?
5. Here are some data on the same test that was administered at the beginning of a treatment program (for balance training in older adults) given in October and again in May, after 7 months of training.
   a. What kind of reliability coefficient would you establish and why?
   b. What’s your interpretation of the reliability coefficient?

| October Score | May Score |
|---|---|
| 5 | 8 |
| 4 | 7 |
| 3 | 5 |
| 6 | 7 |
| 7 | 8 |
| 8 | 9 |
| 7 | 8 |
| 5 | 5 |
| 5 | 6 |
| 8 | 9 |

6. What does it mean to say that a test is internally consistent, and when might that not be important?
7. What’s the primary danger in using a test for research that’s not reliable?


8. Label each of the following errors as a trait error or a method error:
   a. The teacher forgot to put the correct answer in the set of alternatives in a multiple-choice question.
   b. The test taker was more focused on his recent breakup than on his performance on the test.
   c. The test taker arrived to the final exam 30 minutes late and had to rush through the exam.
   d. The construction occurring outside the test room window was loud and bothersome.
9. Apply the Spearman–Brown formula to find the corrected split-half reliability estimate for an initial split-half reliability coefficient of r = .68. What conclusion would you draw from the corrected estimate about the adequacy of the internal consistency of your test?
10. You are explaining your research results to your friend, and you say you found a test–retest reliability value of .68. What would you say when your friend, who has not taken any courses in tests and measurement, asks you what that number means?

Want to Know More?

Further Readings

• Eckstein, D., & Cohen, L. (1998). The Couple’s Relationship Satisfaction Inventory (CR51): 21 points to help enhance and build a winning relationship. Family Journal of Counseling and Therapy for Couples and Families, 6(2), 155–158.
  A hands-on example of using an instrument with established reliability.
• Winstanley, M. R. (2004). The relationship between intimacy and later parenting behaviors in adolescent fathers. Dissertation Abstracts International: Section B: The Sciences and Engineering, 64(11B), 5822.
  This journal article reports on how the lack of reliability threatens the value of a study’s results.

And on Some Interesting Websites

• We may take for granted that medical tests are reliable just because they are part of a relatively authoritative discipline (medicine, health care, etc.). But do beware. Read about the reliability of medical tests at https://labtestsonline.org/understanding/features/reliability/.
• And as a concept, reliability is also very important when applied to different areas, such as the reliability of child witnesses. Read more at http://abcnews.go.com/Technology/story?id=97726&page=1.

And in the Real Testing World

Real World 1
Here’s where we see the most application of the idea of reliability, as a first step in establishing the validity of a test. These authors evaluated the reliability of three acculturation instruments across many different studies (that’s the meta-analysis part) and found that reliability estimates for all three instruments were high. However, they also found that reliability estimates are associated with scale length, gender, and ethnic composition of a sample.
Want to know more? Huynh, Q.-L., Howell, R. T., & Benet-Martínez, V. (2009). Reliability of bidimensional acculturation scores: A meta-analysis. Journal of Cross-Cultural Psychology, 40(2), 256–274.

Real World 2
There are many kinds of behavior for which few reliable measures exist, even though the concepts and constructs are important to measure well. Such is the case with sexual behavior, especially among adolescents. In this study, the researchers documented the test–retest reliability of self-reported sexual health measures, including values, attitudes, and knowledge, among low-income Hispanic adolescents who lived in urban communities. They found that test–retest reliability estimates differed greatly across three samples of Hispanic adolescents but that scales showed good to excellent reliability, which suggests that Hispanic adolescents are relatively stable in their responses on these measures.
Want to know more? Jerman, P., Berglas, N. F., Rohrbach, L. A., & Constantine, N. A. (2016). Test–retest reliability of self-reported sexual health measures among US Hispanic adolescents. Health Education Journal, 75, 485–500.

Real World 3 This reliability stuff is for real and has very practical and important applications. One of the most difficult problems in gang research is the search for valid and reliable measures. Two professors provide an assessment of the reliability and validity of measures of gang homicide using police and survey reports collected from different sources from 2002 to 2006. Given public and political claims about the role of gangs in crime, having reliable (and valid) measures of gang-related crime is imperative. Want to know more? Decker, S. H., & Pyrooz, D. C. (2010). On the validity and reliability of gang homicide: A comparison of disparate sources. Homicide Studies, 14, 359–376.


4
VALIDITY AND ITS IMPORTANCE
The Truth, the Whole Truth, and Nothing but the Truth

Difficulty Index ☺ ☺ (right there with Chapter 3: a bit tough)

LEARNING OBJECTIVES
After reading this chapter, you should be able to

• Define validity the way professionals do.
• Explain content-based and criterion-based arguments for the validity of a test score.
• Argue that all validity is construct-based validity and describe ways of evaluating construct-based validity.
• Apply some basic advice on how to think about validity and the difficulty of creating valid measures.

Validity is simply the property of an assessment tool that works the way it is supposed to. And if a test is valid, then test scores have meaning. If a test is not valid, then what possible meaning can we attach to outcomes produced by it? (None, no meaning, zilch.) For example, if an achievement test is supposed to measure knowledge of history and it is a valid test, then that’s what it does: It measures knowledge of history. The score on the test represents the construct of interest, history knowledge. If an intelligence test is supposed to measure intelligence as defined by the test’s creators and it is a valid test, then the scores represent intelligence.

Our hands-on, real-life examples come from John Govern and Lisa Marsh, who developed and validated the Situational Self-Awareness Scale (or SSAS; measurement types love initials), which is a measure of self-awareness. The authors conducted five studies to assess the validity of the scale using 849 undergraduates as participants and found that the scale detected differences in public and private self-awareness. In other words, it does what it says it does!
Want to know more? Govern, J., & Marsh, L. (2001). Development and validation of the Situational Self-Awareness Scale. Consciousness and Cognition: An International Journal, 10(3), 366–378.

A BIT MORE ABOUT THE TRUTH
Establishing the validity of a test is a whole different ball game from establishing its reliability. The primary reason for this is that, as we discussed in Chapter 3, you can use concrete numbers to quantify the amount of reliability in the scores produced from a test. With validity, this is just not the case. There are some quantitative indicators of validity (in fact, we mention some in the pages to come). Sometimes statistics are used to support a claim about the validity of a measure, but we’ll leave most of those for the next tests and measurement course you take. In this chapter, we focus on the conceptual arguments for validity.

And because we can’t attach a number to the notion of validity very easily, we don’t talk about tests either being valid or not, but rather we speak of the degree of validity along a continuum. Tests can have some validity, but not enough to be useful. Or we might reach the opposite conclusion, in which tests have some validity and it is enough for them to be useful. It’s all about the context and the intended purpose of the test.

Let’s get more technical about a definition of validity. The several governing bodies that have guidelines about the development of tests (such as the American Psychological Association and the National Council on Measurement in Education) have this as the general definition of a test’s validity: the extent to which inferences made from the test are appropriate, meaningful, and useful. Notice that to measurement scientists, validity is all about how one interprets a score. A bit wordier, but it conveys the general message that a valid test does what it’s supposed to do.


For example, if you design a physics test that covers the laws of thermodynamics (are we having fun yet?), what evidence might you collect, or what theoretical argument might you make, to be able to claim that this test has validity? There are lots of ways you might find evidence of validity. For instance, you might use the following strategies:

• You find that several physicists who are experts in the topic agree that the questions you created cover the important topics in the field.
• You can show that how people score on some other physics test (that also measures knowledge of thermodynamics) is similar to how they score on your test. The scores are positively correlated.
• The definitions of important thermodynamics concepts are reflected in how you define those concepts in your questions on the test.

Here’s another example, this time for a personality scale meant to assess levels of aggression. What could you do to provide evidence that this test is valid? Consider these arguments:

• Aggression is sometimes identified by psychologists as being of two types: instrumental aggression, where the person benefits from the aggression (such as stealing Bruce’s ice cream cone), and hostile aggression, where there is no obvious benefit (such as knocking Bruce’s ice cream cone out of his hand). You could demonstrate that there are an equal number of questions on your aggression scale from each of the two types of aggression.
• Measure inmates with your scale as they first enter the prison system and, a year later, correlate those scores with the number of infractions they have committed.
• Observe previously identified aggressive people, and notice that they score higher on your scale than random people who have not been identified as being aggressive. This suggests a link between the scores on your scale and the construct of aggression.

We’ll attach formal names to these procedures and the type of validity they describe later in this chapter, but for now remember this: A test is valid when it does what it was designed to do.

Reliability and Validity: Very Close Cousins Now that you have some idea of what validity is, it’s a good time to mention the very special relationship between reliability and validity. Always remember that reliability is about how much randomness there is in a set of test scores and validity is about how well those scores reflect whatever the test is supposed to measure. These are different concepts.


We mentioned in Chapter 3 that even smart people confuse reliability and validity, though they are different concepts. And that is true. But there is a strong relationship between the two characteristics. Most simply put, a test cannot be valid unless it is reliable. Think about it. Reliability is the quality of a test being consistent, right? And validity is the quality of a test doing what it is designed to do. How can anything do what it is supposed to do (validity) if it cannot do what it is supposed to do consistently (reliability)? It can’t. If scores are random, they measure nothing. Scores that measure nothing cannot measure something. Non-reliable scores cannot be valid. On the other hand, and this is what sometimes gets confusing, a test can be very reliable, but be measuring the wrong thing! So, it is certainly possible that reliable tests are not measuring what they are supposed to. Reliable tests are not always valid. But valid tests are always reliable.
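The claim that unreliable scores cannot be valid has a standard quantitative counterpart in classical test theory. It is not stated in this chapter, so treat the notation below as an added illustration: $r_{XY}$ is the validity coefficient (the correlation between test $X$ and criterion $Y$), and $r_{XX'}$ and $r_{YY'}$ are the reliabilities of the test and of the criterion.

```latex
r_{XY} \le \sqrt{r_{XX'} \, r_{YY'}}
```

So if a test’s reliability is .49 and the criterion is perfectly reliable, the validity coefficient can be no higher than .70. The reverse does not hold: high reliability puts no floor under validity, which is why a reliable test can still measure the wrong thing.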

DIFFERENT TYPES OF VALIDITY ARGUMENTS
Just as there are different types of reliability, there are different types of validity. Technically, actually, these aren’t different types of validity, because validity is what measurement folks call a unitary concept. All the evidence is evaluated together to reach some conclusion about the “amount” of validity in a set of scores. But there are different categories of validity evidence and we’ll cover the three most common types of arguments in this chapter. They are all summarized in Table 4.1.

TABLE 4.1 Different Types of Validity Arguments and How They Work

| Type of Validity Evidence | When You Use It | How You Use It | An Example of What You Can Say When You’re Done |
|---|---|---|---|
| Content-based validity | When you want to know whether the items on a test are a fair representation of the items that could reasonably be on the test | Examine the content very closely and be sure the questions cover the topic or construct well. | My weekly quiz in my stats class is fair because it covers what was in the readings. |
| Criterion-based validity | When you want to know if a test’s scores are related to the scores people would get on some other measure or criterion that measures the same thing or something similar | Correlate the scores from the test with some other measure that is already valid and assesses the same construct. | College admissions tests correlate with college grade point averages (GPAs) at the end of students’ freshman year. |
| Construct-based validity | When you want to know if the scores on a test directly reflect the underlying psychological construct or invisible trait that a person wishes to measure | See if the scores behave in ways that the theoretical definition of the construct expects. | Clients who have been diagnosed with depression score higher on the Beck Depression Inventory than clients who have been diagnosed with anxiety. |


Content-Based Validity
Content-based validity is the property of a test such that the test items fairly sample the universe of items that could have been on the test. Content validity is most often used for achievement tests, everything from your first-grade spelling test to the SAT, and tests of minimal competency, like licensing exams, but sometimes it makes sense for psychological assessments, too. Any time there is a concrete list or structured domain of topics and subtopics, one can make a content validity argument that questions or tasks on an assessment were drawn from that pool of potential questions. Sometimes, test developers make a table of specifications, which lists all the constructs, domains, or topics that should be on a test and how much coverage each area should get.

Establishing Content Validity
Establishing content validity is really a matter of answering the following question: Does the collection of items on the test fairly represent all the possible questions that could be asked? Some tests and measurement specialists think that content validity is nothing other than a sampling issue, the sort of concern researchers have. Just as you want to select representative participants for a research study, test developers ask how well the items they selected for the test represent all the possible items.

Let’s use that physics test we mentioned earlier as an example. Imagine you are creating a final exam for a Physics I class and you want to evaluate the test to ensure that it meets the standards of content validity. One thing you can do is map out and then define the amount of time you spend covering each topic (such as terms or the laws of thermodynamics, and so on). The number of items on the test should reflect the amount of time spent teaching each topic, and you could create a table of specifications that details this precisely.
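The table of specifications described above can be sketched in code. The example below is a hedged illustration, assuming that item counts should be proportional to teaching time. The topic names, the hours, and the function `allocate_items` are hypothetical, and the rounding rule (largest remainder) is one reasonable choice rather than anything the chapter prescribes.

```python
def allocate_items(hours_by_topic: dict[str, float], total_items: int) -> dict[str, int]:
    """Split `total_items` across topics in proportion to teaching hours.

    Each topic first gets the whole-number part of its proportional share;
    any items left over go to the topics with the largest fractional parts,
    so the counts always add up to `total_items`.
    """
    if total_items < 0:
        raise ValueError("total_items must be non-negative")
    if any(h < 0 for h in hours_by_topic.values()):
        raise ValueError("hours must be non-negative")
    total_hours = sum(hours_by_topic.values())
    if total_hours <= 0:
        raise ValueError("total teaching hours must be positive")

    exact = {t: total_items * h / total_hours for t, h in hours_by_topic.items()}
    counts = {t: int(share) for t, share in exact.items()}  # floor of each share
    leftover = total_items - sum(counts.values())
    by_remainder = sorted(exact, key=lambda t: exact[t] - counts[t], reverse=True)
    for topic in by_remainder[:leftover]:
        counts[topic] += 1
    return counts


# Hypothetical Physics I course: hours spent teaching each topic.
hours = {"Key terms": 4, "Laws of thermodynamics": 10, "Heat engines": 6}
spec = allocate_items(hours, total_items=40)
for topic, n_items in spec.items():
    print(f"{topic}: {hours[topic]} hours -> {n_items} items")
```

A table built this way gives a documented answer to the question of how much coverage each topic should get, which is the core of a content-based argument.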
In theory, you will be creating a test that accurately reflects the universe of knowledge from which these items can be drawn. (And you’ll be ready to accurately answer the common question from students, “Will this be on the test?”)

Want to get fancy? Remember that as good scientists, we are very interested in providing data that support our conclusions and such. So wouldn’t it be grand to have some quantifiable measure of content validity? It would be, and here it is. C. H. Lawshe, the vocational psychologist, invented one such measure called the content validity ratio. A set of judges decides whether each question on a test is essential, useful but not essential, or not useful at all to the performance of the job or skill under examination. Then the data are entered into this equation:

$$\mathrm{CVR} = \frac{n_e - \frac{N}{2}}{\frac{N}{2}}$$

where
CVR = the content validity ratio
$n_e$ = the number of judges who rated the item as essential
$N$ = the total number of judges

So for each item, the CVR is computed. For example, if there were 10 judges and 5 of them judge an item as essential (perhaps the criterion you want to use), then CVR would equal

$$\mathrm{CVR} = \frac{5 - \frac{10}{2}}{\frac{10}{2}} = 0$$

So any value less than 0 means that there is less than adequate agreement that the item is essential to the job or skill for which it is intended. This ratio pertains to each item’s usefulness, and by examining all the items you can get a good idea about how fairly the items on the test represent the items that should be on the test. And that, our friend, is the very definition of content validity.

Appearances can be misleading. Sometimes, you’ll see the term face validity used synonymously with content validity. Nope, not the same. Face validity is claimed to be present if the items on a test appear to adequately cover the content or if a test expert thinks they do. Kind of like, “Hey, Albert (or Alberta), do you think this set of 100 multiple-choice items accurately reflects all the possible topics and ideas I would expect the students in my introductory class to understand?” In this context, face validity is more like “approval” validity. It’s the general impression that the test does what one thinks it should. The important distinction is that face validity is more or less a social judgment rendered by some outside person (even if an expert) without the application of a table of specifications, or some other structured list, such as the type we discussed for content validity. So, promise us that you won’t go around showing off by saying “face validity” like it is as important as the types of validity evidence we focus on in this chapter.

In the Standards for Educational and Psychological Testing, content validity is also discussed within the framework of evidence based on test content (which should not come as any surprise). More about that when we talk about standards in Chapter 16.
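The content validity ratio described in this section is simple enough to compute for a whole set of items at once. The sketch below follows Lawshe’s formula as given in the text; the function name and the panel ratings for the three items are hypothetical and exist only to show the calculation.

```python
def content_validity_ratio(n_essential: int, n_judges: int) -> float:
    """Lawshe's CVR = (n_e - N/2) / (N/2).

    n_essential: number of judges who rated the item as essential (n_e)
    n_judges:    total number of judges (N)
    """
    if n_judges <= 0:
        raise ValueError("n_judges must be positive")
    if not 0 <= n_essential <= n_judges:
        raise ValueError("n_essential must be between 0 and n_judges")
    half = n_judges / 2
    return (n_essential - half) / half


# The chapter's example: 10 judges, 5 of whom rate the item as essential.
print(content_validity_ratio(5, 10))  # 0.0

# Hypothetical panel of 10 judges rating three items.
essential_votes = {"item_1": 9, "item_2": 5, "item_3": 2}
for item, votes in essential_votes.items():
    print(item, round(content_validity_ratio(votes, 10), 2))
```

The ratio runs from −1 (no judge calls the item essential) to +1 (every judge does), so items with negative values are the first candidates for revision or removal.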


Criterion-Based Validity
Criterion-based validity assesses whether a test measures the same construct as some other test that could be given now or in the future. If the criterion can be given concurrently, in the here and now (around the same time or simultaneously), we call this type of criterion validity concurrent criterion validity or simply concurrent validity. If the criterion won’t be available until the future, we talk about predictive criterion validity, or just predictive validity.

Criterion validity is most important for two kinds of tests: those that predict the future (like college admissions tests or aptitude tests about whether you’d be good at some job you’re applying for) and those meant to be used instead of some other test (like a shorter or cheaper version of a long or expensive standardized test, or a screening instrument given in medical settings to see who needs to take some more accurate, and possibly intrusive, medical test). To establish criterion validity, one need not establish both concurrent and predictive validity, only the one that works for the purposes of the test.

Remember validity has to do with how a test is intended to be used. Some tests, but not most, are intended to be used to predict future performance, such as the Graduate Record Exam (GRE), which predicts how students will perform in graduate school. These tests, of course, should demonstrate good predictive validity in order to be useful. A quick survey in a women’s magazine about whether you are a good friend wouldn’t be expected to come along with predictive validity evidence.

In tests and measurement (and statistics and almost every other discipline) there is usually more than one term or phrase to represent the same idea. In the case of criterion validity, it is sometimes called criterion-related validity. Same thing, no worries. This is a reminder that these aren’t really types of validity, just different aspects of what it means for a test to work as it is intended to work.
Establishing Concurrent Validity
Imagine you’ve been hired by the Universal Culinary Institute to design an instrument that students take when they graduate, perhaps for certification or licensing by a national board. Some part of culinary training has to do with basic knowledge, such as “What’s a roux?” (It’s a mix of flour and fat used to make sauces, of course!) That part of the test is basically an achievement test. Another part of the test is designed to measure culinary skills (such as knife technique, pastry crust creation, and so on; hungry yet?) and it’s a performance-based assessment where trained judges observe students doing cooking stuff and rate them. This part of the test is called the Cooking Skills scale.

The Cooking Skills scale is a set of 5-point items across a variety of criteria (presentation, cleanliness, etc.) that each judge will use. Imagine you want to establish concurrent criterion validity. The first step is to choose a criterion against which to compare scores on the Cooking Skills scale. As a criterion, you could use ratings from all the faculty based on their own judgment of each student’s ability. This approach to student evaluation is pretty common, so it would work well as the external criterion: it is trusted and accepted as a valid way to do things. (It probably isn’t, but that’s another story.) So, the average of the instructors’ ratings is our criterion. Then you simply correlate the Cooking Skills scores with faculty ratings. If this “validity coefficient” is high (e.g., something like .70), you’re in business; if not, it’s back to the drawing board. In this example, you will have established concurrent validity by the very nature of the criterion being closely related to what you want your test to measure.

The coefficient of determination is a way to interpret the value and strength of correlation coefficients, and you may have seen these terms used in your basic statistics class. Well, the coefficient of determination is the correlation coefficient squared and is interpreted as the amount of variance accounted for in one variable by the other. This is a very useful tool for understanding criterion validity. The validity coefficient for criterion validity can be squared, and we get some idea of the strength (and importance) of the relationship between our test and the criterion. In the case of the Cooking Skills scale, if the correlation between the Cooking Skills score and the faculty ratings is .70, then r² = .49, or 49% of the variance in Cooking Skills scale scores overlaps with faculty judgment. That is a useful bit of information to know about a test, and it is also useful for seeing how much of a variable is not accounted for (here, the other 51%)!

In some ways, validity is a pretty straightforward concept. If a test does what it should, then it’s valid. But the threats to validity go way, way beyond this simple idea.
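The validity coefficient and coefficient of determination described for the Cooking Skills scale can be sketched as follows. This is a minimal illustration under stated assumptions: the eight pairs of scores are invented, the function name is hypothetical, and the Pearson correlation is written out by hand only to keep the example self-contained.

```python
import math


def pearson_r(x: list[float], y: list[float]) -> float:
    """Pearson correlation between two equal-length lists of scores."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("scores must vary for a correlation to be defined")
    return sxy / math.sqrt(sxx * syy)


# Invented data for eight students (not from the book):
# total Cooking Skills scale score and average faculty rating.
cooking_skills = [22, 31, 27, 35, 18, 29, 33, 25]
faculty_ratings = [3.0, 4.0, 3.5, 4.5, 2.5, 3.0, 4.5, 3.5]

validity_coefficient = pearson_r(cooking_skills, faculty_ratings)
shared_variance = validity_coefficient ** 2  # coefficient of determination

print(f"validity coefficient r = {validity_coefficient:.2f}")
print(f"coefficient of determination r^2 = {shared_variance:.2f}")
```

The same function would serve for predictive validity; only the criterion changes, from a here-and-now rating to a score collected later.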
The presence of validity can also be thought of as a very broad foundational requirement for good measurement, affected by a host of variables such as bias, ethics, and the wide-ranging social and legal implications that surround testing and the testing establishment (those folks who design, manufacture, and sell tests). Each one of these topics deserves a book in itself, but we’ll provide a brief overview in Chapters 15 and 16. Stay tuned.

Establishing Predictive Validity
Let’s say that the cooking school has been percolating (heh-heh) just fine for 10 years, and you are interested not only in how well people cook (and that’s the concurrent validity part of this exercise that we just did), but also in the predictive validity of the Cooking Skills scale. In other words, how well does your test predict cooking success later on? Now the criterion changes from a here-and-now score (the rating the faculty give) to one that looks to the future. We are interested in how well scores on the Cooking Skills scale predict success as a chef years down the line (heh-heh). Now to do that, of course (because we are exploring predictive validity), we would need to locate graduates of the program who have been out cooking for a while, say 10 years, measure their level of success somehow, and look at their Cooking Skills scores. By correlating the two scores we would have predictive validity evidence. So, let’s wait 10 years. . . .

Okay, now let’s do our study. We have lots of choices for how to get a score for their current success. We could check whether the graduates own their own restaurants and whether they have been in business for more than a year (given that the failure rate for new restaurants is more than 70% within the first year). The rationale is that if a restaurant is in business for more than a year, then the chef must be doing something right. We might give them a 1 if the answer is Yes and a 0 if their answer is No. (This would have some construct validity as a scoring system if you buy our theory that running a successful restaurant means you are a good chef.)

The disadvantage of this scoring system, though, is that it is at a low level of measurement. In Chapter 2 (which we read 10 years ago, so you may have forgotten), we realized that scores are much more informative if they are at the interval level, where many scores are possible along a continuum and the distances between scores have equal meaning all the way along that continuum. A Yes/No scoring system is ordinal and doesn’t give us as much information to distinguish among all the chefs. So, let’s pick an interval-level system. One way would be to count the number of days that each graduate has been employed as a head chef. Then scores would range from 0 to 3,650 (10 years × 365 days a year = 3,650). That’s even at the ratio level of measurement, better than interval! See Chapter 2 for more about all this levels of measurement stuff. To complete this exercise, you correlate the Cooking Skills score from 10 years ago with the number of days employed.
A high correlation indicates predictive validity, and a low correlation indicates the lack thereof. (By the way, these validity coefficients can be negative; it all depends on how you happen to score your variables.)

Hmmm . . . About That Criterion
It’s probably obvious to you by now that the key to establishing either concurrent or predictive validity (or criterion validity of any kind) is the quality of the criterion. If the criterion is not an accurate and meaningful reflection of what you want to be sure you are measuring, then the correlations mean nothing, right? For example, if you wanted to test the concurrent validity of the Cooking Skills scale, you would not use some general measure of ability, such as an intelligence test, because you wouldn’t expect a big relationship between that and the skills you learned in culinary school. So what makes a good criterion, and how do you find one?

First, as usual and as Ben Franklin surely said at one point, common sense doesn’t hurt. Criterion validity is often used to establish the validity of aptitude and performance tests, and it does not take a rocket scientist to determine what set of skills might be related to those being tested. If you’re interested in looking at the concurrent validity for a test of secretarial skills, criteria such as personality, organizing skills, and efficiency (all defined in one way or another) would seem to fit fine. Or if you are interested in the predictive validity of an aptitude test for teaching, then reliable and valid teacher ratings might work just fine as well.

Second, there’s almost always a massive amount of literature on a particular ability or trait or performance skill that you can find in your discipline’s journals in the library. If an article discusses how important spatial skills are to mechanical engineering, then you might move toward an established test of spatial skills (such as the Minnesota Spatial Relations Test, published by American Guidance Service, Inc.). Here’s where having a good understanding of your subject matter comes in very handy.

Third, any criterion you select should (of course) be reliable, as we discussed in Chapter 3. The scores should be precise and should not have much randomness in them.

Finally, and this seems counterintuitive, your criterion should not be too similar to what you are validating. This is called criterion contamination, and the idea is that if you are just measuring the same thing twice, then, of course, you would get a big correlation. The contamination idea is for situations where the actual score on your test directly affects the score on your criterion. For example, if you have judges rating contestants in a musical competition, and they know what the scores were on the preliminary screening done by the producers, that could affect their judgments. So, a study validating the screening system would find high correlations with judges’ later evaluations, and that wouldn’t really speak to whether that screening is any good.
In the Standards for Educational and Psychological Testing, criterion validity is discussed within the framework of evidence based on relationships with other variables. More about that in Chapter 16.
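
The predictive validity exercise described above (correlating Cooking Skills scores with days employed as a head chef) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the six graduates and their scores are made up, and the function name `pearson_r` is ours, not something from the chapter.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products of the deviations from each mean
    cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    # Sums of squared deviations for each variable
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cross / sqrt(ss_x * ss_y)

# Made-up data for six hypothetical graduates (not from the book)
cooking_skills = [55, 62, 70, 78, 85, 93]                # test scores from 10 years ago
days_as_head_chef = [200, 900, 1500, 1400, 2800, 3300]   # criterion, possible range 0 to 3,650

validity_coefficient = pearson_r(cooking_skills, days_as_head_chef)
print(round(validity_coefficient, 2))
```

With these invented numbers the coefficient comes out high and positive, which is what predictive validity evidence would look like. If the criterion were scored in the opposite direction (say, days not employed), the same relationship would show up as a negative coefficient.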

Construct-Based Validity

Construct-based validity refers to evidence that the score on a test actually represents the invisible abstract trait that you are trying to measure. In education, psychology, marketing, and other social sciences, test developers often are hoping to measure a complex concept that is invisible to the naked eye. Think about this! Researchers routinely measure intelligence, depression, motivation, attitude, anxiety, loyalty, love, hate, and hundreds of constructs that are hypothesized to exist but can’t be seen. These constructs are defined by theory only, and we are able to see them (or assume we can) by measuring them and slapping a score on them! This is remarkable and an impressive triumph for the science of psychometrics. We take it for granted that we can measure these variables, but take a minute to appreciate the difficulty in establishing that we are really tapping into these invisible constructs.

A construct (pronounced CON-struct) is a complex variable that we cannot see. For example, aggression is a construct (consisting perhaps of such variables as inappropriate physical contact, violence, lack of successful social interaction, etc.), as is mother–infant attachment, and hope. And keep in mind that constructs are always generated from some theoretical position that the researcher assumes. That’s really important.

Establishing Construct-Based Validity

Though construct-based validity arguments are important for all tests and measures, they are particularly important when the construct being measured is abstract and hard to “see” in concrete ways. For instance, let’s examine the (imaginary) Salkind-Frey FIGHT scale in development, which is a self-report, paper-and-pencil tool that consists of a series of items and is an outgrowth of a theoretical view about what the construct of aggression consists of. We know from our (imaginary) extensive review of the criminology literature that people who are aggressive do certain types of things more than people who are not aggressive; for example, they get into more arguments, they are more physically aggressive (pushing and such), they commit more crimes of violence against others, and they have fewer successful interpersonal relationships. There are also several theories about aggression—why humans are aggressive, the purpose it serves, what causes it, why some people are more aggressive than others, and so on. Items on the scale were written to match the theoretical definition of the construct and what aggression “looks like” according to research.
For example, one theory of aggression is a classic from old-timey psychiatrist Sigmund Freud and defines aggression as a reaction to blocked impulses—humans want something and if they cannot get it, they behave aggressively. Using this theory as a guide to write questions for our measure, we might have items asking about how people feel when their impulses are blocked. We don’t write questions like “How do you feel when your impulses are blocked?” but instead they are more concrete, such as “How do you feel when you have come close to getting something you really wanted but ultimately failed?” The idea is, though, that the nature of the test and its questions are chosen to match the theory of aggression. This will ensure that our measure has construct validity. Notice that the evidence for validity in this example isn’t data that was collected or correlations or research studies, but the “evidence” is an argument, an appeal to reason—if we write questions that were guided by a well-defined theory, then it is reasonable to assume that our questions have construct validity. Congratulations to us, with that argument alone, we’ve made progress toward establishing construct validity! In real life, construct validity arguments do include all that other data analysis stuff, but the core evidence for construct validity claims is usually an appeal to theory.

But wait—there’s more!

The Tough (but Fascinating) One: The Multitrait–Multimethod Way of Establishing Construct Validity

Our previous example gives a simple and elegant way of establishing construct validity, without doing any data collection, but there are other powerful ways of exploring construct validity that do include data collection. One ingenious way that you’ll read about in your studies and that deserves a bit of time here is the multitrait–multimethod matrix, developed by Donald Campbell and Donald Fiske in 1959. Warning—this is not terribly easy stuff, but just spend some time thinking about the following explanation as you work through Figure 4.1. This technique is the Rolls-Royce of establishing construct validity, so taking a little extra time to understand it will be well worth it. The multitrait–multimethod matrix is the top-of-the-mountain way of establishing construct validity, but it takes a good deal of time and resources to see it through to completion.

In the multitrait–multimethod technique, you measure more than one trait using more than one method and then look for certain relationships (between methods and traits that support your ideas). For example, let’s say you want to establish the construct validity of our FIGHT scale and you measure several related variables using the following different methods:

1. An observational tool
2. A self-report tool (the FIGHT scale)
3. Teacher ratings

And let’s say you want to measure the following three constructs:

1. Aggression (the FIGHT trait)
2. Intelligence
3. Emotional stability

As you can see in Figure 4.1, you can give the stack of different measures to one big sample of people and compute a bunch of correlations among all the traits and all the methods.
FIGURE 4.1  Correlations in a Multitrait–Multimethod Matrix If Construct Validity Is Present

                                      Observation                               Self-Report                               Teacher Ratings
Method           Trait                Aggression    Intelligence  Emot. Stab.   Aggression    Intelligence  Emot. Stab.   Aggression    Intelligence  Emot. Stab.
Observation      Aggression           (Very High)
                 Intelligence         Moderate      (Very High)
                 Emotional Stability  Moderate      Moderate      (Very High)
Self-Report      Aggression           High          Low           Low           (Very High)
                 Intelligence         Low           High          Low           Moderate      (Very High)
                 Emotional Stability  Low           Low           High          Moderate      Moderate      (Very High)
Teacher Ratings  Aggression           High          Low           Low           High          Low           Low           (Very High)
                 Intelligence         Low           High          Low           Low           High          Low           Moderate      (Very High)
                 Emotional Stability  Low           Low           High          Low           Low           High          Moderate      Moderate      (Very High)

The thinking is this: If the FIGHT scale really works as a measure of aggression (that is, if it has construct validity), here’s what we would expect:

• The lows (the cells labeled Low in Figure 4.1) represent different methods being used to measure different traits. You’d expect these to be very low, because they share very little in common—no trait or method similarities. These correlations represent discriminant validity—a kind of reverse criterion validity demonstrating that test scores do not correlate with tests they shouldn’t.
• The moderates (the cells labeled Moderate in Figure 4.1) represent the same methods being used to measure different traits. You’d expect these to be moderate, because they have the same method in common, but not very high, because they are not measuring the exact same constructs.
• The highs (the cells labeled High in Figure 4.1) represent correlations between different methods measuring the same trait—and these are all very important validity coefficients. Here’s where a high value validates the use of a new method (such as the FIGHT scale) with an existing one that has already been validated through different means. These correlations represent convergent validity—the term for when scores correlate with each other as expected.
• The very highs (appearing in parentheses in Figure 4.1 along the diagonal of the set of correlation coefficients) represent correlations between the same method measuring the same trait. We’d expect these to be very high, right?

The whole logic behind this technique is based on the fundamental assumption that the correlations between two methods being used to measure the same trait are higher than the correlations where the same method is used to measure different traits (the moderates in Figure 4.1). In other words, regardless of the method being used (the written, perhaps self-report, FIGHT scale, or an observational tool, for example), the scores are similar. And that, coupled with high correlations between different methods measuring the same trait (the validity coefficients labeled High in Figure 4.1), gives you strong evidence that the FIGHT scale has construct validity.
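
The four expectations above boil down to two questions about any pair of measures: do they share a method, and do they share a trait? A minimal Python sketch of that rule follows, assuming each measure is represented as a (method, trait) pair. The function name `expected_level` and the data layout are our own choices for illustration, not part of the original technique.

```python
from itertools import combinations_with_replacement

METHODS = ["Observation", "Self-Report", "Teacher Ratings"]
TRAITS = ["Aggression", "Intelligence", "Emotional Stability"]

def expected_level(measure_a, measure_b):
    """Expected size of the correlation between two (method, trait) measures."""
    same_method = measure_a[0] == measure_b[0]
    same_trait = measure_a[1] == measure_b[1]
    if same_method and same_trait:
        return "Very High"   # same method, same trait: the diagonal
    if same_trait:
        return "High"        # different methods, same trait: convergent validity
    if same_method:
        return "Moderate"    # same method, different traits: only the method is shared
    return "Low"             # nothing shared: discriminant validity

# All nine measures: three methods crossed with three traits
measures = [(method, trait) for method in METHODS for trait in TRAITS]

# Tally the expected level for every cell on or below the diagonal
counts = {}
for a, b in combinations_with_replacement(measures, 2):
    level = expected_level(a, b)
    counts[level] = counts.get(level, 0) + 1

print(counts)
```

In a real study you would compare the correlations you actually obtained against this expected pattern. The sketch only generates the pattern; it does not evaluate any data.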
All Validity Is Construct Validity

You may recall from earlier in this chapter (which may be 10 years ago, depending how literally you followed our instructions) that modern measurement theory sees validity as a unitary concept—that there aren’t really different types of validity, just one big validity. Because validity is defined by the pros as the degree to which accumulated evidence and theory support a specific interpretation of test scores for a given use of a test, all the conceivable arguments and evidence, whether it is content-based or criterion-based or theory-based, form a single coherent body of evidence to support a claim of validity. Or to support a conclusion of the absence of validity. That unitary validity—the “one ring to rule them all”? That’s construct validity. (Bruce is now explaining the Tolkien reference to Neil.)

And If You Can’t Establish Validity . . . Then What?

Well, this is a tough one. Many validity arguments do not require any data collection or statistical analyses, such as content-based strategies and construct-based arguments. So, in general, if you don’t have the validity evidence you want, it’s because your test is not doing what it should. If it’s an achievement test and a satisfactory level of content validity is what you seek, you probably have to redo the questions on your test to make sure they are more consistent with what they should be, according to that expert or those state educational standards or your table of specifications.

If you are concerned with criterion validity, then you probably need to reexamine the nature of the items on the test and answer the question of how well you would expect these responses to these questions to relate to the criterion you selected. And of course, you have to examine the reliability and usefulness or relevance and validity of the criterion.

And finally, if it’s construct validity you are seeking and can’t seem to find, better take a close look at the theoretical rationale that underlies the test you developed and the items you created to reflect that rationale. Perhaps your definition and theoretical model are underdeveloped (a euphemism for not very good), or perhaps they just need some critical rethinking.

A LAST FRIENDLY WORD

Now that we are at the end of our two chapters on reliability and validity, here are some friendly words of advice. There’s a great temptation for undergraduate students working on their honors theses or semester projects, or graduate students working on their theses or dissertations, to design an instrument for their final project. This attempt should spell DANGER, loud and clear. This may not be such a good idea, for the simple reason that the process of establishing the reliability and validity of any instrument can take years of intensive work (which is reason enough to leave it alone for now). And what can make matters even worse is when the naive or unsuspecting individual wants to create a new instrument to test a new hypothesis. On top of everything else that comes with testing a new hypothesis, there is also the work of making sure the instrument works.

On the other hand, it is possible that you need to measure a brand new construct that you just invented or a trait in a very specific context or a narrow range of classroom behaviors that you need to observe. In these cases, it may be best, scientifically, to create your own instrument. For instance, in educational research and marketing research, and other fields, too, students frequently write their own surveys and questionnaires and attitude scales. Good students with the help of good advisers can create nice reliable and valid measures when they need to. Chapter 14 teaches you the basics of that process.

It is a Herculean task to design anything other than the most simple of tests for use in your research. So if you propose to do such, get ready for lots of questions as to why you don’t want to use an existing test. If you are doing original research of your own—such as for your thesis or dissertation requirement—be sure to find a measure that already has well-established reliability and validity. That way, you can get on with the main task of testing your hypothesis and not fool with the huge task of instrument development—a career in and of itself.

You’ve read about the relationship between reliability and validity several places in this chapter, but there’s a very cool relationship lurking out there that you may read about later in your course work and you should know about now. This relationship says that the maximum level of validity (such as that measured by one of the coefficients we talked about) is equal to the square root of the reliability coefficient. For example, if the test–retest reliability coefficient for a test of mechanical aptitude is .87, the validity coefficient (the correlation of a test with some other test) can be no larger than .93 (which is the square root of .87). What this means in tech talk is that the validity of a test is constrained by how reliable it is. And that makes perfect sense if we stop to think that a test must do what it does consistently before we can be sure it does what it says it does. Remember that scores that are random don’t mean anything, by definition.
Another cool thing to learn more about in your next measurement course is that this known mathematical relationship between reliability coefficients and correlation coefficients allows researchers to see what the true relationship among variables would be if both variables were measured using tests with perfect reliability! This is called correction for attenuation.
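
Both relationships can be sketched in Python. The first function follows the square-root rule stated in this chapter. The second uses the standard correction-for-attenuation formula (the observed correlation divided by the square root of the product of the two reliabilities), which the chapter mentions but does not write out, so treat it as our addition. The function names and the example values for the correction are hypothetical.

```python
from math import sqrt

def max_validity(reliability):
    """Upper bound on a validity coefficient, given the test's reliability."""
    return sqrt(reliability)

def correct_for_attenuation(r_xy, reliability_x, reliability_y):
    """Estimated correlation if both measures had perfect reliability."""
    return r_xy / sqrt(reliability_x * reliability_y)

# The chapter's mechanical aptitude example: reliability of .87
print(round(max_validity(0.87), 2))

# Hypothetical values, chosen only for illustration
observed_r = 0.40
print(round(correct_for_attenuation(observed_r, 0.80, 0.70), 2))
```

The corrected value is always at least as large as the observed correlation, because unreliable measures can only weaken (attenuate) the relationship you see.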

Summary

This has been an interesting tour through the world of validity (following our discussion on reliability), so you now have a good idea regarding the value of both precision and validity when it comes to understanding tests and how they are constructed and used. Our next task takes us to understanding test scores, and how statistics are used to examine them, the focus of Chapter 5.

Time to Practice

1. Go to the library and find five journal articles in your area of interest where reliability and validity data are reported, as well as the outcomes measures used. Identify the type of reliability that was established, identify the type of validity, and comment on whether you think the levels are acceptable. If not, how can they be improved?
2. Provide an example of how you could establish the construct validity of a test of shyness.
3. Provide an example of how you establish concurrent validity and predictive validity of a test of marital satisfaction.
4. Your classmate forgot to read Chapter 4. In his research on a new instrument, he reports a reliability coefficient of .49 and a validity coefficient of .74 correlation with another test that measures the same thing, and he describes these results as an indication that his test demonstrates adequate validity. What is wrong with this statement?
5. When testing any experimental hypothesis, why is it important that the test you use to measure the outcome be both reliable and valid?
6. You’re smart, right? Why should you not spend your dissertation or thesis time developing a test?
7. Why is the multitrait–multimethod way of establishing construct validity so clever in its design and execution?
8. Why pay so much attention to the selection of the criterion when trying to establish criterion validity?
9. Okay, you’ve developed your test—and congratulations, first big hurdle overcome. But despite your best efforts to establish validity, no deal. What’s your next step?

Want to Know More?

Further Readings

• Edwards, W. R., & Schleicher, D. J. (2004). On selecting psychology graduate students: Validity evidence for a test of tacit knowledge. Journal of Educational Psychology, 96(3), 592–602.

  These researchers looked for evidence for the criterion-related validity of a measure of tacit knowledge as a way to select graduate students for advanced study in psychology.

• Supple, A. J., Peterson, G. W., & Bush, K. R. (2004). Assessing the validity of parenting measures in a sample of Chinese adolescents. Journal of Family Psychology, 18(3), 539–544.

  These investigators found that measures of negative parenting that included physical or psychological manipulations may be relevant for understanding the development of Chinese adolescents.

And on Some Interesting Websites

• Lots more on the very cool multitrait–multimethod matrix, brought to you by William M. K. Trochim at http://www.socialresearchmethods.net/kb/mtmmmat.php.

• A detailed discussion of validity “types” written by Hamed Taherdoost is available at https://www.researchgate.net/publication/319998004_Validity_and_Reliability_of_the_Research_Instrument_How_to_Test_the_Validation_of_a_QuestionnaireSurvey_in_a_Research.

And in the Real Testing World

Real World 1

Good social and behavioral scientists always care about the reliability and validity of the instruments they use. And if the reliability and validity aren’t apparent, they attempt to test them, such as in this study where the goal was to examine the construct validity of a new version of a Physical Activity Scale (PAS) for measuring average weekly physical activity of sleep, work, and leisure time. The validation effort took place through interviews and was achieved by assessing agreement scores obtained from average weekly physical activity measured by the PAS and a 24-hour PAS, previously found to overestimate physical activity. The agreement was high, and some minor language changes were made in the scale to better accommodate the user.

Want to know more? Andersen, L. G., Groenvold, M., Jørgensen, T., & Aadahl, M. (2010). Construct validity of a revised Physical Activity Scale and testing by cognitive interviewing. Scandinavian Journal of Public Health, 38(7), 707–714.

Real World 2

We talked a lot about several types of validity in this chapter, but researchers are always coming up with terms that reflect these basic concepts but take them one step further. Jessica Kramer from Boston University defines social validity as the extent to which procedures, goals, and outcomes are acceptable and important to end users. This study features a process of triangulating multiple methods to evaluate the social validity of a self-report assessment, the Child Occupational Self-Assessment, a self-report of everyday activities for children with disabilities. This is another way to lend validity to assessment outcomes—showing how powerful and far-reaching the concept of validity can be.

Want to know more? Kramer, J. S. (2011). Using mixed methods to establish the social validity of a self-report assessment: An illustration using the Child Occupational Self-Assessment (COSA). Journal of Mixed Methods Research, 5(1), 52–76.

Real World 3

Here’s yet another type of validity that refers to the value of an instrument in choosing a given treatment or intervention in education or psychology. Treatment validity is used to link data to the selection of interventions to be used, as well as to make decisions about treatment length and intensity. The work by these researchers reviews the treatment validity of current autism screening instruments and attempts to link their content to the outcomes research in autism to identify the items that can help design effective interventions.

Want to know more? Livanis, A., & Mouzakitis, A. (2010). The treatment validity of autism screening instruments. Assessment for Effective Intervention, 35, 206–217.

5
SCORES, STATS, AND CURVES
Are You Hufflepuff or Ravenclaw?

Difficulty Index ☺ ☺ (Warning: There is math ahead, and this chapter is one of the longest we have!)

LEARNING OBJECTIVES

After reading this chapter, you should be able to

• Distinguish between norm-referenced and criterion-referenced scores.
• Calculate and interpret percentile ranks.
• Calculate and interpret z scores.
• Calculate and interpret T scores.
• Calculate and interpret the standard error of measurement.
• Calculate and interpret averages and the standard deviation.
• Identify the proportion of scores that fall within different areas of the normal curve.

It seems to be part of the testing thing. Almost all of us get very anxious about our test scores. We can’t wait to get that paper back, check the listing of grades posted on the professor’s door (not by name or student number—see Chapter 16), or check the class website to find out if we got the grade we wanted. But test scores and grades and assessment results come in all shapes and sizes. A score of 5 is not that great on a 50-question, multiple-choice history test, but it’s the highest anyone can get on the Advanced Placement Writing Test offered by the Educational Testing Service. So a 5 is not always a 5 is not always a 5. And being in the 90th percentile is a pretty high relative score (and you’ll learn about percentile ranks in this chapter) and sounds acceptable, unless the skill you’re trying to master is so important (like being an astronaut) that anything short of the 99th percentile might not even be considered good.

So we have all these different types of test scores, and we need to know what they mean and how we can use them. That’s what this chapter is all about. We will talk about how to understand a score, ways to transform a raw score magically into something more informative, and how researchers use scores to explore relationships among variables. For much of this chapter, we’ll use the data set you see in Table 5.1. These are scores representing the number correct on a 20-item test for 50 different people.

TABLE 5.1  Number Correct on a 20-Item Test for 50 Different People

ID  Name       Raw Score    ID  Name       Raw Score
 1  Mark          12        26  Sabey         17
 2  Aliyah        20        27  Malik         11
 3  Juan          17        28  Maria         14
 4  Millman        8        29  Carlos        19
 5  Karrie        15        30  Keith         13
 6  Luis           5        31  Mark           6
 7  Annette       15        32  Pam            9
 8  Josh           5        33  Sofia         16
 9  Duke          15        34  Zion          12
10  Dave           8        35  Adam          19
11  Jayden        19        36  Stu           18
12  Leni          15        37  Nancy         17
13  Sara           8        38  Jada           8
14  Micah         11        39  Suzie         16
15  Pepper        14        40  Jan           12
16  Trinity       16        41  Kent           5
17  Mariana        7        42  Annette        5
18  Rachael       20        43  Fatima         9
19  Amir          17        44  Joaquin        6
20  Nevaeh        11        45  Deborah        5
21  Bella         19        46  Ignacio       16
22  Max           16        47  Camila        19
23  Valentina     14        48  Xavier         9
24  Dave          19        49  Ann Marie      9
25  Lori           7        50  John           5

THE BASICS: RAW (SCORES) TO THE BONE!

Let’s start with the simple stuff. A raw score is the score you observe. It is that observed score we talked about in Chapter 3 as part of Classical Test Theory. It is the original and untransformed score before any operation is performed on it or anything is done to it. It is what it is what it is. The scores you see in Table 5.1 are raw scores.

Raw scores by themselves actually don’t tell us much without knowing a bunch of stuff about the test, such as what is good or bad, what is high or low, what range of scores is possible, and so on. But they form the basis for other scores that tell us a lot, such as percentile ranks and other standard scores—and those scores are very important for interpreting performance. Raw scores, by themselves, are essentially meaningless. Even if we know that someone got 18 out of 20 questions correct, we have no idea how difficult the questions are, what type of questions were used, whether the test was timed or not, and a wealth of other factors that provide us with the kind of information we need to make decisions about performance.

For example, in Table 5.1, you can see that Stu (ID 36) got a raw score of 18 and Lori (ID 25) got a raw score of 7. We know that there were 20 items on the test, but we don’t know much more. In some instances (such as reaction time for an airline pilot or a golf score or the number of parking tickets you’ve received), a lower score is better, whereas in others (number of correct responses on a test), a higher score is probably better. Next, let’s distinguish between two types of scores and two ways to interpret scores.

One type of score is norm-referenced scores. Scores that are meant to have a norm-referenced interpretation add meaning to performance by comparing your performance to how other people did. For instance, was your score above or below average? Scores such as percentile ranks (or percentiles), and other standardized scores like z scores (whatever those are), are norm-referenced scores. A norm-referenced interpretation for Stu would be “He scored much better than the majority of the other 49 test takers.” Norm-referenced scores compare individual scores to what is normal or typical. For professionally developed tests given to lots of people, norms are often established to aid in interpretation. Norms (which describe the expected distribution of scores) are identified through research studies in which the test is administered to a huge number of people to see what is actually normal in the real world.

The other type of score is criterion-referenced scores. These have interpretations that give meaning to performance by comparing people’s scores to some standard or a specific score (called a cut score). For example, a criterion-referenced interpretation for Stu might be that “He did not meet the criterion of 95% correct necessary for an A+.” In this example, a classroom teacher chose the criterion of 95% correct to get an A+. With criterion-referenced tests, experts usually do some research or apply a theory to determine the criterion.

Norms are valuable for lots of reasons, but perhaps most importantly, they allow us to compare outcomes with others in the same test-taker group. For example, the Emotional or Behavior Disorder Scale-Revised (authored by Stephen McCarney and Tamara Arthaud) is used in the early identification of students with emotional or behavioral disorders. In this case, 4,308 students ages 5 through 18 years were used as a group to develop the norms.
To know how excessive behavior might be for any one child within that same age range, the child’s score is compared with this set of norms to see how different it is. The most common way to use norms is to take an individual raw score and convert that into one of several values that are more easily understood by students, parents, and teachers and more easily shared in that they have some universal characteristics.

PERCENTILES (OR PERCENTILE RANKS)

A very common way of understanding an individual raw score is the norm-referenced way, which involves examining the score relative to the rest of the scores in the set. For example, we know that Stu got a score of 18, but what does that mean relative to the other scores in the group? How can we get some juicy norm-referenced information from that raw score?

A percentile, or percentile rank, is a point in a distribution of scores below which a given percentage of scores falls. In other words, a percentile is the percentage of people scoring less than a particular raw score. It’s a particular point within an entire distribution of scores, and there are 100 such percentiles or percentile ranks (but really only 99, since the top scorer can score higher than most but not higher than themselves). For example, the 45th percentile (and it looks like this: P45) is the score below which 45% of the other scores fall. Percentiles and percentile ranks (terms that are often used interchangeably) are probably the scores most often used for reporting test results in schools.

Percentiles Mean Different Things to Different People.

Sometimes percentiles are defined as the point in a distribution of scores at or below which a given percentage of scores falls. So, a percentile could be the percentage of people scoring equal to or less than a particular raw score. Notice the difference. Our definition means that the highest percentile rank possible is 99th (or 99.99th if you have a lot of people), because 100% of people can’t be below your score, but almost 100% could. With this other definition, you could be at the 100th percentile because that 100% includes people below your score plus those people who also got all the points possible. The definition being used probably doesn’t matter much as our interpretations wouldn’t be different in a meaningful way, but you should pay attention to which definition is being used.

But you may have also noticed that a percentile tells us nothing about the raw score. If someone has a percentile rank of 45, it means that they could have a raw score of 88 or a raw score of 22. We just don’t know. However, the lower the percentile, the lower the person’s rank in the group.

Percentage or percentile? These are very different animals. A percentile does not tell us the percentage of questions that someone got right. Don’t confuse the two. It’s conceivable that someone could get 50% correct on a test and be in the 99th percentile, right?
That could happen if the test was really hard and almost everyone except for that person got less than half of the items correct. And yes, there’s a quick and easy way to compute percentiles. (Actually, to be honest, it’s not quick, but it is easy.) Here is the formula for computing the percentile for any raw score in a set of scores:

$$P_r = \frac{B}{N} \times 100$$

where
P_r = the percentile
B = the number of observations with lower values
N = the total number of observations


90  Part I 

■ 

The Basics

Let’s compute Stu’s percentile given that his raw score is 18. Here’s how.

1. Rank all the scores with the lowest value at the bottom of the ranking (that’s descending order). You can see we did that in Table 5.2.
2. Count the number of occurrences where the score is less than Stu’s (which is 18). So Juan (who got a score of 17) is the first one to be counted. Counting down, there are 41 scores, or people, worse than Stu.
3. The total number of scores is 50 (which is N).

Once those values are plugged into the formula we showed you above, the percentile for a raw score of 18 in the set of scores shown in Table 5.1 is 82, as shown here:

$$P_r = \frac{41}{50} \times 100 = 82$$

TABLE 5.2 Scores Ranked by Descending Order

| ID | Name | Raw Score | ID | Name | Raw Score |
|---|---|---|---|---|---|
| 2 | Aliyah | 20 | 30 | Keith | 13 |
| 18 | Rachael | 20 | 34 | Zion | 12 |
| 21 | Bella | 19 | 40 | Jan | 12 |
| 29 | Carlos | 19 | 1 | Mark | 12 |
| 47 | Camila | 19 | 14 | Micah | 11 |
| 11 | Jayden | 19 | 20 | Nevaeh | 11 |
| 24 | Dave | 19 | 27 | Malik | 11 |
| 35 | Adam | 19 | 32 | Pam | 9 |
| 36 | Stu | 18 | 43 | Fatima | 9 |
| 3 | Juan | 17 | 48 | Xavier | 9 |
| 19 | Amir | 17 | 49 | Ann Marie | 9 |
| 26 | Sabey | 17 | 4 | Millman | 8 |
| 37 | Nancy | 17 | 38 | Jada | 8 |
| 16 | Trinity | 16 | 10 | Dave | 8 |
| 22 | Max | 16 | 13 | Sara | 8 |
| 33 | Sofia | 16 | 17 | Mariana | 7 |
| 39 | Suzie | 16 | 25 | Lori | 7 |
| 46 | Ignacio | 16 | 31 | Mark | 6 |
| 5 | Karrie | 15 | 44 | Joaquin | 6 |
| 7 | Annette | 15 | 6 | Luis | 6 |
| 9 | Duke | 15 | 41 | Kent | 5 |
| 12 | Leni | 15 | 8 | Josh | 5 |
| 15 | Pepper | 14 | 42 | Annette | 5 |
| 23 | Valentina | 14 | 45 | Deborah | 5 |
| 28 | Maria | 14 | 50 | John | 5 |

In this example, the percentile of 82 corresponds with a raw score of 18. Of all the scores in the distribution, 82% fall below Stu’s score of 18.
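The percentile computation for Stu can be sketched in a few lines of Python. This is only an illustration of the formula from this section, using the "scores strictly below" definition. The function name `percentile_rank` and the compact `counts` list (each raw score from Table 5.2 paired with how many people earned it) are our own conveniences, not something from the text.

```python
def percentile_rank(score, scores):
    """Percentile using the chapter's definition: percent of scores strictly below `score`."""
    below = sum(1 for s in scores if s < score)  # B in the formula
    return 100 * below / len(scores)             # (B / N) x 100


# Raw scores from Table 5.2, written as (score, number of people with that score).
counts = [(20, 2), (19, 6), (18, 1), (17, 4), (16, 5), (15, 4), (14, 3), (13, 1),
          (12, 3), (11, 3), (9, 4), (8, 4), (7, 2), (6, 3), (5, 5)]
scores = [score for score, n in counts for _ in range(n)]

stu_percentile = percentile_rank(18, scores)
print(len(scores), stu_percentile)
```

Under the alternative "at or below" definition mentioned earlier, the comparison `s < score` would become `s <= score`, and Stu's value would come out slightly higher.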

What’s to Love About Percentiles

Here’s the quick and dirty about why percentile ranks are useful:

• They are easily understood by all parties (test takers, parents, teachers, specialists).
• They are computed based on one’s relative position to others in the group; hence, they are norm referenced.
• You can compare across different areas of performance, because a percentile is a percentile is a percentile, and percentiles are independent of raw scores. They are standardized, so they always mean the same thing.

What’s Not to Love About Percentiles

Similarly, some things about percentiles don’t endear us:

• Percentile ranks reflect raw scores but do not accurately reflect differences between raw scores and differences between percentiles. Percentiles are not at the interval level of measurement (see Chapter 2)!
• Oops: is that a percentage or a percentile, and what’s the difference? They are a source of possible confusion.

But love them or not, percentiles are here to stay, and you will see them time and again when you seek out information on what a raw test score actually represents.

LOOKING AT THE WORLD THROUGH NORM-REFERENCED GLASSES The classic way to summarize a bunch of people or a bunch of scores is to think of what is typical or average. When we have lots of data, such as personality test scores for a group of adults, we need some way to organize and represent them.


An average is an intuitively pleasing way to produce a single value that best represents an entire group of scores. It doesn’t matter whether the group of scores is the number correct on a spelling test for 30 fifth-graders or how good the hitters are on the Kansas City Royals. Any group of scores can be summarized using an average. Measurement folks describe averages as measures of central tendency, and you might be surprised to hear that there are at least three types of averages, and only one of them is the definition that most people think average refers to. The three big types of averages are mean, median, and mode.

Computing the Mean

The mean is the most common type of average. It is the sum of all the values in a group, divided by the number of values in that group. So if you had the spelling scores for 30 fifth-graders, you would add up all the scores and get a total, and then divide by the number of students, which is 30. The formula for computing the mean is shown here:

$$\bar{X} = \frac{\sum X}{n}$$

where
$\bar{X}$ (called “X bar”) = the mean value of the group of scores or the mean
∑ (the Greek capital letter sigma) = the summation sign, which tells you to add together whatever follows it
X = each individual score in the group of scores
n = the size of the sample from which you are computing the mean

To compute the mean, follow these steps:

1. List the entire set of values (or enter them into a computer).
2. Compute the sum or total of all the values.
3. Divide the total or sum by the number of values.

For example, if you needed to compute the average score for 10 students on a spelling test, you would compute a mean for that value. Here is a set of 10 such scores (a perfect score is 20).

Spelling Test Score: 15, 12, 20, 18, 17, 16, 18, 16, 11, 7

Here are the preceding numbers plugged into the formula:

$$\bar{X} = \frac{\sum X}{n} = \frac{15 + 12 + 20 + 18 + 17 + 16 + 18 + 16 + 11 + 7}{10} = \frac{150}{10} = 15$$
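The three steps for the mean translate almost word for word into code. Here is a minimal sketch using the ten spelling scores from this example; the variable names are ours.

```python
# Step 1: list the entire set of values.
spelling_scores = [15, 12, 20, 18, 17, 16, 18, 16, 11, 7]

# Step 2: compute the sum of all the values.
total = sum(spelling_scores)

# Step 3: divide the sum by the number of values.
mean = total / len(spelling_scores)

print(total, mean)
```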

Some Things to Remember About the Mean

• The mean is sometimes represented by the letter M and is also called the typical, average, or most central score. It’s usually what the average person means when they say the word average; when they say average, they mean mean (know what we mean?).
• The sample mean is the measure of central tendency that most accurately reflects the population mean.
• The mean is like the fulcrum on a seesaw. It’s the centermost point, where all the values on one side of the mean are equal in weight to all the values on the other side of the mean.
• Finally, the mean is very sensitive to extreme scores. An extreme score can pull the mean in one direction or another and make it less representative of the set of scores and less useful as a measure of central tendency.

Another type of average is the median. The median is the 50th percentile, or the point at which 50% of scores fall below and 50% fall above. The median is the midpoint in a distribution of scores. There’s no standard formula for computing the median; it is really a process with a series of steps. To compute the median, follow these steps:

1. List the values in order, from either highest to lowest or lowest to highest.
2. Find the middle-most score. That’s the median.


For example, here are five SAT verbal scores ordered from highest to lowest.

740, 560, 550, 490, 480

There are five values. The middle-most value is 550, and that’s the median.

Now, what if the number of values is even? Let’s add a value (540) to the list so there are six scores. When there is an even number of values, the median is simply the average of the two middle values. What you call the “average,” measurement folks call the “mean,” as the median technically is a type of average. So, to be fully correct, the median is simply the mean of the two middle values. In this case, the middle two cases are 540 and 550. The average of those two values is 545. That’s the median for that set of six values.

What if the two middle-most values are the same, such as in the following set of data?

600, 550, 550, 480

Then the median is the same as both of those middle-most values. In this case, it’s 550.

Why use the median instead of the mean? For one very good reason. The median is insensitive to extreme scores, whereas the mean is not. What do we mean by extreme? It’s probably easiest to think of an extreme score as one that is very different from the group to which it belongs. For example, in our original list of five scores, shown again here,

740, 560, 550, 490, 480

the value 740 is more different from the other four than any other value in the set, and we would consider that an extreme score. The mean (or arithmetic average) of the set of five scores you see above is the sum of the set of five divided by 5, which turns out to be 564. On the other hand, the median for this set of five scores is 550. Given all five values, which is more representative of the group? The value 550, because it clearly lies more in the “middle” of the group, and we like to think about the average as being representative or assuming a central position.
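The median procedure (sort, then take the middle value, or the mean of the two middle values when the count is even) can be sketched as follows. The helper `median` is our own illustration; Python's standard `statistics.median` does the same job and is used here only as a cross-check.

```python
import statistics


def median(values):
    """Sort the values, then return the middle one (or the mean of the two middle ones)."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:                                 # odd count: a single middle value
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2   # even count: mean of the two middle values


five = [740, 560, 550, 490, 480]
six = five + [540]
tied = [600, 550, 550, 480]

print(median(five), median(six), median(tied))

# The extreme score of 740 pulls the mean above the median.
mean_of_five = sum(five) / len(five)
print(mean_of_five, statistics.median(five))
```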

Computing the Mode

The mode, after the mean and the median, is the third and last measure of central tendency we’ll cover. It is the most general and least precise measure of central tendency. The mode is the value that occurs most frequently, and there is no formula for computing it. To compute the mode, follow these steps:

1. List all the values in a distribution, but list each only once.
2. Tally the number of times each value occurs.
3. The value that occurs most often is the mode.

For example, here is a set of categories representing the possible outcomes of a test and the frequency with which each outcome (a score of 1, 2, or 3) occurs in a sample.

| Category | Number of Students |
|---|---|
| 1. Pass high | 90 |
| 2. Pass | 170 |
| 3. Fail | 40 |

The mode is the value that occurs most frequently, which in this example is a score of 2, which means “Pass.” That’s the mode for this distribution. We usually only report the mode as our average when the scores are really just names of categories, not quantities, as in this example, where our 1s, 2s, and 3s are just convenient labels for our categories.
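The tally-and-pick steps for the mode can be sketched with Python's `collections.Counter`. The list of individual outcomes below is rebuilt from the frequencies in the table purely for illustration; the variable names are ours.

```python
from collections import Counter

# Rebuild the individual outcomes from the frequency table (1 = Pass high, 2 = Pass, 3 = Fail).
labels = {1: "Pass high", 2: "Pass", 3: "Fail"}
outcomes = [1] * 90 + [2] * 170 + [3] * 40

# Steps 1 and 2: list each value once and tally how often it occurs.
tally = Counter(outcomes)

# Step 3: the value that occurs most often is the mode.
mode_value, mode_count = tally.most_common(1)[0]
print(mode_value, labels[mode_value], mode_count)
```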

Understanding Variability You just learned about different types of averages, what they mean, how they are computed, and when to use them. But when it comes to descriptive statistics and describing the characteristics of a distribution, averages are only half the story. The other half is measures of variability.


In the simplest of terms, variability reflects how scores differ from one another. For example, the following set of test scores shows some variability:

98, 86, 84, 88, 94

And this next set of scores has the same mean (90) but less variability than the previous set:

92, 89, 91, 90, 88

They are closer together or, to use the math words, the scores are closer to the mean in this set than in the first. And this next set has the same mean with no variability at all; the scores do not differ from one another at all.

90, 90, 90, 90, 90

Variability (sometimes called spread or dispersion) can be summarized by a single value representing how different scores are from one another, and one such value is called the variance. The variance is based on how much each score in a group differs from the mean: it is the average of the squared differences between each score and the mean. Each variance has a square root that is actually more useful as a measure of variability. We call this square root the standard deviation. By knowing both the mean and the standard deviation of a group of scores we know just about all we need to know.

Computing the Standard Deviation

In practical terms, the standard deviation is the average distance of each score from the mean. The larger the standard deviation, the more variability in a distribution. Here’s the formula for computing the standard deviation.

$$s = \sqrt{\frac{\sum (X - \bar{X})^2}{n - 1}}$$

where
s = the standard deviation (because this is a common statistical formula, we are using the s symbol for standard deviation)
∑ = this symbol means to add up, to find the sum of what follows
X = each individual score
$\bar{X}$ = “X bar,” the mean of all the scores
n = the sample size

This formula finds the difference between each individual score and the mean ($X - \bar{X}$), squares each difference, and sums them all together. Then it divides the sum by the size of the sample (minus 1) and takes the square root of the result. As you can see, and as we mentioned earlier, the standard deviation is an average deviation from the mean.

Here are the data we’ll use (with math test scores ranging from 65 to 91) in the following step-by-step explanation of how to compute the standard deviation:

| Student | Math Test Score |
|---|---|
| 1 | 67 |
| 2 | 65 |
| 3 | 78 |
| 4 | 80 |
| 5 | 82 |
| 6 | 67 |
| 7 | 91 |
| 8 | 89 |
| 9 | 77 |
| 10 | 84 |

1. List each score. It doesn’t matter whether the scores are in any particular order.
2. Compute the mean of the group.
3. Subtract the mean from each score.
4. Square each individual difference. The result is in the column marked $(X - \bar{X})^2$.
5. Sum all those squared differences. As you can see, the total is 758.
6. Divide the sum by n − 1, or 10 − 1 = 9, so then 758/9 = 84.22.
7. Compute the square root of 84.22, which is 9.18. That is the standard deviation for this set of 10 scores.

| Student | Math Test Score | $\bar{X}$ | $X - \bar{X}$ |
|---|---|---|---|
| 1 | 67 | 78 | −11 |
| 2 | 65 | 78 | −13 |
| 3 | 78 | 78 | 0 |
| 4 | 80 | 78 | 2 |
| 5 | 82 | 78 | 4 |
| 6 | 67 | 78 | −11 |
| 7 | 91 | 78 | 13 |
| 8 | 89 | 78 | 11 |
| 9 | 77 | 78 | −1 |
| 10 | 84 | 78 | 6 |

| Student | Math Test Score | $\bar{X}$ | $X - \bar{X}$ | $(X - \bar{X})^2$ |
|---|---|---|---|---|
| 1 | 67 | 78 | −11 | 121 |
| 2 | 65 | 78 | −13 | 169 |
| 3 | 78 | 78 | 0 | 0 |
| 4 | 80 | 78 | 2 | 4 |
| 5 | 82 | 78 | 4 | 16 |
| 6 | 67 | 78 | −11 | 121 |
| 7 | 91 | 78 | 13 | 169 |
| 8 | 89 | 78 | 11 | 121 |
| 9 | 77 | 78 | −1 | 1 |
| 10 | 84 | 78 | 6 | 36 |
| | | | Sum = 0 | Sum = 758 |

What we now know from these results is that each score in this distribution differs from the mean by an average of 9.18 points. By the way, did you notice that we squared the differences, divided by n − 1 instead of n, and then took the square root, along with all sorts of other shenanigans? That is done for statistical reasons, so that our sample’s standard deviation will better match the standard deviation in the larger population from which it was drawn. Notice, also, that it results in a value that isn’t literally the average distance of each score from the mean. But it’s close enough, or so we are told.
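To see the arithmetic in one place, here is a minimal Python sketch of the standard deviation steps for the ten math scores. It also computes two comparison values that are not in the text: the literal average absolute distance from the mean, and the standard deviation when dividing by n rather than n − 1. The variable names are ours; `statistics.stdev` from the standard library is used only as a cross-check.

```python
import math
import statistics

math_scores = [67, 65, 78, 80, 82, 67, 91, 89, 77, 84]
n = len(math_scores)

mean = sum(math_scores) / n                        # compute the mean
deviations = [x - mean for x in math_scores]       # subtract the mean from each score
squared = [d ** 2 for d in deviations]             # square each difference
sum_of_squares = sum(squared)                      # sum the squared differences
variance = sum_of_squares / (n - 1)                # divide by n - 1
sd = math.sqrt(variance)                           # take the square root

print(mean, sum_of_squares, round(variance, 2), round(sd, 2))

# Comparison values (our addition, not from the text).
mean_abs_distance = sum(abs(d) for d in deviations) / n   # literal average distance
sd_dividing_by_n = math.sqrt(sum_of_squares / n)          # divides by n instead of n - 1
print(mean_abs_distance, round(sd_dividing_by_n, 2), round(statistics.stdev(math_scores), 2))
```

The literal average distance comes out somewhat smaller than the standard deviation, which illustrates the point that the standard deviation is close to, but not exactly, the average distance from the mean.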

Some Things to Remember About the Standard Deviation

• The standard deviation is computed as the average distance from the mean. So you will need to first compute the mean as a measure of central tendency. Don’t fool around with the median or the mode in trying to compute the standard deviation.
• The larger the standard deviation, the more spread out the values are, and the more different they are from one another.
• If the standard deviation = 0, there is absolutely no variability in the set of scores, and they are identical in value. This will rarely happen.

THE NORMAL CURVE (OR THE BELL-SHAPED CURVE)

What is a normal curve? The normal curve (also called a bell-shaped curve, or bell curve or normal distribution) is a visual representation of a distribution of scores that has three characteristics, as shown in Figure 5.1.

FIGURE 5.1 The Normal, or Bell-Shaped, Curve [Figure: a symmetrical bell curve with asymptotic tails; the mean, median, and mode sit together at the center.]

We call it normal because if you measure almost anything in the natural world (intelligence, hair length on dogs, the heights of mountains on Mars, or the popularity of mustard among residents of France) and scores are allowed to vary, when you plot those scores in a nice graph, it will look like a very specific, well-defined curve. This is especially true when you measure at the interval level! Remember back in Chapter 2, talking about levels of measurement, we advised that you should measure stuff at the interval level whenever you can? Now you know why.

What is this a graph of? (Our English teacher friends would prefer we write that sentence as, Of what is this a graph?) Along the bottom, the x-axis, are all the scores from lowest to highest, and along the side, the y-axis, is the percentage of people getting that score. So, the highest point on the curve, in the middle, means that more people got middle scores than very high or low scores. Turns out extreme scores are uncommon (duh!).

Some things to note about the normal curve: First, the normal curve represents a distribution of values where the mean, median, and mode are equal to one another. If the median and the mean are different, then the distribution is skewed in one direction or the other. The normal curve is not skewed. It’s got a nice hump (only one), and that hump is right in the middle.

Second, the normal curve is perfectly symmetrical about the mean. If you fold the curve along its center line, the two halves will fit perfectly on each other. They are identical. One half of the curve is a mirror image of the other.

Finally (and get ready for a mouthful), the tails of the normal curve are asymptotic, a big word. What it means is that as the tails extend farther from the mean, they come closer and closer to the horizontal axis but never touch it.

More Normal Curve 101 You already know the three main characteristics that make a curve normal or make it appear bell shaped, but there’s more to it than that. Take a look at the curve in Figure 5.2. The distribution represented here has a mean of 100 and a standard deviation of 10. We’ve added numbers across the x axis that represent the distance in standard deviations from the mean for this distribution. You can see that the x axis (representing the scores in the distribution) is marked from 70 through 130 in increments of 10 (which is the standard deviation for the distribution), the value of 1 standard deviation. As we stated earlier, this distribution has a mean of 100 and a standard deviation of 10. Each vertical line within the curve separates the curve into a section, and each section is bound by particular scores. For example, the first section to the right of the mean of 100 is bound by the scores 100 and 110, representing 1 standard deviation from the mean (which is 100). FIGURE 5.2 

A Normal Curve Divided Into Different Sections [Figure: a normal curve whose x axis shows raw scores of 70, 80, 90, 100 (the mean), 110, 120, and 130, aligned with standard deviations of −3, −2, −1, 0, 1, 2, and 3.]

And below each raw score (70, 80, 90, 100, 110, 120, and 130), you’ll find a corresponding standard deviation (–3, –2, –1, 0, +1, +2, and +3). As you may have figured out already, each standard deviation in our example is 10 points. So 1 standard deviation from the mean (which is 100) is the mean plus 10 points, or 110. If we extend this argument further, then you should be able to see how the range of scores represented by a normal distribution with a mean of 100 and a standard deviation of 10 is 70 through 130 (which includes standard deviations from −3 to +3). Now here’s a big fact that is always true about normal distributions, means, and standard deviations: For any distribution of scores (regardless of the value of the mean and standard deviation), if the scores are distributed normally, almost 100% of the scores will fit between −3 and +3 standard deviations from the mean. This is very important, because it applies to all normal distributions. Because the rule does apply (once again, regardless of the value of the mean or the standard deviation), distributions can be compared with one another. With that said, we’ll extend our argument a bit more. If the distribution of scores is normal, we can also say that between different points along the x axis (such as between the mean and 1 standard deviation), a certain percentage of cases will fall. In fact, between the mean and 1 standard deviation above the mean (which is 110), about 34% (actually 34.13%) of all cases in the distribution of scores will fall. Want to go further? Take a look at Figure 5.3. Here you can see the same normal curve in all its glory (the mean equals 100 and the standard deviation equals 10)— and the percentage of cases that we would expect to fall within the boundaries defined by the mean and the standard deviation. FIGURE 5.3 

Distribution of Cases Under the Normal Curve [Figure: the same normal curve (raw scores 70 to 130, standard deviations −3 to +3) with the percentage of cases in each section, from left to right: .13%, 2.15%, 13.59%, 34.13%, 34.13%, 13.59%, 2.15%, .13%.]

Here’s what we can conclude:

| The distance between . . . | Includes . . . | And the scores that are included (if the mean = 100 and the standard deviation = 10) are from . . . |
|---|---|---|
| The mean and 1 standard deviation | 34.13% of all the cases under the curve | 100 to 110 |
| 1 and 2 standard deviations | 13.59% of all the cases under the curve | 110 to 120 |
| 2 and 3 standard deviations | 2.15% of all the cases under the curve | 120 to 130 |
| 3 standard deviations and above | 0.13% of all the cases under the curve | Above 130 |

If you add up all the values in either half of the normal curve, guess what you get? That’s right, 50%. Why? The distance between the mean and all the scores to the right of the mean underneath the normal curve includes 50% of all the scores. And because the curve is symmetrical about its central axis (each half is a mirror image of the other), the two halves together represent 100% of the scores.

Now let’s extend the same logic to the scores to the left of the mean of 100.

| The distance between . . . | Includes . . . | And the scores that are included (if the mean = 100 and the standard deviation = 10) are from . . . |
|---|---|---|
| The mean and −1 standard deviation | 34.13% of all the cases under the curve | 90 to 100 |
| −1 and −2 standard deviations | 13.59% of all the cases under the curve | 80 to 90 |
| −2 and −3 standard deviations | 2.15% of all the cases under the curve | 70 to 80 |
| −3 standard deviations and below | 0.13% of all the cases under the curve | Below 70 |

Now, be sure to keep in mind that we are using a mean of 100 and a standard deviation of 10 only as sample figures for a particular example. Obviously, not all distributions have a mean of 100 and a standard deviation of 10. All of this is pretty neat, especially when you consider that the values of 34.13% (or about 34%) and 13.59% (or about 14%) and so on are absolutely independent of the actual values of the mean and the standard deviation. The value is 34% because of the shape of the curve, not because of the value of any of the scores in the distribution or the value of the mean or standard deviation.


In our example, this means that (roughly) 68% (34.13% doubled) of the scores fall between the raw score values of 90 and 110. What about the other 32%? Good question. One half (16%, or 13.59% + 2.15% + 0.13%) falls above (to the right of) 1 standard deviation above the mean, and one half falls below (to the left of) 1 standard deviation below the mean. And because the curve slopes, and the amount of area decreases as you move farther away from the mean, it is no surprise that the likelihood that a score will fall more toward the extremes of the distribution is less than the likelihood that it will fall toward the middle. That’s why the curve has a bump in the middle and is not skewed in either direction.

The value of all this? Simple. In any set of test scores that is normally distributed, we can attach a probability to any specific score or range of scores. For example, the probability of any one student getting a 110 or below is about 84%, or 34.13% + 50%. And these probabilities can be used as benchmarks to help us understand how likely (or unlikely) it is that a particular outcome will occur. By the way, while this way of thinking applies to scores on tests, this same logic (knowing probabilities of certain values occurring based on some curve like this) is how we do all of inferential statistics! (So you can skip that stats course. Whew!) It’s also part of the “secret” knowledge that professional gamblers use to win at poker.
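These areas under the curve can be checked numerically. The sketch below uses `statistics.NormalDist` from the Python standard library (available in Python 3.8 and later) with the chapter's example values of mean 100 and standard deviation 10; the variable names are ours. Note that the exact area between 2 and 3 standard deviations rounds to about 2.14%, very slightly different from the 2.15% shown in the chapter's rounded figures.

```python
from statistics import NormalDist

curve = NormalDist(mu=100, sigma=10)  # the chapter's example distribution

# Percentage of cases between consecutive standard deviation marks above the mean.
marks = [100, 110, 120, 130]
bands = [100 * (curve.cdf(hi) - curve.cdf(lo)) for lo, hi in zip(marks, marks[1:])]
beyond_3_sd = 100 * (1 - curve.cdf(130))
print([round(b, 2) for b in bands], round(beyond_3_sd, 2))

# Percentage of scores between 90 and 110, and at or below 110.
within_1_sd = 100 * (curve.cdf(110) - curve.cdf(90))
at_or_below_110 = 100 * curve.cdf(110)
print(round(within_1_sd, 1), round(at_or_below_110, 1))

# The same percentages hold for any mean and standard deviation.
other = NormalDist(mu=500, sigma=100)
same_band = 100 * (other.cdf(600) - other.cdf(500))
print(round(same_band, 2))
```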

THE STANDARD STUFF Most norm-referenced scores are also standard scores. A standard score is one that (surprise) is standardized based on a known set of mathematical rules. For most standardized scores, like the z scores and T scores we are about to discuss, the unit of size is standard deviations. That is, most standard scores describe the distance of a raw score from the mean in standard deviations. You’ll see what we mean, as we start with z scores.

Our Favorite Standard Score: The z Score

Although there are several types of standard scores, the one you will see the most frequently in your tests and measurement (and your statistics) work is the z score. Actually, you don’t really see it much, but it is hidden within the math of most standard scores. A z score tells you whether a raw score is above or below the mean and how far above or below the mean it is. To calculate a z score, we transform a raw score by subtracting the mean from the raw score and then dividing that difference by the standard deviation of the set of scores.

$$z = \frac{X - \bar{X}}{SD}$$

where
z = the z score
X = the raw score
$\bar{X}$ = the mean of the set of test scores
SD = the standard deviation of the set of test scores

For example, here you can see how the z score is calculated for Stu with a raw score of 18. The standard deviation for the set of test scores you saw in Table 5.4 is 5 and the mean is 13.

$$z = \frac{18 - 13}{5} = 1$$

In Table 5.5, you can see the z scores that are associated with each of the raw scores in that big set of scores that we have been working with throughout this chapter.

TABLE 5.5 z Scores Associated With Raw Scores

| ID | Name | Raw Score | z Score | ID | Name | Raw Score | z Score |
|---|---|---|---|---|---|---|---|
| 1 | Mark | 12 | −0.20 | 26 | Sabey | 17 | 0.80 |
| 2 | Aliyah | 20 | 1.40 | 27 | Malik | 11 | −0.40 |
| 3 | Juan | 17 | 0.80 | 28 | Maria | 14 | 0.20 |
| 4 | Millman | 8 | −1.00 | 29 | Carlos | 19 | 1.20 |
| 5 | Karrie | 15 | 0.40 | 30 | Keith | 13 | 0.00 |
| 6 | Luis | 6 | −1.40 | 31 | Mark | 6 | −1.40 |
| 7 | Annette | 15 | 0.40 | 32 | Pam | 9 | −0.80 |
| 8 | Josh | 5 | −1.60 | 33 | Sofia | 16 | 0.60 |
| 9 | Duke | 15 | 0.40 | 34 | Zion | 12 | −0.20 |
| 10 | Dave | 8 | −1.00 | 35 | Adam | 19 | 1.20 |
| 11 | Jayden | 19 | 1.20 | 36 | Stu | 18 | 1.00 |
| 12 | Leni | 15 | 0.40 | 37 | Nancy | 17 | 0.80 |
| 13 | Sara | 8 | −1.00 | 38 | Jada | 8 | −1.00 |
| 14 | Micah | 11 | −0.40 | 39 | Suzie | 16 | 0.60 |
| 15 | Pepper | 14 | 0.20 | 40 | Jan | 12 | −0.20 |
| 16 | Trinity | 16 | 0.60 | 41 | Kent | 5 | −1.60 |
| 17 | Mariana | 7 | −1.20 | 42 | Annette | 5 | −1.60 |
| 18 | Rachael | 20 | 1.40 | 43 | Fatima | 9 | −0.80 |
| 19 | Amir | 17 | 0.80 | 44 | Joaquin | 6 | −1.40 |
| 20 | Nevaeh | 11 | −0.40 | 45 | Deborah | 5 | −1.60 |
| 21 | Bella | 19 | 1.20 | 46 | Ignacio | 16 | 0.60 |
| 22 | Max | 16 | 0.60 | 47 | Camila | 19 | 1.20 |
| 23 | Valentina | 14 | 0.20 | 48 | Xavier | 9 | −0.80 |
| 24 | Dave | 19 | 1.20 | 49 | Ann Marie | 9 | −0.80 |
| 25 | Lori | 7 | −1.20 | 50 | John | 5 | −1.60 |

Bet you can figure out that any raw score above the mean (which is 13 for this set of scores) will have a corresponding z score that is positive, and any raw score below the mean will have a corresponding z score that is negative. For example, a raw score of 15 has a corresponding z score of +.40, and a raw score of 7 has a corresponding z score of −1.20. And, of course, a raw score of 13 (or the mean) has a z score of 0 (which it must, because it is no distance from the mean).

The following are just a few observations about these z scores, as a little review:

1. Those scores below the mean (such as 8 and 10) have negative z scores, and those scores above the mean (such as 14 and 16) have positive z scores.
2. Positive z scores always fall to the right of the mean (on the normal curve) and are in the upper half of the distribution of all the scores. Negative z scores always fall to the left of the mean and are in the lower half of the distribution.
3. When we talk about a score being located one standard deviation above the mean, it’s the same as saying that the z score is 1. For our purposes, when comparing test scores across distributions, z scores and standard deviations are equivalent. In other words, a z score is simply the number of standard deviations that the raw score is from the mean.
4. Finally (and this is very important), z scores across different distributions are comparable. A z score of 1 will always represent the same relative position in a set of scores, regardless of the mean, standard deviation, and raw score used to compute the z score value. This quality is exactly what makes z scores so useful; they are so easily compared across different settings and different testing situations and tests. They are directly comparable across any test situation, if you buy into that whole norm-referenced way of thinking, making them very handy.
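The z score transformation is a one-liner in code. The sketch below uses the mean (13) and standard deviation (5) that the chapter states for this set of scores; the function name `z_score` is ours.

```python
MEAN = 13  # mean of the set of scores, as stated in the chapter
SD = 5     # standard deviation of the set of scores, as stated in the chapter


def z_score(raw, mean=MEAN, sd=SD):
    """Number of standard deviations that a raw score lies above (+) or below (-) the mean."""
    return (raw - mean) / sd


# A few raw scores from Table 5.5.
for name, raw in [("Stu", 18), ("Karrie", 15), ("Keith", 13), ("Mariana", 7)]:
    print(name, raw, round(z_score(raw), 2))
```

Because the result is expressed in standard deviation units, the same function applies to any test once its own mean and standard deviation are passed in.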


What’s to Love About z Scores

You already know the answer to this one: z scores are easily compared across different sets of scores. Want more? Okay . . .

• z scores are easily computed.
• z scores are easy to understand as the distance from the mean for each score in a distribution. They tend to range from −3 to +3 (because the normal curve has about 6 standard deviations under it), and their mean is 0.0.
• No more fooling around with raw scores: z scores tell you exactly where a test score lies and what its relative relationship is to the entire set of scores.

What’s Not to Love About z Scores

There is always that nagging flip side:

• z scores may not be easy for naive parties, such as parents, students, and nontesting professionals, to understand.
• Low scores of any kind (and z scores are a perfect example) always carry the connotation of poor performance. Imagine being told your child got a score of z = −0.2! But that’s basically an average score.
• It’s difficult to interpret any score on a test that has a decimal component, such as 1.0 or 1.5. Try explaining, “You did very well on your exam, with a final test score of 1.87.”

You will find z scores and percentile ranks often used to express a test score position relative to the other test scores in a group. But how do z scores and percentile ranks relate to each other? In an interesting and straightforward way. You may remember from your basic stat class that about 84% of all scores in a normal distribution are less than a z score of +1. Bingo. A z score of 1 corresponds to the 84th percentile, or a percentile rank of 84. And a z score of 0, which would correspond to a raw score equal to the mean of the set of scores, is equal to the 50th percentile. This is also the median on a normal curve!
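The link between z scores and percentile ranks can be sketched with the standard normal curve from Python's `statistics.NormalDist` (Python 3.8 and later). This assumes the scores are normally distributed; the helper name `z_to_percentile` is ours.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)  # z scores have a mean of 0 and an SD of 1


def z_to_percentile(z):
    """Percentage of scores in a normal distribution that fall below the given z score."""
    return 100 * standard_normal.cdf(z)


for z in (-1, 0, 1):
    print(z, round(z_to_percentile(z)))
```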

T Scores to the Rescue

Now here’s a good idea. One of the outstanding disadvantages of z scores is the less-than-outstanding perception of any negative score. Plus, who wants a system where the average score is 0? So to get around these concerns, why not use a type of standard score that will never be negative and has an average that seems more impressive? That’s how T scores, a variant of z scores, were born.


A T score is a standard score as well, only one that uses z scores as shown in the following:

T = 50 + 10z

where
T = the T score
z = the z score

For example, Stu’s raw score of 18, which is equal to a z score of 1, is equal to a T score of 60, or 50 + 10(1). You may realize immediately that what this alternative standard score does is eliminate any negative numbers, as well as most fractional scores. For example, take a look at the top and bottom 10 scores from our original set of scores shown in Table 5.1.

ID   Name      Score   z Score   T Score
 2   Aliyah     20      1.40      65
18   Rachael    20      1.40      65
11   Jayden     19      1.20      62
21   Bella      19      1.20      62
24   Dave       19      1.20      62
29   Carlos     19      1.20      62
35   Adam       19      1.20      62
47   Camila     19      1.20      62
36   Stu        18      1.00      60
 3   Juan       17      0.80      58
17   Mariana     7     −1.20      38
25   Lori        7     −1.20      38
 6   Luis        6     −1.40      36
31   Mark        6     −1.40      36
44   Joaquin     6     −1.40      36
 8   Josh        5     −1.60      34
41   Kent        5     −1.60      34
42   Annette     5     −1.60      34
45   Deborah     5     −1.60      34
50   John        5     −1.60      34


Here, you can see that the best score (20) has a corresponding z score of 1.40 and a T score of 65. And even for the lowest raw score of 5 with a corresponding z score of −1.60, the corresponding T score is 34—a nice positive number. Feeling better, John? z scores and T scores are similar in that they are both transformed scores, and both are comparable across different distributions. z scores are comparable because they use the same metric—the standard deviation. T scores are comparable because they use the same metric—the z score. Finally, a set of z scores generated from a distribution of raw scores has a mean of 0 and a standard deviation of 1, whereas a set of T scores generated from the same distribution has a mean of 50 and a standard deviation of 10.
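The two conversions can be sketched in Python. The standard deviation of 5 comes from the chapter; the mean of 13 is an assumption inferred from the table (a raw score of 18 pairs with z = 1.00), and the function names are illustrative only.

```python
def z_score(raw: float, mean: float, sd: float) -> float:
    """Distance of a raw score from the mean, in standard deviation units."""
    return (raw - mean) / sd


def t_score(z: float) -> float:
    """T = 50 + 10z, so T scores have a mean of 50 and an SD of 10."""
    return 50 + 10 * z


MEAN = 13  # assumption: inferred from the table, not stated in this section
SD = 5     # standard deviation used in the chapter's example

for name, raw in (("Stu", 18), ("John", 5)):
    z = z_score(raw, MEAN, SD)
    print(f"{name}: raw = {raw}, z = {z:.2f}, T = {t_score(z):.0f}")
```

Stu and John come out at T scores of 60 and 34, matching their rows in the table.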

STANDING ON YOUR OWN: CRITERION-REFERENCED TESTS
Here’s the alternative to norm-referenced, norm-based, normalized, and so-on-based tests: criterion-based tests. A criterion-referenced test is one where there is a predefined level of performance used to evaluate outcomes, and it has nothing to do with relative ranking among test takers. For example, if a score of “75% correct” on a multiple-choice driving test is the cut score to get your driver’s license, that’s the criterion in this criterion-referenced system. And you got 75% correct. Well, congrats, you passed. You did “well.” But imagine that most people score 75% or higher. Your percentile rank of all people might still be relatively low, right? But you’re still a star because you met the criterion. The emphasis is on performance rather than relative position compared to others. We expect people to be able to do thousands of things that must be based on an absolute scale of success. We expect police officers to know the rules of law enforcement, not just 50% of the rules. We expect new nurses to be able to start an intravenous line with a 100% level of success: not just better than their classmates but perfect every time. And we expect school bus drivers to drive with few miscalculations, not just to be better than the other drivers. There’s an old joke about medical schools and their use of criterion-referenced standards to give out degrees. What do you call the student who ranks last in their graduating class? Doctor. 😊

THE STANDARD ERROR OF MEASUREMENT
This fascinating concept is important when it comes to understanding how to interpret a test score! No one ever gets the exact same score on a test if that test is taken more than once. That’s the whole idea behind reliability, and you remember our discussion in Chapter 3 on reliability and how there are all kinds of influences that can affect a score and make it a bit random. The standard error of measurement (or SEM) is a simple way to quantify how much a test score varies for an individual from time to time and from test to test. It literally estimates the distance between observed scores and true scores (see Chapter 3 for more about these terms). Using this value can give us an estimate of the accuracy of any one test score. In theory, the SEM is the standard deviation of the differences between observed scores and true scores (and remember that a true score can never be measured directly). Another way to think of the SEM is that it is the standard deviation of repeated test scores. Of course, we cannot ask someone to take a test an infinite number of times, so we use a handy formula that helps us estimate the SEM using the standard deviation of the original set of scores and the reliability coefficient, which was already computed when the test was developed. Here’s the formula:

$$\mathrm{SEM} = SD\sqrt{1 - r}$$

where
SEM = the standard error of measurement
SD = the standard deviation for the set of test scores
r = the reliability coefficient of the test

For the example we have used throughout this chapter, SD = 5 and let’s say that r = .87 (we got the 5 from computing the standard deviation for the set of 50 raw scores, and the .87 is an internal reliability estimate using coefficient alpha). And the value of the SEM is 1.80, computed as follows:

$$\mathrm{SEM} = 5\sqrt{1 - .87} = 1.80$$
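The SEM calculation is easy to reproduce. The sketch below uses the chapter’s values (SD = 5, r = .87); the function name and the range check on r are illustrative additions, not part of the original text.

```python
import math


def standard_error_of_measurement(sd: float, reliability: float) -> float:
    """SEM = SD * sqrt(1 - r), where r is the test's reliability coefficient."""
    if not 0 <= reliability <= 1:
        raise ValueError("reliability is expected to fall between 0 and 1")
    return sd * math.sqrt(1 - reliability)


sem = standard_error_of_measurement(sd=5, reliability=0.87)
print(f"SEM = {sem:.2f}")
```

Notice what the formula implies: as reliability approaches 1, the SEM shrinks toward 0, and as reliability approaches 0, the SEM approaches the full standard deviation of the scores.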

What the SEM Means
The SEM is a measure of how much variability we can expect around any one individual’s score on repeated testing. What we know about normal curves and the distribution of scores, we also can apply to the SEM. For example, Stu got a score of 18 and the SEM for the test is 1.80. This can be interpreted to mean that the chances of Stu’s true score (remember, his “typical” score) falling within 18 ± 1.80 (a range from 16.2 to 19.8) are about 68%. We know this because we know from our knowledge of the normal curve that about 68% of all scores are contained within plus or minus one standard unit from the mean. Only this time, the standard unit is one SEM.
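Here is a small sketch of the band around Stu’s score. The ±1 SEM band is the one described in the text; the ±2 SEM band (roughly 95% under the normal curve) is an extra illustration not given in the original, and `score_band` is a made-up helper name.

```python
def score_band(observed: float, sem: float, n_sems: float = 1.0) -> tuple[float, float]:
    """Range of observed score plus or minus a number of SEMs."""
    return observed - n_sems * sem, observed + n_sems * sem


low, high = score_band(observed=18, sem=1.80)
print(f"About 68% band: {low:.1f} to {high:.1f}")

# Assumption for illustration: plus or minus 2 SEMs covers roughly 95%.
low2, high2 = score_band(observed=18, sem=1.80, n_sems=2)
print(f"About 95% band: {low2:.1f} to {high2:.1f}")
```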

$$\bar{X} = \frac{\sum X}{n} = \frac{15 + 12 + 20 + 18 + 17 + 16 + 18 + 16 + 11 + 7}{10} = \frac{150}{10} = 15$$

$$s^2 = \frac{\sum (X - \bar{X})^2}{n - 1}$$

Summary We are spending the first part of this book on understanding some of the important psychometric qualities of tests, such as reliability, validity, and scoring. And now we have covered many of the statistical and mathematical concepts! Many of these quantitative ways of understanding measurement came from what is now called classical test theory (that whole observed score = true score + error score business). We’ll end this first section of our book with Chapter 6, a discussion of an alternative to that traditional model of testing and an introduction to what is now called modern test theory. It’s time to level up, gang!

Time to Practice
1. What are the advantages and disadvantages of using raw scores to report test results?

2. Using the following set of test scores, compute the percentile rank for Test Taker 12, who had a raw score of 64. Remember that you have to rank all the scores before you can use the formula shown in this chapter.

ID   Score     ID   Score
 1    78       11    67
 2    67       12    64
 3    66       13    58
 4    56       14    87
 5    78       15    88
 6    89       16    89
 7    92       17    90
 8    96       18    98
 9    86       19    85
10    46       20    68

3. Emily and Grace are in different groups, and the following shows their test scores on two different tests. If we assume that a higher score is better, which one scored better overall? Here’s more information than you need because that’s how we roll:
• The mean and the standard deviation for Emily’s group on Test 1 were 90 and 3.
• The mean and the standard deviation for Emily’s group on Test 2 were 92 and 5.
• The mean and the standard deviation for Grace’s group on Test 1 were 87 and 3.
• The mean and the standard deviation for Grace’s group on Test 2 were 93 and 5.

        Score on Test 1   Score on Test 2
Emily         88                91
Grace         92                94

4. What’s the best thing about standard scores (or why are they so useful)?

5. Provide an example of when it is appropriate to use a criterion-referenced test rather than a norm-referenced test. Explain the rationale for your decision.

6. On a test of physical strength, with higher scores representing more physical strength, Annie’s result is a T score of 65 and Sue’s result is a z score of 1.5. According to this test, who is stronger?

7. Convert the following z scores to T scores: –1, 0, 1.

8. On a test of attention deficits, an individual produces a T score of 60. Instead of telling this individual, “You have a T score of 60 on this measure,” how could you explain this finding in a way that would help the individual understand the practical meaning behind it?

9. What does it mean when the SEM is very small? Very big?

Want to Know More?
Further Readings
• Betebenner, D. (2009). Norm- and criterion-referenced student growth. Educational Measurement: Issues and Practice, 28(4), 42–51.
If we want to know if students are learning, it makes a difference whether we use norm-referenced or criterion-referenced score interpretations.
• Mangan, K. S. (2004). Raising the bar. Chronicle of Higher Education, 51(3), A35–A36.
Think that test scores don’t have wide-ranging implications? This article discusses the imposition of tougher law-exam standards in an effort to get better lawyers. However, there’s no proof that requiring higher scores on a standardized test would weed out incompetent lawyers, and it could exclude the minority students whom law schools are eager to attract. Quite a lot to think about.

And on Some Interesting Websites
• The classic article on standard scores by Robert Ebel at http://epm.sagepub.com/content/22/1/15. The article, titled “Content Standard Test Scores,” gives you insight into how standard scores were conceived and used during the past 59 years. Revealing and interesting history.
• The National Education Association provides a ton of stuff about testing (search “testing” on its website) as well as several discussions about what test scores mean and how to use them. Their recent position papers and articles can be found at https://www.nea.org/advocating-forchange/new-from-nea?nea_today=1&.

And in the Real Testing World
Real World 1
Wouldn’t it be great if we had perfect tests that could provide really valuable information on the validity of test-based decisions about readiness for a course or a profession? This article goes into great depth about this topic and helps us understand how such decisions can be evaluated in terms of the match between the procedures used to set the passing scores and the purpose of the decision, the internal consistency of the results, and comparisons to other criteria.
Want to know more? Kane, M. (1994). Validating the performance standards associated with passing scores. Review of Educational Research, 64, 425–461.

Real World 2
Here’s the classic example of how SEM is used to see how consistent a test score is across several administrations. This study assessed whether the Movement ABC can be used to monitor individual change in motor performance and found that the total score of the Movement ABC is sensitive enough to monitor individual change and that individual items are inappropriate to monitor individual change.
Want to know more? Leemrijse, C., Meijer, O., Vermeer, A., Lambregts, B., & Adèr, H. (1999). Detecting individual change in children with mild to moderate motor impairment: The standard error of measurement of the Movement ABC. Clinical Rehabilitation, 13, 420–429.

Real World 3
We’re sure you are not surprised that percentiles and percentile ranks are used most often when reporting large groups of scores and the scores’ relative positions. After all, that’s their primary purpose. This study recommends using percentiles to understand health education levels in Australia.
Want to know more? Elsworth, G. R., & Osborne, R. H. (2017). Percentile ranks and benchmark estimates of change for the Health Education Impact Questionnaire: Normative data from an Australian sample. SAGE Open Medicine, 5. doi:2050312117695716

6 ITEM RESPONSE THEORY
The “New” Kid on the Block
Difficulty Index ☺ (These concepts sometimes make our heads hurt. On the other hand, it is the shortest chapter in the book!)

LEARNING OBJECTIVES
After reading this chapter, you should be able to
• Describe the brief history of Item Response Theory and how it attempts to measure latent traits.
• Interpret an item characteristic curve.
• Define item difficulty and item discrimination within Item Response Theory.
• Identify the difficulty, discrimination, and guessing parameters using an item characteristic curve.
• List a few software options for Item Response Theory analyses and what information they produce.

Item Response Theory is not exactly new, but compared with Classical Test Theory it is a giant leap forward, and for large test development companies, it’s pretty much all they follow. It’s so “on trend” that Item Response Theory is often simply called Modern Test Theory. Researchers from the Biometrics Unit of the Institut Régional du Cancer Montpellier in France compared Item Response Theory with the more classical methods of testing to analyze health-related quality-of-life data. The researchers found that the Item Response Theory approach had the advantage of being more precise and suitable because of its direct use of raw data (it provides more information), though the score interpretations were similar. Pretty neat.
Want to know more? Barbieri, A., Anota, A., Conroy, T., Gourgou-Bourgade, S., Juzyna, B., Bonnetain, F., Lavergne, C., & Bascoul-Mollevi, C. (2016). Applying the longitudinal model from item response theory to assess health-related quality of life in the PRODIGE 4/ACCORD 11 randomized trial. Medical Decision Making, 36, 615–628.

THE BEGINNINGS OF ITEM RESPONSE THEORY In the first five chapters of this book, you learned quite a bit about the underpinnings of some of the basic concepts upon which much of Classical Test Theory is based. Notice the word classical in the name of this approach. This implies that it is an early theory that has been surpassed by something new, and that is kind of true. But the new theory, which we look at in this chapter, hasn’t replaced Classical Test Theory, it’s expanded it, redefined some terms, and allowed for much more precise measurement in the social sciences. Classical Test Theory is still, by far, the most prominent model used by researchers and even many large-scale test developers, but this new system is very cool and you’ll want to know about it. This new way of thinking is Item Response Theory (IRT). IRT is a particular perspective on how test items should be developed and evaluated as they are refined to best take advantage of the fact that individual items, just like total test scores, have their own individual reliabilities. (IRT is obsessed with reliability, even more than validity.) So when discussing IRT, rather than focus on total test scores (and sources of error such as method error or trait error, as we discussed in Chapter 3), we focus on the individual test items and how well they discriminate among individuals and their various underlying levels of the trait being measured. Essentially, Item Response Theory uses information about the difficulty of each item and how it depends on the ability of each person taking the test. You already know that the world of tests and measurement is filled with terms and even a bit of jargon. It’s no different here. Item Response Theory is also referred to as Latent Trait Theory, Strong True Score Theory, and Modern Test Theory. So if you see one of these other terms, you’ll know we are talking about the same general topic here. 
What’s so cool about IRT, and why is it such an attractive alternative to Classical Test Theory (CTT)? Most uniquely, it focuses on and estimates the ability of the test taker independent of the difficulty of the items. IRT does this by looking at the relationship between the performance on each individual item on the test and the underlying (Important! Important! Important!) ability of the test taker. Why underlying? Because the level of ability is not explicit; it’s never a known quantity.
Item Response Theory, 1; Classical Test Theory, 0: Think of a 10-item achievement test, like some imaginary super-short ACT test. In CTT, there are 11 different possible scores (0 correct, 1 correct, 2 correct, etc., up through 10 correct, so 11 possibilities). But in IRT, there are as many possible outcomes as there are possible item combinations, so one possible outcome could be Item 1 correct and the rest of the items incorrect. Or Items 1 and 7 correct and the rest incorrect. Each of the ten items has two possible outcomes, right? The answer was either correct or not correct. Mathwise, that means the number of possible response patterns (the different combinations of right and wrong answers across all the items) is the number of possible outcomes per item raised to the power of the number of items: in this example, 2^10, or 1,024. So rather than a mere 11 data points for performance, there are now 1,024, a much finer scale on which to grade outcomes, allowing more precision and better decisions.
So, what about this item–ability relationship? We start with the assumption that everyone has some level of ability. We usually use the term ability when talking about IRT because it was developed for cognitive and achievement tests, but when we say ability, we really mean the construct, whatever it is. For example, we can use IRT to build attitude scales and personality tests, and we still use the word ability to describe what is being measured. Whether you score 100% or 50% correct on an achievement test, this score acts as an indirect measure of ability. And while it may not reflect the exact true ability of the individual, as we know from Classical Test Theory, this score is a starting point.
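The counting argument for the 10-item test can be verified by brute force. This is a minimal sketch that simply enumerates every right/wrong pattern; the variable names are illustrative.

```python
from itertools import product
from math import comb

N_ITEMS = 10

# Every possible pattern of wrong (0) and right (1) answers across 10 items.
patterns = list(product((0, 1), repeat=N_ITEMS))

# Classical total scores collapse those patterns into number-correct values.
total_scores = {sum(pattern) for pattern in patterns}

print(f"Response patterns: {len(patterns)}")
print(f"Distinct total scores: {len(total_scores)}")

# Many different patterns share one total score, for example a total of 5.
print(f"Patterns with exactly 5 correct: {comb(N_ITEMS, 5)}")
```

The last line shows why the total score throws away information: 252 different response patterns all look identical once they are summarized as “5 correct.”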
This underlying construct, which is critical to the theory, is termed a latent trait (meaning a hidden or unobservable trait), and while present, it is not obvious. But even with an observed score of 100% or 50%, test scores are only estimates and more or less correspond to the true score or typical score of the individual. But as you know from Chapter 3, Reliability and Its Importance, this obtained score is only an estimate of the average score an individual would receive if they took the same test an infinite number of times, and true score remains a theoretical concept. The job of the modern psychometrician is to estimate as accurately as possible one’s true underlying latent trait score. IRT can do a better job at this than the techniques associated with Classical Test Theory, as Item Response Theory considers each individual item, not just the total score, and looks at the patterns of responses across all items and all test takers to get an accurate estimate of individual ability.
The development of IRT was based mostly on the work of Frederic Lord, Melvin Novick, and Allan Birnbaum in their seminal papers, first published in the 1960s and 1970s and then brought to the attention of the tests and measurement world. So if IRT is so cool and useful, why did it take so long to become popular and why isn’t it even more popular now? Easy: a simple, one-word answer: computers. The use of IRT requires huge amounts of data, and the more data, the more effective and accurate the analysis. We are talking 500 individual test results (and thousands more are commonplace) for the results of the analysis to be trustworthy. Computers are also necessary because of the complexity of the calculations that are required to make IRT work. Hundreds and hundreds (if not thousands and thousands) of calculations are necessary to create the models that are necessary to test the quality of the outcomes, and while possible without computers, it is impractical. So, IRT may have been invented five decades ago, but we had to wait for technology to catch up.

THIS IS NO REGULAR CURVE: THE ITEM CHARACTERISTIC CURVE
The most fundamental aspect of understanding IRT is what is called the item characteristic curve (ICC), shown in Figure 6.1. Let’s take a look at this sample curve and what it represents. Remember, this is the curve for just one item on a test, and for each item there is one item characteristic curve. The x- or horizontal axis represents the construct, the latent or underlying trait or ability that the individual test taker brings to the item itself, and the “amount” of trait is expressed along that axis. IRT folks call this underlying ability theta, represented by the Greek symbol θ. Average ability is located at 0 on the x-axis (that’s the mean ability level), above-average ability to the right, and below-average ability to the left. In fact, those values are just like the z scores we talked about in Chapter 5.

FIGURE 6.1  The Item Characteristic Curve for One Item (x-axis: Ability, theta or θ, from −6 to 6; y-axis: Probability Correct, from 0.0 to 1.0)

The y- or vertical axis is the probability of getting the item correct, represented as P(θ), or the probability of a correct response given a certain level of ability or θ. You can see it ranges from 0 to 1, or 0% to 100%. The lower this value, the more difficult the particular item. The higher the value, the more likely that the test taker will get it correct. If you think about it, the probability of getting an item correct is the difficulty of the item. Let’s put x and y together. If you follow the curve, you can see how the higher the ability for a person, the more likely (the higher the probability) that they will see success on this one item. The higher the value of θ, the higher the probability of getting a question right. At the lowest levels of ability, the probability of getting a correct response should be very low. At the highest levels of ability, the probability of getting a correct response should be very high. You can see this by examining the curve in Figure 6.1 for one hypothetical item. Look at it for a while. We’ll wait.

TEST ITEMS WE LIKE, AND TEST ITEMS WE DON’T
Within IRT, there are two characteristics of any one item that distinguish items from one another and also allow us to pass judgment on whether the item is a “good” one. Remember that in the IRT world, good items are reliable at different levels of ability. The first characteristic is the difficulty level of an item as noted by the location of the curve along the x-axis. (In IRT, items don’t have just one difficulty level; they have different difficulties for each ability level, a whole curve of difficulties.) The farther to the right the curve lies (toward higher ability), the more difficult the item. The farther to the left (toward lower ability), the easier the item. For example, in Figure 6.2, you can see how, for all the items, the probability of getting the item correct increases as theta, the individual’s level of ability, increases. So, overall, Item 1 is easier than Item 2, which in turn is easier than Item 3. It takes less ability to get Item 1 correct than it does to get Item 2 correct than it does to get Item 3 correct. In other words, given the same level of ability (θ = .5, in this example), the probability of getting the item correct varies as a function of difficulty level. Test developers tend to like items with curves that have a 50% difficulty (a 50% chance of getting an item wrong and a 50% chance of getting an item right) for a person with average ability, or a theta, θ, of 0. (Told you this chapter was kind of hard. Imagine having to deal with statements like “theta, θ, of 0”!) Items with these sorts of curves are more reliable for more people. So, test developers might like Item 1 the best.

FIGURE 6.2  How Items Differ in Their Difficulty Level (curves for Item 1, Item 2, and Item 3; x-axis: Ability, theta or θ, from −6 to 6; y-axis: Probability Correct, from 0.0 to 1.0)

The discrimination level of an item, the second characteristic, is reflected in the steepness of the item characteristic curve. Think about what steepness means here. The steeper the curve, the stronger the relationship between ability and the chance of getting a question right. With a steep curve, even a small change in ability leads to a change in the item difficulty. For a curve that isn’t that steep, a change in ability level doesn’t really affect difficulty that much. A relationship between the level of the construct and performance on a test? That sounds like validity evidence, as we talk about in Chapter 4, right? And it’s okay to think of IRT’s use of the term discrimination as meaning validity! Check out Figure 6.3. The curve for Item 1 is steep and discriminates well. You can see that for the same degree of ability (θ), the probability of success can vary greatly. On the other hand, Item 2 does not discriminate well because the probability of a correct response is very similar regardless of underlying ability. In fact, the curve even seems to go backward a bit as ability increases! That can’t be good. What would perfect discrimination look like? Maybe as shown in Figure 6.4. Imagine this is an item for a test where the purpose is to identify people at the 2.0 level of ability (a theta of 2, which might get the top 2% or so of test takers).

FIGURE 6.3  How Items Differ in Their Discrimination Level (curves for Item 1 and Item 2; x-axis: Ability, theta or θ, from −6 to 6; y-axis: Probability Correct, from 0.0 to 1.0)

FIGURE 6.4  A Perfectly Discriminating Item (x-axis: Ability, theta or θ, from −6 to 6; y-axis: Probability Correct, from 0.0 to 1.0)

In this imaginary example (it’s too good to be true), at an ability level of θ = 2.0, the item discriminates perfectly at that ability level. That is, all the individuals with a theta less than 2 (those below the cutoff) have a very low probability of getting the item correct, and all the individuals with a theta more than 2 (higher-ability folks) have a 100% chance of getting the item correct! That is a very valid item; let’s hire whoever wrote that item for our testing company.

UNDERSTANDING THE CURVE
To be able to describe an IRT curve in the ways that matter most, three different characteristics have been defined, and each curve can differ from others on one or all of them. The slope of the curve, which we mentioned earlier, is the discrimination level, identified with an a. When the curve is steeper, there is a large difference in the probability of a correct response for those whose theta values differ. When the curve is flatter, there is very little difference. So steep curves make for nice a’s. The characteristic that defines the difficulty level of the items is b; b is the theta value at which there is a 50% chance of getting the item correct. A higher value of b indicates that the item is more difficult than a lower value of b. When the b, or difficulty level, is less than 0 (remember, a theta of 0 is average, just like with z scores), lower-ability-level individuals have a good chance of getting the question correct (these are fairly easy items), and when b is greater than 0, it takes a higher ability level to get it right (these are harder items). Finally, there is a third characteristic of an item, c. It is, essentially, the probability of a correct response for test takers who have a low ability level and are basically just guessing. So c is basically the chance of guessing the right answer. This is very similar to the situation with the four answer options on a multiple-choice question, in which you can just randomly guess and still have a 25% chance of getting the item right. With a multiple-choice question, in real life, even a low-ability test taker has a decent chance of getting the question correct, so IRT folks take that into account and use c to describe it. Lawrence Rudner created a terrific applet, PARAM, for varying these three parameters, a, b, and c, and seeing what the item characteristic curve looks like. We will use it to look at some of the following examples of how the item characteristic curve changes.
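To see how a, b, and c shape a curve, here is a minimal sketch. The chapter does not give an equation for the curve, so the logistic form below (the commonly used three-parameter logistic model) is an assumption, and `prob_correct` is a made-up function name.

```python
import math


def prob_correct(theta: float, a: float, b: float, c: float = 0.0) -> float:
    """Assumed three-parameter logistic curve.

    a = discrimination (steepness), b = difficulty (location on theta),
    c = guessing (floor for very low-ability test takers).
    """
    return c + (1 - c) / (1 + math.exp(-a * (theta - b)))


# With no guessing, a test taker whose theta equals b has a 50% chance.
print(f"{prob_correct(theta=0.0, a=1.5, b=0.0):.2f}")

# With a four-option multiple-choice item (c = .25), a very low-ability
# test taker still gets the item right about 25% of the time.
print(f"{prob_correct(theta=-6.0, a=1.5, b=0.0, c=0.25):.2f}")
```

One detail worth noticing under this assumed model: when c is greater than 0, the probability at theta = b is halfway between c and 1 rather than exactly 50%.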
For our purposes and at this basic level, we are going to pay attention to only the discrimination (a) and difficulty (b) values.

In the following item, discrimination or a is 2.0, and difficulty or b is –2.5.

[Figure: Item Response Function (x-axis: Underlying Ability or Theta, from −3 to 3; y-axis: Probability Correct, from 0 to 1)]

And this item has an a value of 2.0 and a b value of 2.0:

[Figure: Item Response Function (x-axis: Underlying Ability or Theta, from −3 to 3; y-axis: Probability Correct, from 0 to 1)]

We can use the values of a and b to make decisions about the worthiness of an item. And as those items are used and evaluated, they can be refined, placed once again in the test pool of items, and reevaluated until they meet the criteria we use to define good items. For example, the first item above is an “easy” item with a low level of difficulty equal to a theta of –2.5 (you don’t need to know much to have a high probability of getting this item correct) and discriminates fairly well between test takers with a fairly steep slope. On the other hand, the second item is a more “difficult” item with a difficulty level equal to 2.0 (see how the curve is placed toward the right side of the x-axis?) but doesn’t discriminate quite as well, with a slightly less pronounced slope.
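The two example items can be compared numerically. This sketch assumes a two-parameter logistic curve (no guessing), which is not stated in the chapter, and uses the a and b values given for the two items; the labels and the function name are illustrative.

```python
import math


def prob_correct(theta: float, a: float, b: float) -> float:
    """Assumed two-parameter logistic curve (guessing parameter left out)."""
    return 1 / (1 + math.exp(-a * (theta - b)))


# (a, b) values taken from the two example items in the text.
items = {"easy item": (2.0, -2.5), "hard item": (2.0, 2.0)}

for theta in range(-3, 4):
    cells = ", ".join(
        f"{name}: {prob_correct(theta, a, b):.2f}" for name, (a, b) in items.items()
    )
    print(f"theta = {theta:+d} -> {cells}")
```

Under this assumed formula, an average test taker (theta = 0) is almost certain to get the easy item right and very unlikely to get the hard item right. Two items with the same a value have curves of the same shape that differ only in where they sit along the theta axis.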


Putting a, b, and c Together
The values of a and b (and c, the guessing parameter) are mostly used in the assessment of how well items are working and which need to be revised. The steps for the creation of a test consisting of many items and using IRT as a tool would be as follows:
1. Items are created by the test developer.
2. Items are included on a test.
3. The test is administered to a very large representative sample.
4. IRT is used to evaluate the usefulness of each item, and how well it “works.”
5. If necessary (which it almost always is), the test item is refined, rewritten, redrafted, and so on, until there is another opportunity to use the item and revise as necessary according to the new values of a, b, and c.
When is this process of item (and test) development complete? When each item fits the difficulty and discrimination level that the test author feels adequate. This is a subjective decision because how well an item works, in terms of reliability and validity, depends on the ability level of the individual test taker. So, an item might work well for high-ability people but not as well for low-ability people. There is another item characteristic curve function that assesses the amount of information revealed through an IRT analysis. It’s called the information function and essentially summarizes the usefulness of an item or a test made up of a bunch of specific items. This topic is beyond the scope of our discussion here, but it is an example of how IRT has powers that classical test development methods do not; with IRT one can estimate the reliability of a group of items used together without actually collecting data on that particular test or mix of items.
IRT and Computerized Adaptive Testing. Perhaps the neatest thing about IRT is its role in computerized testing or computerized adaptive testing.
Knowing the difficulty and discrimination levels of a bunch of items allows the test administrator to adjust what subsequent items an individual will have to answer, based on how they have done so far. If an item is too difficult (Larry gets it wrong), the computer program can adjust which item is presented next, perhaps one that is easier. The computer program will adjust the presentation of items based on difficulty and discrimination levels to maximize the assessment of the individual’s underlying or latent ability. This means that everyone, in effect, will take a customized test (or at least respond to a somewhat different set of items). Basically, the computer testing software keeps a constant estimate of a test taker’s ability level (their theta) and gives them items that work best for that ability level. That estimate can change during the test administration, and the software picks different items based on that changing estimate. That’s why you and your buddy might sit down at the same time to take the GRE (the graduate school admissions test) and finish at different times and have received different items.
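The adaptive idea can be illustrated with a toy sketch. Everything here is a simplifying assumption rather than how any real testing program works: the item bank and its b values are invented, the next item is simply the unused item whose difficulty is closest to the current theta estimate, and theta moves up or down by a shrinking step instead of being re-estimated statistically.

```python
def next_item(bank: dict[str, float], theta: float, used: list[str]) -> str:
    """Pick the unused item whose difficulty (b) is closest to theta."""
    candidates = [name for name in bank if name not in used]
    return min(candidates, key=lambda name: abs(bank[name] - theta))


def run_toy_cat(bank: dict[str, float], answers: dict[str, bool], n_items: int = 3):
    """Toy adaptive test: raise theta after a right answer, lower it after a wrong one."""
    theta, step, used = 0.0, 1.0, []
    for _ in range(n_items):
        item = next_item(bank, theta, used)
        used.append(item)
        theta += step if answers[item] else -step
        step /= 2  # smaller adjustments as the estimate settles
    return theta, used


# Hypothetical item bank: item name -> difficulty (b).
bank = {"easy": -2.0, "medium_low": -0.8, "medium": 0.0, "medium_high": 1.0, "hard": 2.2}
# Hypothetical test taker who can handle items up to medium difficulty.
answers = {"easy": True, "medium_low": True, "medium": True, "medium_high": False, "hard": False}

theta, used = run_toy_cat(bank, answers)
print(f"Items presented: {used}")
print(f"Final theta estimate: {theta}")
```

A different test taker with different answers would be routed through a different sequence of items, which is the point of the GRE example above.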

ANALYZING TEST DATA USING IRTPRO
As we mentioned earlier, computers were and are critical to the use of IRT. The analysis simply cannot be done without the help of such devices. And as computers have become smaller, more powerful, and more affordable, the software to conduct very sophisticated analyses has become more accessible as well. Such is the case with IRT, and while there are many different programs available to conduct IRT analyses, we will show you how such an analysis can be done with IRTPRO (you can find IRTPRO from Vector Psychometric Group at https://vpgcentral.com/software/irtpro/). IRTPRO is available for Windows 10 (sorry, there is not a Mac version). Your school or instructor might provide access to IRTPRO. If not, there are some free options to play around with to get a sense of Item Response Theory analyses. You can get free demo versions of these programs:
• WINSTEPS at https://www.winsteps.com/index.htm
• Facets, also at https://www.winsteps.com/index.htm
• BILOG, there isn’t a free demo version, but there are downloadable examples of data and analyses at https://ssicentral.com/index.php/products/bilog mg-gen/
• flexMIRT at https://vpgcentral.com/software/flexmirt/
The sample we used for analysis is the result of a 20-item test taken by 500 individuals. Each item can be scored as right (1) or wrong (0). You can see a sample of the item scores in Figure 6.5, where columns are items, rows are individuals, and cell entries are correct (1) or incorrect (0) for any one item. For example, for Individual Test Taker 12, Item 3 was correct (there is a value of 1 entered in that cell). In Figure 6.6, you can see the Unidimensional Analysis dialog box, where the items were added to the analysis. Then the RUN button was clicked, and in Figure 6.7, you can see the results of the analysis.
Sophisticated software programs such as IRTPRO produce a large amount of output, but for our purposes, the output shown in Figure 6.7 gives us lots of information about which items we might keep and which really need to be revised.

FIGURE 6.5   Sample Data for an IRTPRO Analysis

FIGURE 6.6   The Unidimensional Analysis Dialog Box

For example, in Figure 6.7, Item 1 is shown to discriminate well (a = .72) and is of about average difficulty (b = –.08). On the other hand, Item 4 does not discriminate well (a = –.05) and is very easy (b = –2.99). As the test developer, it would be your decision which items you want to keep in the item bank and, of those, which should be revised before they are used again.
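To get a feel for what those a and b values imply, here is a minimal sketch that plugs them into the three-parameter logistic form of the item characteristic curve. It assumes the plain logistic parameterization with no extra scaling constant and sets the guessing parameter c to 0; IRTPRO's internal parameterization may differ, so treat the printed numbers as illustrative only. The function name `icc` is our own.

```python
import math

def icc(theta, a, b, c=0.0):
    """Three-parameter logistic item characteristic curve.

    a = discrimination, b = difficulty, c = guessing (lower asymptote).
    """
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

# Parameter values quoted in the text for Items 1 and 4
item_1 = {"a": 0.72, "b": -0.08}
item_4 = {"a": -0.05, "b": -2.99}

for theta in (-3, 0, 3):
    p1 = icc(theta, **item_1)
    p4 = icc(theta, **item_4)
    print(f"theta={theta:+d}  Item 1: {p1:.2f}  Item 4: {p4:.2f}")
```

Under these assumptions, the probability of answering Item 1 correctly climbs steeply from low to high ability, while the curve for Item 4 is nearly flat. A slope that close to zero means the item tells us almost nothing about who has more or less of the trait.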

FIGURE 6.7   Some IRTPRO Output

Seeing Is Believing

It’s enough to know what the values of a, b, and c are, but it’s even better to be able to see them as a representation of how “good” an item is. In Figure 6.8, you can see the actual IRTPRO output for an item characteristic curve (ICC) of what is a relatively decent item. Much like the “perfect” ICC we showed you in Figure 6.1, the curve for this item has many of the same characteristics, including good discrimination and a reasonable difficulty level. In contrast, in Figure 6.9 you can see the graphical representation of an unattractive item, one that has little discrimination power and a probability of a correct response that barely changes as a function of ability.

FIGURE 6.8   A Graphical Representation of a “Good” Item (x-axis: Theta, −3 to 3; y-axis: Probability, 0.0 to 1.0)

FIGURE 6.9   A Graphical Representation of a “Poor” Item (x-axis: Theta, −3 to 3; y-axis: Probability, 0.0 to 1.0)

Summary

Okay, this is not the easiest material to understand, and this has been a very brief overview of what item response analysis is and how it might be used in tests and measurement practice. Item Response Theory’s main advantage is that it examines item responses instead of people responses; the emphasis is on the characteristics of the items and not the sources of errors emanating from people taking the test or the physical structure of the test itself. Even though you may never use it, you need to know about it, and this brief introduction (we hope) covered the basics.

Time to Practice

1. What is IRT, and what advantages does it have over Classical Test Theory?
2. Why does IRT need a large number of respondents to be a valid way of analyzing test responses?
3. Go to the library (in person or online) and find an article that used IRT in the analysis. Then answer the following questions:
   a. What was the purpose of the study?
   b. How was IRT used?
   c. What conclusions did the author(s) reach given the IRT analysis?

Want to Know More?

Further Readings

• Thomas, M. (2011). The value of item response theory in clinical assessment: A review. Assessment, 18, 291–307.

IRT is primarily used in the measurement of ability and achievement and not as much in the area of clinical assessment. This article reviews the use of IRT in those clinical areas and discusses how the use of IRT in clinical assessment settings may hold great promise.

• Toland, M. (2013). Practical guide to conducting an item response theory analysis. Journal of Early Adolescence, 34, 120–151.

Here’s another introduction to IRT that appears in a journal not targeted at psychometricians or tests and measurement experts but, rather, people who focus on adolescent development. Often, these reviews targeted at audiences other than the primary researchers in the field can be very helpful.

• Van Hauwaert, S. M., Schimpf, C. H., & Azevedo, F. (2020). The measurement of populist attitudes: Testing cross-national scales using item response theory. Politics, 40(1), 3–21.

The researchers are interested in whether a 6-item or 8-item scale works better to measure populism in European populations. They used item response theory and test results from more than 18,000 test takers to conclude that in some ways the 8-item version worked better.


And on Some Interesting Websites

• Tips on IRT and how it can be applied and when it is best used can be found at https://support.sas.com/resources/papers/proceedings14/SAS364-2014.pdf, in a paper titled “Item Response Theory: What It Is and How You Can Use the IRT Procedure to Apply It.”

• This online resource from the Stata software people explains a lot of IRT basics. Check it out at https://www.stata.com/features/overview/irt/.

And in the Real Testing World

Real World 1

One of the most interesting things about the development and dissemination of new techniques and ideas is how these techniques spread to different disciplines. So while IRT was initially the province of test makers and evaluators, its use has become commonplace in a variety of disciplines. In this study, the author talks about the use of IRT within the framework of studying adolescence. Though it’s an older article, the basic IRT techniques it covers are up-to-date.

Want to know more? Toland, M. (2013). Practical guide to conducting an item response theory analysis. Journal of Early Adolescence, 34, 120–151.

Real World 2

Item Response Theory isn’t only used to make tests that measure concrete objective knowledge; it can also be used for more abstract traits like creativity! Indonesian researchers measured creative thinking among high school students in a variety of aspects. Among other differences, they found that females were better at developing new ideas and males were better at criticizing new ideas.

Want to know more? Istiyono, E., & Hamdi, S. (2020). Measuring creative thinking skills of senior high school male and female students in physics (CTSP) using the IRT-based PhysTCreTS. Journal of Turkish Science Education, 17(4), 578–590.

Real World 3

When it comes to assessing adolescents’ problem behaviors, it matters who you ask. The person who is reporting the behaviors (as in, responding to questions on a scale) might be the adolescent or a parent or a teacher, right? Researchers in Japan used a test of child behaviors (Strengths and Difficulties Questionnaire) and compared parent and child responses to see if they agreed. IRT was used in the analysis. When the behaviors are directly observable (e.g., fighting or stealing) parents and students tended to agree, but when they were “internalizing,” unobservable behaviors (e.g., being reflective or getting angry), there was disagreement.

Want to know more? Iwata, N., Kumagai, R., & Saeki, I. (2020). Do mothers and fathers assess their children’s behavioral problems in the same way as do their children? An IRT investigation on the Strengths and Difficulties Questionnaire. Japanese Psychological Research, 62(2), 87–100.

PART II

TYPES OF TESTS

You just finished learning the basics of tests and measurement: cool concepts, big ideas, and a lot of math. So now it’s time to talk about all the things in the real world that we want to measure and the variety of interesting ways we can choose to measure them.

The rubber really meets the road when we start talking about the areas of behavior, growth, learning, human performance, and what have you (see the What Have You Test from Salkind and Frey, Inc.) that we want to measure. Perhaps we want to find out how aggressive someone is (personality), how “smart” they are (ability), or for what career their skills and interests would be best suited (career development). Now it’s time to turn our attention to different types of tests and what they do. The best way to do this is to tell you a bit about what kinds of tests there are (such as achievement and personality tests), what they measure, and how they are constructed. This information won’t make you the newest psychometric tsar on your block or in your dorm, but what it will do is give you a very basic and pretty broad overview of just what kinds of tests are out there, what they look like, and how they work.

And, because you are starting to get pretty smart about this stuff, for each of these types of tests, we will take a special look at how the concepts of validity and reliability apply to them. Each time we do this, we will remind you that validity is a unitary concept, and all the types of validity arguments and evidence can come into play in determining the validity of a measure, but certain validity approaches seem particularly important for different tests with different purposes. Reliability works the same way; there are different ways to estimate how much randomness there is in the scores from a test, and all approaches give us some good information, but the way different tests are scored makes some types of reliability concerns more important than others. So, look for the Validity and Reliability sections in each chapter.


Here’s the simple skinny on what each type of test does.

• Achievement tests (which we cover in Chapter 7) assess knowledge and previous learning.
• Aptitude tests (which we cover in Chapter 8) test the potential to learn or acquire a skill.
• Intelligence and other ability tests (which we cover in Chapter 9) assess what is sometimes called IQ and other mental abilities.
• Personality and other psychological tests (which we cover in Chapter 10) measure a person’s enduring characteristics and disposition. Chapter 10 also talks about neuropsychological assessment and measures of cognitive functioning, including brain damage and brain disease.
• Career choice tests (which we cover in Chapter 11) cover the types of tests that are taken to determine levels of interest in various occupations and the skills associated with those occupations.

The goal of these next five chapters is to increase your skill and understanding for designing or using tests effectively. (And that, we think, is the very reason you’re enrolled in this class!)

7

ACHIEVEMENT TESTS

Is Life a Multiple-Choice Test?

Difficulty Index ☺ ☺ ☺ (moderately easy)

LEARNING OBJECTIVES

After reading this chapter, you should be able to

• Describe how achievement tests are used.
• Explain how criterion-referenced tests and norm-referenced tests are interpreted.
• Provide the steps for the development of standardized tests.
• Create a table of specifications.
• Compare and contrast several major achievement tests.
• Consider the validity and reliability of achievement tests.

Charles Schulz, creator of the comic strip Peanuts, with Charlie Brown and Snoopy and the gang, is credited with writing, “Sometimes I lie awake at night and I ask, ‘Is life a multiple choice test or is it a true or false test?’ . . . Then a voice comes to me out of the dark and says, ‘We hate to tell you this but life is a thousand word essay!’” For some of us, this feeling really hits home as we recall the anxiety of school tests and those standardized “high-stakes” achievement tests like the SAT and the ACT. And it highlights the many ways the world has designed to measure learning.


HOW ACHIEVEMENT TESTS ARE USED

By the time you get to high school, and even more so college, you have probably taken thousands of achievement tests. They all test how much you know. And for the most part, achievement tests are administered in educational (school) or training institutions.

Achievement tests measure how much someone knows or has learned: how much knowledge an individual has in a particular subject area, be it mathematics, reading, history, auto mechanics, biology, or culinary science. Achievement tests are almost certainly the most common type of test you’ve taken in your life. More of these are administered each year than any other type of test, by far. The category of achievement tests includes not only those standardized college admission tests and those end-of-the-year “state tests” that many of us took in middle and high school but also most of the classroom assessments you were subject to (spelling tests, pop quizzes, term papers, final exams, and so on). Classroom assessment, though, we cover separately in Chapters 12 and 13.

Achievement tests have some very well-defined purposes. You already know that the first purpose is to measure or assess how much is known about a certain topic. But other purposes are equally important, and as you will see, these do overlap with one another.

• Achievement tests help define the particular areas that are important to assess. They help pinpoint those topics that are important and those that are not. This can be done through what is called a table of specifications, which we will talk about later in this chapter.
• Achievement tests indicate whether an individual has accomplished or achieved the necessary knowledge to move to the next step in study. Passing such a test might be a prerequisite to move on to a more advanced course. Perhaps at your school you can “test out” of Calculus 1 or Biology 101 by taking a test designed just for that purpose: to see if you have the prerequisite knowledge to move on.
• Achievement tests can allow for the grouping of individuals into certain skill areas. If there is an accurate assessment of a student’s skill, instruction can be targeted more precisely at current levels of achievement and further achievement can be facilitated.
• Achievement tests may be used diagnostically in that they help identify weaknesses and strengths. Once a test taker’s weaker areas are identified, it’s so much easier to help that individual by targeting those specific areas for remediation.
• Finally, achievement tests can be used to assess the success of a program on a school- or district-wide basis. Such an assessment or evaluation can help teachers, administrators, and trainers figure out where they are being successful and what areas or strategies might need improvement.

Achievement tests are pretty similar in what they do: They assess knowledge. But they differ in some important ways as well. Here’s how.

Teacher-Made (or Researcher-Made) Tests Versus Standardized Achievement Tests

Teacher-made tests are constructed by a teacher, and the effort placed on establishing validity or reliability, norming, or the development of scoring systems varies from nonexistent to thorough. The midterm you took in your introductory psychology class was probably a teacher-made, multiple-choice test. There’s nothing at all wrong with teacher-made tests; they are just very situation specific and defined to suit a particular need. The same is true of researcher-made measures: some researchers spend a lot of energy validating their instruments, others don’t think twice.

A standardized test is one that has undergone extensive test development, meaning the writing and rewriting of items; hundreds of administrations; development of reliability and validity data; norming with what are sometimes very large groups of test takers (for example, upward of 100,000 for the California Achievement Tests); and development of consistent directions, administration procedures, and very clear scoring instructions. Technically, a standardized test is any test that is administered under a standard set of conditions and rules. That includes, when you think about it, the fourth-grade math test that Mrs. Nelson gave Bruce when he was 9, but in practice, when we say standardized test, we mean those large-scale professionally built tests that are used to make important decisions about individuals.

Most achievement tests that are administered in school for the purposes we listed earlier are standardized. Standardization is a very long, expensive, and detailed process, but it is the gold standard for creating achievement tests that are reliable, valid, and appropriate for the population of test takers being assessed. Standardized tests such as the Iowa Test of Basic Skills (born in 1935) are also usually published by a commercial establishment (in this case, Houghton Mifflin Harcourt).

Standardization mostly has to do with the way the tests are administered. The majority of achievement tests are administered in a group setting (such as in a classroom), although some achievement tests are given individually (especially those that are used for diagnostic purposes or those that are used to evaluate the status or progress of individuals who are being assessed to see if they would benefit from special education programming).


CRITERION- VERSUS NORM-REFERENCED TESTS

Achievement tests can be norm-referenced or criterion-referenced. Norm-referenced tests (a term coined by psychologist Robert Glaser in 1963) allow you to compare one individual’s test performance with the test performance of other individuals. For example, if an 8-year-old student receives a score of 56 on a mathematics test, you can use the norms that are supplied with the test to determine that child’s placement relative to other 8-year-olds. Standardized tests are usually accompanied by norms, but this is rarely the case for teacher- or researcher-made tests.

Criterion-referenced tests (thank you, Dr. Glaser, once again) define a specific criterion or level of performance, and the only aspect of importance is the individual’s performance, regardless of where that performance might stand in comparison with others. In this case, performance is defined as a function of mastery of some content domain. For example, if you were to specify a set of objectives for 12th-grade history and specify that students must show command of 90% of those objectives to pass, then you would be implying that the criterion is 90% mastery.

Norm-Referenced or Criterion-Referenced Interpretations?

There’s an old joke (that we have used elsewhere in another book, but we love it enough to tell it again). Bruce and Neil are walking in the woods and a grizzly bear jumps out and starts chasing them. They both run as fast as they can to avoid getting eaten. “Run as fast as you can! You need to run very fast to escape!” Bruce yelled. “No, I don’t,” said Neil. “I only have to run faster than you!” Notice that Bruce believes that success is criterion-referenced, but Neil understands that, for this test, success is norm-referenced.

It is more and more common for standardized tests to be criterion-referenced in their approach and in their scoring. It is not easy to make a good criterion-referenced measure.

To begin with, there is always the issue of what is meant by the word criterion (the same problem as with criterion validity, by the way). More often than not, the criterion is a cut score or a particular score; if the test taker exceeds it, then the criterion is met. If that score is not exceeded, then the criterion is not met. Sort of a binary, 1–0, pass–fail arrangement. But to make this even more complex, the criterion is often not “getting 95% correct on the test” but, rather, something like “understanding the role of the underground railroad in the contentious debate about slavery in pre–Civil War America,” and one would expect someone who meets it to answer 80% or 90% of the questions about this topic correctly.
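The two kinds of interpretation can be put side by side in a short sketch. Everything here is hypothetical: the norm group, the 70-item test length, and the 90% cut are made up for illustration, and the code treats reaching the cut (not only exceeding it) as meeting the criterion, which is an assumption on our part.

```python
def percentile_rank(score, norm_scores):
    """Percentage of the norm group scoring below the given score."""
    below = sum(1 for s in norm_scores if s < score)
    return 100.0 * below / len(norm_scores)

def meets_criterion(items_correct, items_total, cut_proportion):
    """True when the proportion correct reaches the cut (assumed rule)."""
    return items_correct / items_total >= cut_proportion

# Hypothetical norm group: math scores for ten other 8-year-olds
NORMS = [38, 42, 45, 49, 51, 53, 55, 58, 61, 66]

rank = percentile_rank(56, NORMS)        # norm-referenced interpretation
passed = meets_criterion(56, 70, 0.90)   # criterion-referenced interpretation
print(rank, passed)
```

The same score of 56 looks good next to the norm group and still falls short of the mastery criterion, which is the difference between running faster than Bruce and running fast enough to escape the bear.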


And you can imagine the differences in opinion about what constitutes an adequate criterion. If driving a car is the criterion, then we know (and can almost all agree) that parallel parking, signaling before a turn, and other basic skills should be part of such a test. But should a high school senior know beginning, intermediate, or advanced algebra, and what skills are associated with that knowledge? These are all subjective judgments that may be easy or difficult to defend.

Next, the use of criterion-referenced tests also fits nicely into the ever-changing and ever-moving landscape of testing that is defined as high stakes (see Chapter 16), where a lot rides on success or failure. Here, passing or failing (if that’s the criterion associated with a specific cut score) may be related to funding, recognition, teachers’ salaries, and other important outcomes related to test performance.

You’ve probably heard about high-stakes tests and likely have even taken some. A high-stakes test is one where the results are used to make important decisions about students, teachers, schools, and almost any element of the broadly defined educational system. A high-stakes test does not differ from a low-stakes test in format or even subject matter. What counts is the way the results are used. A low-stakes test result might be used to assign a weekly grade in a classroom (or to record your attitude toward the pizza you just ordered), while the results from a high-stakes test might be used to determine if a child moves on to the next grade, whether a police officer gets certified, or if a medical student passes their national boards. The big deal here is that in spite of movements to get away from overtesting, high-stakes tests are still administered, and these scores are used to make important decisions.

And finally, there are many different ways to score a criterion-referenced test. For example, any one of the following may be used:

• Checklists
• Rating scales
• Grades
• Rubrics
• Percentage accurate

But the method of scoring that is chosen has implications for how the final test score might be used. For example, grades are more or less a fairly universal indicator of performance, and grade point average (GPA) becomes a measure of success in school (whether used correctly or not). And we know that most grades have a subjective component built into them, such as “Did the student improve over time?” or “The final is weighted more than any other exam during the semester.”


And rating scales and checklists all have subjective components that introduce even more human error (and less reliability, remember), so these tests need to be used and graded very carefully. Criterion-referenced tests use that one measure of “success,” but what that criterion is can be very different for different tests. For example, on an achievement test, it might be a certain percentage of correct answers. But the criterion can also be how fast a certain number of items can be completed, how accurate the individual’s performance is, or the quality of the responses. The specific criterion used, of course, depends on the goal of the test and the assessment objectives.

HOW TO DO IT: THE ABCS OF CREATING A STANDARDIZED TEST

Warning: this is a brief overview of a very time-intensive, expensive, and demanding experience. Understanding these steps will surely help you appreciate how refined and well planned these tests are and, to a large extent, why you can have a high degree of confidence in them when you participate in their administration. So although you may not go into the business of creating achievement tests, you should have some idea of how it’s done so you can appreciate the amount of work and the intensity of the experience. A more informed consumer makes a better practitioner.

With a goal of ending up with a valid and reliable test, there is a common set of steps that most developers of standardized achievement tests follow. The particular framework that follows is based on the work of Gary J. Robertson, who presented a practical model for achievement test development in the Handbook of Psychological and Educational Assessment of Children: Intelligence and Achievement (1990).

First, there is the development of preliminary ideas. This is the stage where the test developer (perhaps a professor like the one teaching this class) considers the possible topics that might be covered, the level of coverage, scope, and every other factor that relates to what may be on the finished test. For example, the test might cover high school–level biology or elementary school–level reading.

Second, test specifications should be developed. This is a complex process that allows the test developer to understand the relationship between the level of items (along one of many dimensions) and the content of the items. We’ll show you more about developing specifications in the next section of this chapter. These specifications (usually tables) are typically created based on curriculum guides and other information that informs the test developers as to what content is covered in what grades.


Third, the items are written. In most cases, a large pool of item writers (who are experts in the content area; remember content validity?) are hired and asked to write many, many items. Those items that follow good item-writing rules (more about them in Chapter 12) are retained. These items are then reviewed for bias and validity across cultures.

Fourth, the items are used in a trial setting. Instructions are drafted, participants are located, actual preliminary tests are constructed, items are tried, and items are analyzed (remember Item Response Theory from Chapter 6?). This is a l-o-o-o-ng process and very expensive. At the end of this stage, these items are reviewed and the ones that don’t “work well” are revised or discarded. Questions that don’t work well might be identified based on Classical Test Theory reliability, their difficulty, Item Response Theory characteristics (see Chapter 6), and so on.

Fifth, the test developers rewrite items that did not perform very well (with answer options that no one chooses, for example) but still hold some promise for being included. A new item may be written to take this item’s place, or those not-yet-acceptable items may be revised. This step in the development of a standardized test is over when the test developers feel that they have a set of items that meets all important criteria to move forward.

Sixth, the final tests are assembled, but it’s not soup yet.

Seventh, an extensive national standardization effort takes place that includes selection of more participants, preparation of the materials, administration of the tests, analysis of the data, and development of norm tables.

Finally, the eighth step involves the preparation of all the necessary materials, including a test manual, the preparation of the actual test forms, and printing. Phew!
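The item review in the fourth and fifth steps can be sketched with two classical statistics: item difficulty (the proportion of test takers who answered correctly) and a discrimination index (here, the correlation between an item and the total score on the remaining items). This is a rough illustration only. The response data are made up, and the cutoff values in `flag_items` are hypothetical rules of thumb, not standards taken from the text.

```python
from statistics import mean

def pearson(x, y):
    """Pearson correlation; returns 0.0 when either variable has no variance."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / (sxx * syy) ** 0.5

def item_stats(responses):
    """responses: one row per test taker, each a list of 0/1 item scores."""
    stats = []
    for j in range(len(responses[0])):
        item = [row[j] for row in responses]
        rest = [sum(row) - row[j] for row in responses]  # total without item j
        stats.append({
            "item": j + 1,
            "difficulty": mean(item),
            "discrimination": pearson(item, rest),
        })
    return stats

def flag_items(stats, min_p=0.2, max_p=0.9, min_r=0.2):
    """Flag items that are too easy, too hard, or poorly discriminating."""
    return [s["item"] for s in stats
            if not (min_p <= s["difficulty"] <= max_p)
            or s["discrimination"] < min_r]

# Hypothetical trial data: 6 test takers by 4 items
RESPONSES = [
    [1, 1, 1, 0],
    [1, 1, 1, 0],
    [1, 1, 1, 1],
    [1, 0, 0, 1],
    [1, 0, 0, 1],
    [1, 0, 0, 0],
]

stats = item_stats(RESPONSES)
flagged = flag_items(stats)
print(flagged)
```

In this made-up data set, Item 1 is flagged because everyone gets it right, and Item 4 is flagged because people who do well on the rest of the test tend to miss it. Those are the kinds of items a developer would revise or discard.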
These eight steps not only take a lot of time (it can be years and years) and cost a lot of money, but because the nature of knowledge changes so often (what’s expected of sixth graders in math, for example, or the newest mechanical specs for low-pollution engines), these tests often have to be rewritten and renormed, and so forth: lots and lots of work. Consumers demand high levels of validity and reliability for these high-stakes tests, though (as they should), so these iterative steps are necessary.

There are a few more steps we could throw in, including what occurs if the test developers want to publish the test. Because creating the test is such an expensive undertaking, publishing it usually involves including a for-profit company in the process. And more often than not, these publishers are involved very early in the game and provide funding for the test developers to take time off from their other teaching and research responsibilities. For this funding, of course, the publisher gets to sell the test and keep the majority of the profits.


IRT or Not IRT?

Scoring is scoring is scoring, right? Well, sort of. If a student gets a score of 89 out of 100, then they got 89% correct, and that is “sort of” the standard way in which achievement tests have been scored (and then this raw score can be transformed into a standardized score like in Chapter 5, adjusted for norms, and so on). But there’s another way of looking at scores on achievement tests. Item Response Theory (IRT), which we reviewed in Chapter 6, derives scores based on patterns of responses. Most simply, this means that two students might get the same number of items correct on a test, but because one student may have correctly answered harder questions than the other, they may receive different “scores.” Want to know more about IRT? Start by checking out or reviewing Chapter 6, and then take a few more measurement classes. It’s complex but fascinating stuff. IRT is very “science-y.”
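Here is a minimal sketch of how two students with the same number correct can end up with different IRT scores. It assumes a two-parameter logistic model and estimates ability by a simple grid search for the most likely value of theta; real programs use more refined estimation. The four items and their parameters are hypothetical, and in this example the harder items are also the more discriminating ones. (Under a one-parameter model, where all items discriminate equally, the same number correct would give the same estimate.)

```python
import math

def prob_correct(theta, a, b):
    """Two-parameter logistic model: chance of a correct answer at ability theta."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def log_likelihood(theta, responses, items):
    """Log-likelihood of a 0/1 response pattern at ability theta."""
    total = 0.0
    for u, (a, b) in zip(responses, items):
        p = prob_correct(theta, a, b)
        total += math.log(p) if u == 1 else math.log(1.0 - p)
    return total

def estimate_theta(responses, items):
    """Grid-search maximum likelihood estimate of ability between -4 and 4."""
    grid = [g / 100.0 for g in range(-400, 401)]
    return max(grid, key=lambda t: log_likelihood(t, responses, items))

# Hypothetical items as (discrimination a, difficulty b): two easier, two harder
ITEMS = [(0.8, -1.0), (0.8, -0.5), (1.6, 0.5), (1.6, 1.0)]

student_a = [0, 0, 1, 1]  # missed the easier items, got the harder ones
student_b = [1, 1, 0, 0]  # got the easier items, missed the harder ones

theta_a = estimate_theta(student_a, ITEMS)
theta_b = estimate_theta(student_b, ITEMS)
print(sum(student_a), sum(student_b), theta_a, theta_b)
```

Both students have a raw score of 2, yet the estimated ability for the first student is higher because the items that student answered correctly carry more weight in the model.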

THE AMAZING TABLE OF SPECIFICATIONS

The table of specifications is a clever idea and a terrific guideline when it comes to the creation of any kind of achievement test, be it teacher made or standardized. In the simplest terms, a table of specifications is a grid (with either one or two dimensions, and we’ll get to that soon) that serves as a guide to the construction of an achievement test. As you can see in what follows, one of the axes can represent various learning outcomes that are expected or the areas of content that are to be covered. The other axis can represent some dimension that reflects the different levels of questions, and this can be anything from the amount of time spent on the topic in the classroom to a very cool taxonomy of educational objectives, such as the one proposed by Benjamin Bloom.

Here’s a very simple table of specifications for a midterm in a tests and measurement course (like the one you might get!).

Topic                         Amount of Time Spent in Class   Test Items
Measurement scales            35%                             1, 2, 3, 4, 5, 6, 7, 8, 10, 11, 13, 14, 15, 16, 17, 18, 20
Reliability                   15%                             9, 12, 19, 21, 22, 23, 24, 28
Validity                      20%                             25, 26, 27, 29, 30, 31, 32, 33, 34, 35
Writing short-answer items    15%                             36, 37, 38, 39, 41, 42, 44
Creating true–false tests     15%                             40, 43, 45, 46, 47, 48, 49, 50
Total                         100%
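A table like this is easy to check with a few lines of code. The sketch below copies the percentages and item numbers from the table above and compares the number of items listed for each topic with the number its percentage would call for on a 50-item test. The function and variable names are our own.

```python
# Topic -> (percent of class time, item numbers), copied from the table
TABLE = {
    "Measurement scales": (35, [1, 2, 3, 4, 5, 6, 7, 8, 10, 11, 13, 14, 15, 16, 17, 18, 20]),
    "Reliability": (15, [9, 12, 19, 21, 22, 23, 24, 28]),
    "Validity": (20, [25, 26, 27, 29, 30, 31, 32, 33, 34, 35]),
    "Writing short-answer items": (15, [36, 37, 38, 39, 41, 42, 44]),
    "Creating true-false tests": (15, [40, 43, 45, 46, 47, 48, 49, 50]),
}

def check_table(table, n_items):
    """Compare target and actual item counts; confirm each item is used once."""
    rows = []
    for topic, (percent, items) in table.items():
        target = percent * n_items / 100
        rows.append((topic, target, len(items)))
    all_items = sorted(i for _, items in table.values() for i in items)
    complete = all_items == list(range(1, n_items + 1))
    return rows, complete

rows, complete = check_table(TABLE, n_items=50)
for topic, target, actual in rows:
    print(f"{topic:28s} target {target:4.1f}  actual {actual}")
print("Every item assigned exactly once:", complete)
```

Because 35% and 15% of 50 items are not whole numbers, the actual counts land within half an item of their targets rather than matching them exactly.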


In this case, the instructor decided that the amount of time spent in class on a topic was a fair way to decide how much of a test’s content should reflect that topic (and this would be their content-based validity argument if they needed one). Once this grid is created, the instructor knows that 15% of all 50 test questions (and exactly which questions) should deal with reliability and 20% should deal with validity, which also would reflect the amount of time the instructor spent on these topics in class. Then, the questions that are in the particular area of interest are listed as well so the instructor knows which questions relate to which topic area and can be sure the percentages are correct.

In general, there are a couple of ways that tables of specifications are usually designed. Often, they are based on the amount of time (or readings or class activities) spent on a particular topic within a course. And the amount of time is reflected in the percentage of questions that appear. Another popular approach is to decide which topics or ideas are most important and have the test weight those topics more. Commercially produced standardized achievement tests like “state tests” usually weight the number of items from each area based on official state standards and objectives.

Okay, that’s a simple table of specifications. A more sophisticated one would look at topic areas and learning outcomes. In 1956, Benjamin Bloom created his taxonomy of learning outcomes or objectives that can be classified into six different categories of abstraction, from the most basic factual level (knowledge) to the most abstract and cognitively sophisticated level (evaluation). The six levels, from least sophisticated to most sophisticated, are as follows:

1. Knowledge
2. Comprehension
3. Application
4. Analysis
5. Synthesis
6. Evaluation

This is Bloom’s classic original list of different levels of understanding. (In the current generation of educators, these levels have been tweaked a bit. Some teachers call the knowledge level memorized knowledge or remembering, and sometimes the top level is replaced with creativity, and so on.) In Table 7.1, you can see the levels and key words summarized. That is followed by brief summaries of each level and some of the key words you might expect to see in achievement-test questions at each of these levels.


TABLE 7.1  The Six Levels of Bloom’s Taxonomy and Key Words to Look for in Achievement Test Items at Those Levels

Level of Bloom’s Taxonomy    Key Words to Look For
Knowledge                    List, define, tell, describe, identify, show, label, collect, examine, tabulate, quote, name, who, when, where
Comprehension                Summarize, describe, interpret, contrast, predict
Application                  Apply, demonstrate, calculate, complete, illustrate, show
Analysis                     Analyze, separate, order, explain, connect
Synthesis                    Combine, integrate, modify, rearrange, substitute
Evaluation                   Assess, decide, rank, recommend, convince
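
The key words in Table 7.1 lend themselves to a quick lookup. What follows is only a minimal sketch in Python, not a validated classifier: the function name `candidate_levels` is made up for illustration, and matching on single words ignores context. Because some key words (describe, show) appear at more than one level, the sketch returns every level that matches rather than picking one.

```python
import re

# Key words copied from Table 7.1; the lookup itself is an illustrative assumption.
BLOOM_KEY_WORDS = {
    "Knowledge": ["list", "define", "tell", "describe", "identify", "show", "label",
                  "collect", "examine", "tabulate", "quote", "name", "who", "when", "where"],
    "Comprehension": ["summarize", "describe", "interpret", "contrast", "predict"],
    "Application": ["apply", "demonstrate", "calculate", "complete", "illustrate", "show"],
    "Analysis": ["analyze", "separate", "order", "explain", "connect"],
    "Synthesis": ["combine", "integrate", "modify", "rearrange", "substitute"],
    "Evaluation": ["assess", "decide", "rank", "recommend", "convince"],
}


def candidate_levels(item_text):
    """Return every Bloom level whose key words appear in the item text."""
    words = set(re.findall(r"[a-z]+", item_text.lower()))
    return [level for level, keys in BLOOM_KEY_WORDS.items() if words & set(keys)]


print(candidate_levels("List all the reasons why reliability is important"))
# prints ['Knowledge']
```

A human judge still has to settle the ambiguous cases; an item that begins with “Describe,” for instance, comes back as both Knowledge and Comprehension.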

So an item such as “List all the reasons why reliability is important” would be categorized at the lowest level of understanding, the knowledge level. Conversely, an item such as “Convince your instructor why IRT is the best way to create and select items for use on an achievement test” would be placed at the highest level, evaluation, because it requires very deep understanding.

1. Knowledge-level questions focus on, for example, the recall of information; a knowledge of dates, events, and places; and a knowledge of certain major ideas. All these reflect a broad base of facts. Knowledge-based questions might contain words such as list, define, tell, describe, identify, show, label, collect, examine, tabulate, quote, name, who, when, and where, all, of course, in question format.

2. Comprehension-level questions focus on the understanding (not just the knowledge) of information and require the test taker to interpret facts, compare and contrast different facts, infer cause and effect, and predict the consequences of a certain event. Comprehension-based questions might contain words such as summarize, describe, interpret, contrast, and predict.

3. Application-level questions require the use of information, methods, and concepts, as well as problem solving. This level of understanding means students can solve problems they have never seen before. Application-based questions might contain words such as apply, demonstrate, calculate, complete, illustrate, and show.

4. Analysis-level questions require the test taker to look for and (if successful) see patterns among parts, recognize hidden meanings, and identify the parts of a problem. Analysis-based questions might contain words such as analyze (surprised?), separate, order, explain, and connect (as in associating two ideas).


5. Synthesis-level questions require the test taker to use old ideas to create new ones and to generalize from given facts. Synthesis-based questions might contain words such as combine, integrate, modify, rearrange, and substitute.

6. Evaluation-level questions require that the test taker compare and discriminate between ideas and make choices based on a reasonable and well-thought-out argument. Evaluation-based questions might contain words such as assess, decide, rank, recommend, and convince. That’s what you (or your friend) did on debate teams in high school. Evaluation questions do not lend themselves easily to the multiple-choice format. One has to be a very skilled multiple-choice question writer to create questions at this level.

As you read through these levels, I am sure you will see how much more sophisticated a question is at the higher levels. For example, knowledge-level questions (Level 1) ask for nothing more than memorization, whereas evaluation-level questions (Level 6) require full knowledge of a topic, an understanding of the relationships between ideas, and the ability to integrate and evaluate them all.

For example, here’s a multiple-choice, knowledge-based question, the least sophisticated of all six levels:

1. What are the ingredients in a roux?
   A. Flour, liquid, and salt
   B. Flour, salt, and mushrooms
   C. Flour, liquid, and shortening
   D. Flour, shortening, and ground nuts

For all you culinary fans, the answer is C, and a roux is used to thicken a sauce, such as a white sauce. Here’s an example (on the other end of the continuum) of an evaluation-based question, the most sophisticated of all six levels:

Create a recipe where a roux is a central part of the finished dish, and explain when a roux is essential for a successful sauce.

Earlier we showed you a simple table of specifications. What follows is a much more sophisticated one that doesn’t focus on the amount of time spent on a particular topic but instead focuses on the level of Bloom’s taxonomy. The same measurement topics are being taught, but this time, the items and the various levels of Bloom’s taxonomy that they represent are shown as table entries.


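[Table of specifications by level of Bloom’s taxonomy. Rows are topics, columns are levels of the taxonomy, and each cell lists item numbers and the percentage of the test they represent.]

Topic: Measurement scales; Reliability; Validity; Writing short-answer items; Creating true–false tests

Level of Taxonomy: Knowledge; Comprehension; Application; Analysis; Synthesis; Evaluation

Cell entries, in the order they appear: 1, 2, 3, 4, 5 (10%); 6, 7, 8, 10, 11 (10%); 13, 14 (4%); 15, 16 (4%); 17, 18 (4%); 20 (2%); 9, 12 (4%); 21, 22 (4%); 23, 24 (4%); 19 (2%); 25, 26, 27 (6%); 29, 30, 31 (6%); 32 (2%); 36, 37 (4%); 38, 39 (4%); 41, 42, 44 (6%); 40, 43, 45, 46 (8%); 47, 48 (4%); 49 (2%); 28 (2%); 33, 34 (4%); 35 (2%); 50 (2%)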

As you can see, there are no synthesis or evaluation questions for the topics of writing short-answer questions or writing true–false items. In this case, the instructor did not have learning objectives at this level, and the test, as it should, does not reflect such.

Who’s Literate?

This surely is the age of information. No longer do students (like you) need to go to the library to check out books on the topic of an assigned research paper. These days, everything can be accessed online. Now, to assess how “literate” students are in their use of electronic resources, we have the Information and Communication Technology Literacy Assessment from our friends at the Educational Testing Service (who also bring us the SAT and GRE tests). This test measures students’ ability to manage exercises such as sorting email messages or manipulating tables and charts and also assesses how well they organize and interpret information. It’s only one of many signs that we are leaving the industrial age for the information age.
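
Returning to the tables of specifications described in this chapter, the arithmetic of turning topic weights into item counts is worth seeing once. Note that 15% of 50 questions is 7.5, which is not a whole number of questions, so some rounding rule is needed. The sketch below is a minimal Python illustration that uses largest-remainder rounding; that rule is our assumption, not something the chapter prescribes. Only the 15% for reliability and 20% for validity come from the text; the other three weights, and the name `allocate_items`, are hypothetical.

```python
def allocate_items(weights, total_items):
    """Split total_items across topics in proportion to their weights.

    Counts are rounded down first; any leftover items go to the topics
    with the largest fractional parts (largest-remainder rounding).
    """
    total_weight = sum(weights.values())
    exact = {topic: total_items * w / total_weight for topic, w in weights.items()}
    counts = {topic: int(share) for topic, share in exact.items()}
    leftover = total_items - sum(counts.values())
    by_remainder = sorted(exact, key=lambda t: exact[t] - counts[t], reverse=True)
    for topic in by_remainder[:leftover]:
        counts[topic] += 1
    return counts


# Percent of class time per topic. Reliability (15) and Validity (20) follow the
# text; the remaining weights are made up so that the total is 100.
weights = {
    "Measurement scales": 25,
    "Reliability": 15,
    "Validity": 20,
    "Writing short-answer items": 20,
    "Creating true-false tests": 20,
}

plan = allocate_items(weights, 50)
for topic, n_items in plan.items():
    print(f"{topic}: {n_items} items")
```

Whatever rounding rule is chosen, the counts should add up to the length of the test, and each topic should land within one item of its exact share.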

WHAT THEY ARE: A SAMPLING OF ACHIEVEMENT TESTS AND WHAT THEY DO

You’re taking this class to learn about tests and measurement and not necessarily to become a psychometrician (one who designs and analyzes tests), but still it’s good to be familiar with what some of the most popular and successful achievement tests are.


As part of each chapter from this one through Chapter 11 on career choices, we’ll be providing you with an overview of some of the most common tests used (at least in the United States) over the past 50 years and still very much in use today. You can see the set of tests for this chapter in Table 7.2. Also note that most tests are now available online and even in mobile editions. So not only are they more convenient and easily available to take, but scoring has also been greatly simplified. Often more is better, but beware the ease with which online tests can be administered and scored. Such convenience does not release the test giver from having a very good understanding of what the test does and how the final scores should be interpreted. Sometimes the ease with which we can test gets the better of our judgment and we no longer use these tests as tools, as we have described them in earlier chapters in this book, but as quick answers to complex questions. Beware! As you continue your education, you are bound to run into these tests in one setting or another. And now you’ll know something about them—isn’t school great?

VALIDITY AND RELIABILITY OF ACHIEVEMENT TESTS

Validity is a unitary concept, and all types of validity arguments and evidence can come into play in determining validity of a measure, but certain validity approaches seem particularly important for different tests with different purposes. Achievement tests are almost always created using a table of specifications or some other structured blueprint of what topics should be covered, and that table is based on standards or expert advice, some accepted set of criteria for what questions should be on the test. So, it is content-based validity that is usually most important when validating an achievement test.

In terms of reliability, the scoring for the standardized achievement tests we are focusing on in this chapter is usually, literally, done by a computer, so the kind of random scoring errors that occur when humans are using judgment to assign points isn’t possible. Consequently, interrater reliability is not an issue. (Some achievement tests have an essay portion that is scored by humans, and they DO do a lot of training and research to keep high interrater agreement among those humans.) These sorts of tests usually rely on internal consistency estimates of reliability, that is, coefficient alpha. And because these tests are long with a gazillion questions, they almost always have very high reliability—we are talking in the .90s.
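
To make the last two points concrete, here is a small Python sketch, offered as an illustration rather than as how any test publisher actually does it. The first function computes coefficient alpha from its usual formula, k/(k − 1) × (1 − sum of item variances / variance of total scores). The second uses the Spearman–Brown formula, which this section does not mention and which assumes the added items behave like the existing ones, to show why longer tests tend to be more reliable. The tiny response matrix and both function names are made up.

```python
from statistics import pvariance


def coefficient_alpha(scores):
    """Coefficient alpha for a matrix of item scores.

    scores: one row per test taker, one column per item.
    """
    k = len(scores[0])
    item_variances = [pvariance([row[i] for row in scores]) for i in range(k)]
    total_variance = pvariance([sum(row) for row in scores])
    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)


def spearman_brown(reliability, length_factor):
    """Projected reliability if the test were made length_factor times as long."""
    return (length_factor * reliability) / (1 + (length_factor - 1) * reliability)


# Hypothetical data: 5 test takers, 4 items, scored 1 (right) or 0 (wrong).
responses = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]

alpha = coefficient_alpha(responses)
print(round(alpha, 2))                       # prints 0.8
print(round(spearman_brown(alpha, 10), 2))   # prints 0.98
```

In this toy example, a 4-item test with an alpha of .80 would be projected to reach the high .90s if it were ten times as long, which fits the pattern described above for long standardized tests.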


TABLE 7.2  A List of Some Widely Used Achievement Tests

Title/Acronym (or What It’s Often Called): TerraNova
Purpose and What It Tests: “Designed to measure achievement in the basic skills taught in schools throughout the nation.” Reading (visual recognition, word analysis, vocabulary, comprehension); spelling, language (mechanics, expression); mathematics (computation, concepts, and applications); study skills; science; social studies
Grade Levels/Ages Tested: K–12 for the Complete Battery
Versions: The TerraNova comes in two versions. One is the Survey, which contains 20 items per subtest, and one is the Complete Battery, which contains 24–50 items per subtest.
Conceptual Framework: The test developers organized items to reflect six types of thinking processes:
  1. Gathering information
  2. Organizing information
  3. Analyzing information
  4. Generating ideas
  5. Synthesizing elements
  6. Evaluating outcomes
What’s Interesting to Note:
  1. The developers of the test have created a scoring system such that each student can be scored in the six types of thinking processes. So the teacher assesses not only achievement but also the process that goes into thinking about the items on the test. Each student has a set of Integrated Outcome Scores that reflects these thinking processes.
  2. Practice tests are available so that students, especially those in the lower grades, can have some idea of how these tests work and what is expected of them.

Title/Acronym (or What It’s Often Called): Graduate Record Exams–General Test (GRE)
Purpose and What It Tests: “Designed to assess the verbal, quantitative and analytical reasoning abilities of graduate school applicants.” Verbal reasoning, quantitative reasoning, analytical reasoning
Grade Levels/Ages Tested: College juniors and seniors making application to graduate school
Versions: Seven sections that take 30 minutes each, with two verbal, two quantitative, and two analytical sections, and the final section reserved for pretesting new questions and gathering information used in research
Conceptual Framework: The verbal reasoning section, among other things, consists of such skills as analyzing and drawing conclusions, distinguishing major from minor arguments, and understanding the meanings of words. The quantitative sections, among other things, assess the ability to understand quantitative information, solve mathematical problems, and apply basic mathematical skills and concepts. The section on Analytical Writing examines, among other things, the ability to support ideas with relevant evidence and articulate complex ideas.
What’s Interesting to Note:
  1. Very high security practices make it difficult to cheat (cool for the exam giver anyway).
  2. This is the most popular and best-documented test of its kind, and virtually everyone who applies to graduate school takes it.

Title/Acronym (or What It’s Often Called): Iowa Assessments (also known as the Iowa Test of Basic Skills, or ITBS)
Purpose and What It Tests: “To provide a comprehensive assessment of student progress in the basic skills.” The core battery is composed of sections for listening, word analysis, vocabulary, reading, language, and mathematics. The complete battery adds social studies, science, and sources of information. A writing assessment and a listening assessment are available.
Grade Levels/Ages Tested: K–8
Versions: Forms A, B, and C
Conceptual Framework: The ITBS samples those fundamental skills that are necessary for a student to make satisfactory progress through school. This includes higher-order thinking skills such as interpretation, classification, comparison, analysis, and inference.
What’s Interesting to Note:
  1. One of the first in that it was first administered in 1935 as the Iowa Every Pupil Test of Basic Skills.
  2. Lots of extra information is available to test givers, making test administration and interpretation much more meaningful.
  3. This is one test where the authors really tried to base the content and presentation on good curricular planning. For example, the ITBS helps teachers determine which students have the knowledge and skills needed to deal successfully with the curriculum, and it provides information to parents that will enable home and school to work together to best fit the student’s interests.

Title/Acronym (or What It’s Often Called): Tests of General Education Development (GED)
Purpose and What It Tests: The GED was designed to “assess skills representative of the typical outcomes of a traditional high school education.” Areas tested are writing skills, social studies, science, literature and the arts, and mathematics.
Grade Levels/Ages Tested: No level specified
Versions: One
Conceptual Framework: Like many other achievement tests, the GED uses an adaptation of Bloom’s “Taxonomy of Educational Objectives” levels, with items mainly from the categories of comprehension, application, and analysis.
What’s Interesting to Note:
  1. First developed in 1942 to assist veterans who did not have the time to complete high school.
  2. In 1988, more than 700,000 (yikes!) tests were administered with more than 70% of takers passing.
  3. New additions include a writing sample, an emphasis on critical thinking, items related to computer technology, and the assessment of consumer skills in common adult settings.


Summary

Achievement tests are the first kind of test you’ve learned about in Tests & Measurement for People Who (Think They) Hate Tests & Measurement, and they’re also the type of test you are most likely to encounter both as someone taking a test and as someone giving a test. Achievement tests focus basically on knowledge, are constructed using a variety of item formats (that you’ll learn about in Chapters 12 and 13), and can be used as diagnostic, remedial, or just assessment-type tools. In almost every way, they can be powerful allies in the learning process.

Time to Practice

1. Why is the development of any standardized test so expensive?

2. Find a research article that uses a standardized test, and then answer these questions:
   a. What is the name of the test?
   b. To what purpose was the test used in the research you are reading?
   c. How do you know the test is reliable and valid?
   d. If the test is reliable and valid, why is the test the appropriate one for the purpose of the research?

3. Interview one of your current or past teachers and see if you can get them to share what their test development philosophy is. Ask them about how they decide which questions to ask. What do they do if a particular question is not very good (for example, it does not accurately discriminate or it is too easy)? Don’t forget to take good notes.

4. Draw up a simple table of specifications for the area in which you are studying using several topics as one dimension and the degree of difficulty (easy, medium, and hard) for the other.

5. Give an example of how an administrator would score a criterion-referenced test. How would scoring differ if the same test were changed to a norm-referenced test?

6. Which of the following are examples of achievement tests?
   a. A midterm covering material taught in the first half of the course
   b. A test to determine whether a student can test out of Spanish I and enroll directly in Spanish II
   c. A test assessing where sixth-grade students lie on a continuum of characteristics such as optimism, shyness, openness to new experiences, and so forth
   d. A standardized test to assess whether the fifth-graders at a school are meeting district standards for curriculum goals
   e. An algebra test that assesses ability to solve inequalities, linear equations, and quadratic equations


7. What level of Bloom’s taxonomy would each of the following questions assess?
   a. Define the term heart palpitation.
   b. Calculate the mean of the following group of numbers: 4, 7, 13, 21, 22, 30.
   c. Assess the following short-answer response, identifying any strengths and weaknesses you see in the student’s ability to accurately explain the concept.
   d. Contrast the concept of bravery with the concept of courage.
   e. Who is credited with coining the term learned helplessness?
   f. On what date did the Imperial Japanese Navy attack Pearl Harbor?

Want to Know More?

Further Readings

• Bowker, M., & Irish, B. (2003). Using test-taking skills to improve students’ standardized test scores (Master’s thesis). Retrieved from https://eric.ed.gov/?id=ED481116

  This is a nice research project where a program was developed to improve test-taking skills to increase standardized test scores. The treatment? Simple—preparation for taking standardized tests. Students showed an improvement in concentration during tests, and students’ scores increased.

• Sulak, T. (2016). School climate and academic achievement in suburban schools. Education and Urban Society, 48(7), 672–684.

  A lot goes on at school in addition to the teacher teaching and the students studenting. And one of those important “other” areas is school climate. This study examined the predictive value of suburban school climate on academic achievement in a nationally representative sample of suburban campuses. The outcome? Behavior and surrounding crime rate may be factors in suburban academic achievement.

And on Some Interesting Websites

• Fair Test—The National Center for Fair and Open Testing at http://www.fairtest.org/ has a mission to “end the misuses and flaws of standardized testing and to ensure that evaluation of students, teachers and schools is fair, open, valid and educationally beneficial.” A really interesting site to visit.

• “Preparing Students to Take Standardized Achievement Tests” (at https://scholarworks.umass.edu/pare/vol1/iss1/11/) was written by William A. Mehrens (and first appeared in Practical Assessment, Research & Evaluation) for school administrators and teachers and discusses what test scores mean and how they can be most useful in understanding children’s performance.


And in the Real Testing World

Real World 1

How interesting it is to look at other variables and see how they relate to scores on achievement tests. In this study, sixth- and eighth-grade urban middle school students’ achievement test scores were examined. Significant differences were found in achievement for socioeconomic status (SES) and participation in a music ensemble. Higher SES students scored higher. Sixth-grade band students scored significantly higher than choir students and nonparticipants on every achievement test. An interesting conclusion? Band may attract higher-achieving students from the outset and test score differences remain stable over time.

Want to know more? Kinney, D. W. (2008). Selected demographic variables, school music participation, and achievement test scores of urban middle school students. Journal of Research in Music Education, 56, 145–161.

Real World 2

Since summer vacation became a regular part of school life, teachers and parents (and even students!) have wondered what school knowledge and skills are lost over that 2- to 3-month period. A review of 39 studies indicated that achievement test scores decline over summer vacation. Summer loss equaled about 1 month on a grade-level equivalent scale, and the effect of summer break was more detrimental for math than for reading and most detrimental for math computation and spelling. How about year-round school? How about not?

Want to know more? Cooper, H., Bye, B., Charlton, K., Lindsay, J., & Greathouse, S. (1996). The effects of summer vacation on achievement test scores: A narrative and meta-analytic review. Review of Educational Research, 66, 227–268.

Real World 3

Gender difference is one variable that researchers always look to (rightly or wrongly) to help explain differences in performance, and it is always one of great interest to explore. Gender differences on multiple-choice and constructed-response science tests were examined, and differences were greatest for items that involved visual content and called on application of knowledge commonly acquired through extracurricular activities. The researchers found that conclusions about group differences and correlates of achievement depend heavily on specific features of the items that make up the test, as opposed to some sort of general ability differences between genders.

Want to know more? Hamilton, L. S. (1998). Gender differences on high school science achievement tests: Do format and content matter? Educational Evaluation and Policy Analysis, 20, 179–195.

8
APTITUDE TESTS
What’s in Store for Me?

Difficulty Index ☺ ☺ ☺ ☺ (moderately easy)

LEARNING OBJECTIVES

After reading this chapter, you should be able to

• Describe how aptitude tests are used.
• List the steps for the development of aptitude tests.
• Differentiate between different types of aptitude tests.
• Compare and contrast the major aptitude tests.
• Apply the concepts of validity and reliability to aptitude tests.

What’s the coolest aptitude test out there? By far, it’s the GLAT (the Google Labs Aptitude Test), which you can find (and take) at http://googleblog.blogspot.com/2004/09/pencils-down-people.html. Take it and send it on to the Google employment folks. Score high enough and they may call you about a job. We’ll show you an item from this test later in the chapter.

What’s an aptitude test? Aptitude tests evaluate an individual’s potential—they indicate what someone may be able to do in the future—and do not reflect their current level of performance. It’s what you may have to take when you apply for a job as a drafting technician or for an advanced degree. Aptitude tests attempt to find out how qualified you are for what might lie ahead. Your past accomplishments may, of course, influence your future ones, but an aptitude test tries to quantify how well prepared you are for that future.

So, if you can answer this question from the GLAT, you may be well on your way to a Google future. And here it is:

1
1 1
2 1
1 2 1 1
1 1 1 2 2 1

What is the next line of numbers? _______________

Want the answer? Take the test! We ain’t telling!

WHAT APTITUDE TESTS DO

Aptitude tests come in at least two flavors: They can assess cognitive skills and knowledge, as on the SAT (which is not an acronym—see the box—and focuses on academic knowledge and skills in areas such as reading and mathematics) or the ACT (which used to stand for American College Testing, but it, too, apparently is now a name and no longer an acronym for anything), taken in the junior or senior year of high school, and they can assess psychomotor performance, such as on the Differential Aptitude Test (or DAT, which focuses on psychomotor skills). In other words, in order to predict how you will perform in the future (on the job or in college), aptitude tests try to measure your ability either directly by having you do stuff (like weld a joint) or trying to “see” relevant abilities by asking you questions that require the ability to respond correctly (like analogies and logical problem-solving tasks).

What’s in a Name?

Yes, believe it or not, the SAT officially, officially, officially stands for nothing. It’s now a brand and no longer an acronym. Not Standardized Achievement Test, not Scholastic Aptitude Test, not Silly Accomplishments on Thursday—nothing. Why? Well, the publishers at Educational Testing Service (ETS) in Princeton, New Jersey, didn’t have much comment, but perhaps it’s because the title of Standardized Achievement Test (used 30 years ago) is too limiting in what it says about the test. ETS wants the SAT to be used for many, many different audiences, and leaving the title somewhat ambiguous may allow for that—and may also allow for lots more test takers and lots more money.

At their heart, tests like the SAT and ACT (and GRE and many other college admissions tests) are both achievement tests and aptitude tests, and we could have put them in either this chapter or Chapter 7.

Both of these types of aptitude tests are worthy of the title “aptitude test” because they have been shown to have predictive validity (see Chapter 4 for more about this kind of validity). Scores on these tests can be used to comfortably and accurately predict a set of future outcomes. In the case of the SAT, the outcome might be performance in college during the first year. In the case of the DAT, it is how well the test taker can perform both simple and complex basic motor skills—both of which are important for success in such jobs as automotive work, engineering, and the assembly of intricate parts.

Aptitude tests are used to assess potential or future performance. For example, they might be used to determine how well a person might perform in a particular employment position or profession, and they are very often used by businesses and corporations to better understand the potential of an applicant or employee. These types of tests tap the ability to use specific job-related skills and to predict subsequent job performance. In our Google example earlier, this type of item taps thinking and reasoning skills, two skills the Google folks feel are important and related to the type of work in which they would like new employees to engage.

It’s really easy to confuse aptitude (this chapter), intelligence (Chapter 9), and achievement tests (Chapter 7), because they often overlap in terms of the types of questions asked as well as their use. For example, an achievement item might very well predict how well someone might do in a particular field (which is kind of an aptitude concern) and the ability to perform well on the job probably correlates pretty well with general intelligence. So expect to see overlap between test purposes and similarity in test items among these three. What kind of test is it? Usually, the answer lies in what it’s being used for—to predict tomorrow’s performance (which is in the future), assess cognitive ability (which is right now), or measure learning (which happened in the past).

HOW TO DO IT: THE ABCS OF CREATING AN APTITUDE TEST

Remember that an aptitude test is used mostly to predict performance. As the boss of a company, you might use an aptitude test that looks at the potential of employees to be one of the firm’s financial officers. You want to assess the candidate’s current knowledge of finance but also their potential to fit into a position that has new task demands (like creating and understanding a year-end report).


Another common example is the use of the SAT or the ACT, both of which basically consist of achievement-like items, for the prediction of how well a high school student will do during college. As it turns out, these tests do moderately predict performance during that first year, but the predictive validity falls off sharply for the later years. So although the SAT and ACT are looking at current level of skill (and some people call that cognitive ability) or knowledge (and that sounds like achievement), these results are used mainly to point to the future.

Same Content, Different Purpose. We have pointed out how achievement tests and aptitude tests overlap a great deal. In fact, aptitude tests might include the same kind of material as achievement tests but just be used for a different purpose—and that is, perhaps, the key. For example, at the elementary school level, the Metropolitan Readiness Tests assess the development of math and reading skills. At the secondary school level, the SAT assesses writing, critical reading, and math skills. At the college level, the Graduate Record Exam assesses verbal, quantitative, and analytical skills. Different levels, all used for examining future potential, but all containing many similar, achievement-like items. The key is what aptitude tests are used for—the prediction of later outcomes.

You might take the following steps in the creation of an aptitude test. Let’s say that this particular test will assess manual dexterity and it is intended for applicants to culinary school. (A lot of our examples have to do with cooking and food. Have you noticed that? We tend to write when we are hungry . . .)

1. Define the set of manual dexterity skills that you think is important for success as a chef. This definition could come from empirical examination of two groups of chefs—those who are struggling and those who are successful.

2. Now that you know what types of skills distinguish those who are successful from those who are not, create some kind of instrument that allows you to distinguish between the two groups. Find out which items work (that is, they discriminate between groups, or correlate with similar existing tests or actual performance as a chef) and which do not work. These “items” might not be questions but would likely be tasks that require test takers to move their hands and be scored on how well they do that, either through direct observation by some judge or through the creation of a product that could be judged (like a nicely peeled apple). Performance-based assessments like this are explored in great detail in Chapter 13, because teachers design these sorts of assessments all the time.

3. Ta-da! You have the beginnings of an aptitude test with a set of tasks that defines and distinguishes between those who may do well in culinary school (or at least those who have one set of skills necessary to be successful in culinary school) and those who may not.


4. Now test, test, and test aspiring chefs, working on the reliability and validity of the test, and after lots of hard work, including, maybe, formal research studies, you might be done. (Whew, even just typing all that has made us hungry. Wish we knew how to peel an apple.) Commercial test publishers are constantly revising their tests to make them as good and as accurate as possible. Take, for example, the Law School Admission Test (LSAT). This test consists of five 35-minute sections of multiple-choice questions. One section is on reading comprehension, one focuses on analytical reasoning, and two focus on logical reasoning. A 30-minute writing sample is the last element of the test. The following item is from one of the logical reasoning sections (and we send our thanks to the Law School Admission Council for allowing us to include this). 12. Navigation in animals is defined as the animal’s ability to find its way from unfamiliar territory to points familiar to the animal but beyond the immediate range of the animal’s senses. Some naturalists claim that polar bears can navigate over considerable distances. As evidence, they cite an instance of a polar bear that returned to its home territory after being released over 500 kilometers (300 miles) away. Which one of the following, if true, casts the most doubt on the validity of the evidence offered in support of the naturalists’ claim? (A) The polar bear stopped and changed course several times as it moved toward its home territory. (B) The site at which the polar bear was released was on the bear’s annual migration route. (C) The route along which the polar bear traveled consisted primarily of snow and drifting ice. (D) Polar bears are only one of many species of mammal whose members have been known to find their way home from considerable distances. (E) Polar bears often rely on their extreme sensitivity to smell in order to scent out familiar territory. What is the right answer? 
We aren’t sure, but Bruce guesses it is B (or maybe A, but probably B), though he reminds you he is neither a licensed attorney nor a licensed polar bear. Once this item is administered (along with all the others, of course), the test makers analyze the results to find out how effectively the item differentiated those who scored high from those who scored low on the total test. The item may be rewritten, thrown out altogether, or revised heavily and then used again in a subsequent test. And the test designers might very well use the tools of Item Response Theory that we discussed in Chapter 6.
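To make the idea of an item “differentiating” high scorers from low scorers concrete, here is a minimal sketch of one classic way to check it, the upper-lower discrimination index. It is an illustration only, not the procedure any particular test publisher uses. The function name, the made-up scores, and the 27% cutoff for the high and low groups are all assumptions chosen for the example.

```python
def discrimination_index(item_correct, total_scores, fraction=0.27):
    """Upper-lower discrimination index (D) for a single item.

    item_correct: list of 0/1 flags (1 = answered this item correctly).
    total_scores: total test scores for the same test takers, same order.
    fraction:     share of test takers placed in each extreme group
                  (0.27 is a common convention, assumed here).
    """
    n = len(total_scores)
    group_size = max(1, round(n * fraction))
    # Rank test takers from lowest to highest total score.
    order = sorted(range(n), key=lambda i: total_scores[i])
    low_group = order[:group_size]
    high_group = order[-group_size:]
    # Proportion answering the item correctly in each group.
    p_high = sum(item_correct[i] for i in high_group) / group_size
    p_low = sum(item_correct[i] for i in low_group) / group_size
    return p_high - p_low


# Made-up data for ten test takers (not real LSAT results).
totals = [12, 15, 18, 20, 22, 25, 27, 30, 33, 35]
item_a = [0, 0, 0, 1, 0, 1, 1, 1, 1, 1]  # mostly high scorers get it right
item_b = [1, 0, 1, 1, 0, 1, 0, 1, 0, 1]  # unrelated to the total score

d_a = discrimination_index(item_a, totals)
d_b = discrimination_index(item_b, totals)
print(f"Item A: D = {d_a:.2f}")
print(f"Item B: D = {d_b:.2f}")
```

A value of D near 1 means the item separates the two groups well, a value near 0 means it does not separate them at all, and a negative value means low scorers did better than high scorers on that item, which is usually a sign that the item needs another look.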


Do aptitude tests sound like “What am I best suited for?” tests, like the kind you would take when you visit the career or vocational planning center at school? In many ways, they very much are. But because career planning and vocational tests are so important in their own right, we’ll leave that discussion for Chapter 11. For now, keep in mind that a vocational test is very much a specialized type of aptitude test but uses attitudes and preferences to make its predictions, not ability.

TYPES OF APTITUDE TESTS The term aptitude encompasses so many different skills and abilities that it is best to break it down into some of the many different areas on which aptitude tests might focus.

Mechanical Aptitude Tests

Mechanical aptitude tests focus on a variety of abilities that fall into the psychomotor domain, from assembly tests (where the test taker has to actually put something together) to reasoning tests (where the test taker has to use reasoning skills to solve mechanical types of problems). Some tests that fall into these categories are the Armed Services Vocational Aptitude Battery Form, the Bennett Mechanical Comprehension Test, and the Differential Aptitude Test.

Artistic Aptitude Tests

How very difficult it is to tell whether one has “promise” (because, after all, that is what aptitude is, right?) in the creative or performing arts. Such tests evaluate artistic talent, be it in music, drawing, or other forms of creative expression. For example, there are the Primary Measures of Music Audition (for Grades K–3) and the Intermediate Measures of Music Audition (for Grades 1–6). Both of these tests require children to listen to music and then answer questions using pictures rather than numbers or words (thereby reducing any confounding that differences in reading ability might introduce). The children must decide whether pairs of tonal or rhythm patterns they hear sound the same or different.

Readiness Aptitude Tests In the early 20th century, Arnold Gesell started the Child Study Center at Yale University, where he documented maturation as an important process in the growth and development of young children. He not only documented it, he micro, micro studied it, creating extensive film libraries showing aspects of individual growth and development.


Arnold Gesell was interested in how ready children were for school, and the term aptitude test had not even been coined yet. Nonetheless, that’s what his readiness tests assessed. And very interestingly, readiness and well-being both fall under the broader umbrella of hygiene (another term used then for mental health). One of his ideas was that children become ready for particular phases of learning and development at different times, and the introduction and use of readiness tests at all levels of education (but especially the elementary level) serve the purpose of evaluating whether a child is ready for school. For example, the Gesell Child Development Age Scale is designed to determine in which of the 10 Gesell early development periods a child is presently functioning. That information, in turn, can be used by early educators and other teachers to decide whether a child is ready for a particular level of activity. And although we don’t call them readiness tests, you can bet that the Graduate Record Exam (GRE) is used to determine whether students are ready for graduate school. But at whatever level of education, the intent is the same—to find out if the test taker is ready for what comes next. That’s a prediction question, and that’s the business of aptitude tests.

Clerical Aptitude Tests There is an absolutely huge number of people who are employed in clerical positions, including everything from file clerks to cashiers to administrative assistants. These kinds of employment positions require great attention to detail, especially numbers and letters. One of the most famous and often used of these aptitude tests is the Minnesota Clerical Test (MCT), which measures speed and accuracy in clerical work in high school students and was first used in 1963. This test consists of 200 items, with each item containing a pair of names or numbers. The test taker is simply supposed to indicate whether the pair of names or numbers is the same or different. Although this may sound like an overly simplified approach, the MCT is a very accurate and highly predictive test (and that, of course, is key), where the final score is the number of incorrect responses subtracted from the number correct.
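The scoring rule just described (number correct minus number incorrect) is easy to sketch. What follows is only an illustration of that arithmetic, not the publisher's scoring procedure. The function name, the sample pairs, the "S"/"D" response codes, and the choice to let skipped items count for neither total are all assumptions made for the example.

```python
def rights_minus_wrongs(responses, answer_key):
    """Return number correct minus number incorrect.

    Skipped items (None) are assumed to count for neither total.
    """
    correct = sum(
        1 for r, k in zip(responses, answer_key) if r is not None and r == k
    )
    incorrect = sum(
        1 for r, k in zip(responses, answer_key) if r is not None and r != k
    )
    return correct - incorrect


# Hypothetical same/different items in the style described above.
pairs = [
    ("66273894", "66273984"),
    ("New York World", "New York World"),
    ("527384578", "527384578"),
    ("Cargill Grain Co.", "Cargil Grain Co."),
]
# "S" = the pair is the same, "D" = the pair is different.
answer_key = ["S" if left == right else "D" for left, right in pairs]

# One test taker: two right, one wrong, one skipped.
responses = ["D", "S", "D", None]
score = rights_minus_wrongs(responses, answer_key)
print(f"Score: {score}")
```

Subtracting wrong answers rewards accuracy as well as speed: a test taker who races through the items and marks them carelessly gains nothing over one who works more slowly but gets them right.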

WHAT THEY ARE: A SAMPLING OF APTITUDE TESTS AND WHAT THEY DO Table 8.1 shows you some of the most often used aptitude tests and all the information you will ever want to know about them.


TABLE 8.1  Some Widely Used Aptitude Tests

SAT
Purpose and What It Tests: The SAT is designed to assess what you have learned in high school and how prepared you are to succeed in college. Measures reading comprehension, writing, and math ability.
Grade Levels/Ages Tested: Freshman in high school and older
Conceptual Framework: From their website: “Focus on the knowledge, skills, and understandings that research has identified as most important for college and career readiness and success. Emphasis on the meaning of words in extended contexts and on how word choice shapes meaning, tone, and impact.”
What’s Interesting to Note: The essay portion is now optional because many colleges do not require it. There also is no longer a penalty for guessing, so no extra deductions for wrong answers.

ACT
Purpose and What It Tests: Measures four areas: English, math, reading, and scientific reasoning
Grade Levels/Ages Tested: Any age, but most people are 13 or older
Conceptual Framework: Structured to measure general education (a development that sounds like achievement) and ability to do college-level work
What’s Interesting to Note: Developed as a competitor to the SAT. Chooses topic areas based on reviews of common state standards and surveys of educators as to what the important college skills are

Denver II
Purpose and What It Tests: The Denver II is designed to screen for developmental delays. Produces 4 scores that describe developmental skills (personal-social, fine motor-adaptive, language, gross motor) and 5 scores that describe behavior (typical, compliance, interest in surroundings, fearfulness, attention span)
Grade Levels/Ages Tested: Birth to age 6
Conceptual Framework: Children’s development tends to occur in a predictable pattern, and although there are individual differences across ages, there are enough similarities that a general level of developmental competence can be assessed accurately.
What’s Interesting to Note:
1. Used to be called the Denver Developmental Screening Test
2. Uses a visual form for recording scores that aligns with child’s chronological age and accounts for prematurity
3. Easy and quick to use, which appeals to health care providers who are not necessarily trained in the administration of complex tests

Differential Aptitude Test (DAT), Fifth Edition
Purpose and What It Tests: The DAT measures students’ ability to learn or to succeed in a number of different areas. Nine scores in verbal reasoning, numerical reasoning, abstract reasoning, perceptual speed and accuracy, mechanical reasoning, space relations, spelling, language usage, and total scholastic aptitude
Grade Levels/Ages Tested: Grades 7–9, Grades 10–12, and adults
What’s Interesting to Note:
1. Used mostly in educational counseling and personnel assessment
2. Often used in career guidance
3. The DAT includes a practice test so students can become familiar with the type of items and time requirements.

Armed Services Vocational Aptitude Battery (ASVAB)
Purpose and What It Tests: According to the test developers, the ASVAB is intended “for use in educational and vocational counseling and to stimulate interest in job and training opportunities in the Armed Forces.” General science, arithmetic reasoning, word knowledge, paragraph comprehension, numerical operations, coding speed, auto-shop information, math knowledge, mechanical comprehension, electronics information, academic ability, verbal ability, math ability
Grade Levels/Ages Tested: High school, junior college, and young adult applicants to the armed forces
Conceptual Framework: Measures broad aptitudes (and has its own kind of “g,” or general underlying factor) that have relevance to different educational and occupational domains
What’s Interesting to Note: Used primarily for testing the potential of armed services enlistees

Minnesota Clerical Test (MCT)
Purpose and What It Tests: This test measures speed and accuracy in clerical work. Variable measures are accuracy of number and name comparison.
Grade Levels/Ages Tested: Grades 8–12 and adults
Conceptual Framework: Designed to assess the accuracy and efficiency of processing numerical and linguistic information
What’s Interesting to Note:
1. Norms are presented for 10 different vocational categories.
2. Very easily administered in 15 minutes


VALIDITY AND RELIABILITY OF APTITUDE TESTS Validity is a unitary concept, and all types of validity arguments and evidence can come into play in determining validity of a measure, but certain validity approaches seem particularly important for different tests with different purposes. Aptitude tests are, by definition, meant to predict the future. Consequently, the classic proof that they do that is a correlation between an aptitude test’s scores and scores on some relevant future measure, like college GPA or evaluations of professional performance. So, it is criterion-based validity evidence (specifically predictive criterion validity) that is required, at a minimum, to convince people of an aptitude test’s validity. For reliability, in addition to the pretty much mandatory requirement of a high internal reliability estimate, aptitude test developers might be particularly interested in demonstrating good test–retest reliability—evidence that the predictor (the aptitude test) is stable and scores don’t fluctuate easily depending on the wind or the day a person takes the test. Finding correlations between two sets of scores depends so much on the high reliability of both measures, and the stability across time of both scores is critical.
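Both kinds of evidence described above come down to a correlation between two sets of scores, so a short sketch may help. It computes a Pearson correlation by hand and applies it twice: once between two administrations of the same aptitude test (test–retest reliability) and once between the aptitude scores and a later criterion (predictive validity). All of the numbers are invented for illustration, and the variable names are our own.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products and sums of squares around the means.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Invented scores for eight students (not real data).
aptitude_time1 = [48, 52, 55, 60, 63, 67, 70, 74]  # first administration
aptitude_time2 = [50, 51, 57, 59, 65, 66, 72, 73]  # same test, weeks later
college_gpa = [2.4, 2.9, 2.7, 3.0, 3.4, 3.1, 3.6, 3.8]  # measured later on

test_retest = pearson_r(aptitude_time1, aptitude_time2)
predictive = pearson_r(aptitude_time1, college_gpa)

print(f"Test-retest reliability estimate: {test_retest:.2f}")
print(f"Predictive validity coefficient:  {predictive:.2f}")
```

In real studies the predictive coefficient is usually well below the reliability coefficient, in part because the criterion (here, GPA) is itself measured with error. That is why the reliability of both measures matters so much.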

Summary Aptitude tests are somewhat like achievement tests in that they often include testing of knowledge of a particular topic or subject matter, but aptitude tests almost always measure ability, as well. The biggest difference is that the results of aptitude tests are used for different things than are the results from achievement tests. An aptitude test looks to future potential rather than current levels of performance. Now that you’re an expert on achievement and aptitude tests, it’s time to move on to psychological tests. That’s exciting because these tests have personality!

Time to Practice

1. Why can the same set of achievement items be used for both an aptitude test and an achievement test?
2. Access the Buros Institute online through your library and summarize one of the reviews of any aptitude test. Be sure to address the conceptual rationale for why the test was created in the first place.
3. Design a five-item aptitude test (don’t worry, you don’t have to do any reliability or validity studies) that examines potential in any one of the following professions:
   a. Lawyer
   b. City planner
   c. Doctor
   d. Airline pilot
4. How does an aptitude test differ from an achievement test?
5. Many aptitude tests include the term readiness in their title. Why?
6. At this point in your education, you likely have experience taking multiple aptitude tests. What have your experiences been like? In your opinion, what are the advantages of these types of tests? How about a drawback?
7. How might aptitude tests be useful to employers? How about to marathon runners?
8. Name an aptitude test and how you would test its predictive validity.

Want to Know More?

Further Readings

• Hoffman, E. (2001). Psychological testing at work: How to use, interpret, and get the most out of the newest tests in personality, learning style, aptitudes, interests, and more! New York: McGraw-Hill.

Aptitude testing need not be boring. Find out how the leading employers use it to find just who they want. This book talks about employee screening and how to assess job candidates’ traits, interests, and skills.

• Lewandowski, L. J., Berger, C., Lovett, B. J., & Gordon, M. (2016). Test-taking skills of high school students with and without learning disabilities. Journal of Psychoeducational Assessment, 34, 566–576.

This study is interesting for a bunch of reasons, not the least of which is that the authors go old school and refer to the SAT as the Scholastic Aptitude Test (which we aren’t supposed to say out loud anymore). It’s most interesting because they assess the test-taking skills of 776 high school students, 35 of whom were diagnosed with learning disabilities. Students in the learning-disabled group obtained lower scores than those in the nondisabled group on speed, comprehension, vocabulary, and decoding of reading, spent more time reviewing comprehension questions, and were less active in looking for answers in the reading passages. The groups did not differ in their levels of test anxiety or confidence, and vocabulary score best discriminated between groups and best predicted reading comprehension performance.

And on Some Interesting Websites

• Want to work for Judge Judy? See if you have what it takes to be a court reporter at https://study.com/articles/should_i_become_a_court_reporter_-_quiz_self-assessment_test.html.

• Need help with the SAT, GRE, or ACT? Take a look at http://www.number2.com (Quit snickering. Number 2 refers to a Number 2 pencil, get it?) for free preparation help.


And in the Real Testing World Real World 1 Few things are more important than public safety, and being able to accurately assess candidates for public safety positions (such as police officers) is about as good an application of testing practices as possible. This study explores the relationship between criteria used to select police officers and some standard markers such as education level, grades in high school, and scores from two aptitude tests. These are the variables supported as predictor variables in cadet selection and final grade in select academy courses. Want to know more? McKenna, G. (2014). Empirical evaluation of aptitude testing and transcript review for police cadet applicants. International Journal of Police Science & Management, 16, 184–195.

Real World 2 Here’s where an ability test can be used to predict outcomes and, in a sense, acts very much like an aptitude test. These authors begin their work based on the assumption that individual differences in cognitive ability among children raised in socioeconomically advantaged homes are primarily due to genes, whereas environmental factors are more influential for children from disadvantaged homes. They investigate the origins of this gene × environment interaction in a sample of 750 pairs of twins measured at age 10 months and 2 years. A gene × environment interaction was evident in the longitudinal change in mental ability over the study period. Want to know more? Tucker-Drob, E. M., Rhemtulla, M., Harden, K. P., Turkheimer, E., & Fask, D. (2011). Emergence of a gene × socioeconomic status interaction on infant mental ability between 10 months and 2 years. Psychological Science, 22, 125–133.

Real World 3 A group of widely used musical ability aptitude tests was examined in terms of their correlations with skills believed to be important for successful musicians, such as aural perception (having an ear for music), affective outcomes (emotional feeling), and creativity. The researcher, Josef Hanson of the University of Massachusetts, found that there were moderate to strong correlations between the tests and these variables, providing some criterion-related validity evidence for their use. Hanson, J. (2019). Meta-analytic evidence of the criterion validity of Gordon’s music aptitude tests in published music education research. Journal of Research in Music Education, 67(2), 193–213.

9 INTELLIGENCE TESTS Am I Smarter Than My Smart Phone? Difficulty Index ☺ ☺ ☺ (getting harder)

LEARNING OBJECTIVES

After reading this chapter, you should be able to
• Explain the different theories of intelligence.
• Describe the development of the first and most famous intelligence test, the Stanford–Binet.
• Compare and contrast several major intelligence tests.
• Apply the concepts of validity and reliability to intelligence tests.

In the history of standardized tests, there is no score so misunderstood as an IQ score. It is a score that can bring joy or concern or anger or confusion. It can help a student get a more enriching educational experience, but it is also a common weapon for teasing or bullying. It can build you up or lower your self-esteem. So, let’s spend a paragraph or two demystifying intelligence tests and removing most of the emotion attached to the idea of intelligence.

THE ABCs OF INTELLIGENCE

No physicist has ever seen an atom (just reflections of one), and no social or behavioral scientist has ever “seen” intelligence. Its existence is inferred by the way people behave. For intelligence, we can observe relevant behavior by seeing if you can retrieve information, remember stuff, and solve problems. And tests are our way of making that behavior happen.

Here are the (for some reason) little-known truths about intelligence tests. First, they are designed to predict how you will do in school. That was their original purpose and, substantially, it still is. So, they are aptitude tests, like the ones we talked about in Chapter 8. Let’s think: What skills might be useful in getting high grades in school? Well, we suppose knowing basic school stuff (verbal and math knowledge) would be important (in other words, evidence that you’ve learned things in the past). Also, the ability to solve problems, think analytically, process information, remember well, and comprehend the world around you would all come in handy if you wanted to be a good student. It makes sense, then, that that’s what intelligence tests measure: They are built from bunches of subtests (little mini-tests) that tap into those sorts of constructs. These tests are called intelligence tests because it seems reasonable that if you combine all these abilities, you are measuring intelligence. And, by the way, for this purpose of the tests, predicting school performance, intelligence tests have very good criterion-based validity: They correlate very well with the grades students get in school.

There are several theories of the structure of intelligence (is it one thing or does it have several different dimensions?) and several tests based on them. But most define intelligence something like this (from psychologist David Wechsler, whose Wechsler intelligence scales have been popular for more than 75 years): “The global capacity of a person to act purposefully, to think rationally, and to deal effectively with (their) environment.” So, intelligence is a construct represented by a group of related variables (such as verbal skills, memory, mechanical skills, comprehension, and more) and has some theoretical basis to it. Let’s talk about that for a bit: the different theoretical definitions that people have offered. Tests of intelligence are so closely tied to the theory on which the tests are based that we first have to talk a bit about these different approaches. That way, when it comes to talking about the development of the tests, we can see the logical extension from the idea to its application.

The Big g

One of the first conceptualizations of intelligence was first proposed by Charles Spearman in 1904 and elaborated in the late 1920s (around 1927). Spearman believed that underlying all of intelligent behavior is a general factor that accounts for individual differences between people in terms of acting smart. He named this theory General-Factor Theory and named the actual factor g to represent general intelligence. He argued that there is one type of intelligence that accounts for individual differences between people on this construct: specifically, how well one deals with abstract relationships.


However, because not everyone with a high g could do everything well (remember, write, do math, assemble a puzzle, and so on), he also recognized that there are a bunch of other factors that are specific in their nature, and he called them the s’s to represent specific types of intelligences. So Spearman’s theory represents intelligence as this general factor with a bunch of s’s floating around inside of it, and these s’s are related to one another to varying degrees. For example, the specific type of intelligence associated with one skill (say, assembling blocks in a particular configuration) might be related to basic math skills but only weakly related to being able to remember a set of digits. To g or Not to g. There remain ongoing debates about whether intelligence is best represented by the big g (for a general underlying ability) or a small s (for more specific and independent abilities). There is no answer, but it’s a great question that helps fuel new ideas and approaches to the assessment of intelligence. However intelligence is broken down, the subtests on intelligence tests tend to correlate with each other, suggesting that there is probably an underlying trait accounting for a substantial proportion of intelligence.

More Than Just the Big g: The Multiple-Factor Approach

Well, as always with science, Spearman’s idea of one, and only one, general factor did not exactly fit other models and ideas. In fact, where Spearman thought that there was only one type of intelligence, g (although that is a bit nondescript), others, such as Louis Thurstone from the University of Chicago (Go Maroons!), thought that there were different types of intelligence, which he called primary mental abilities. Each of these abilities, such as verbal ability, spatial ability, and perceptual speed, is independent of the others, said Thurstone. So from a very (g)eneral view of intelligence set forth by Spearman, Thurstone put forth a view that relies heavily on the presence of many different types of (primary or basic) intelligences that are, most importantly, independent (both conceptually and statistically) of one another. Although it’s really tough to find sets of human behaviors or attributes or abilities that are separate from, or unrelated to, one another, his theory has been the basis for a widely used test of intelligence, the Primary Mental Abilities Test, which has appeared in many different forms since the late 1930s.

Factors, Factors, and More Factors. Wanting to verify whether intelligence was one factor or several led to the invention of a powerful statistical strategy called factor analysis. This is kind of neat: new ideas help create methods that are in turn used to look at those ideas. At one point, the most popular theory of intelligence led us to believe that intelligence was represented by one big super factor, called g. Then, along came other theorists who believed that intelligence consisted of a bunch of more specific abilities (some related and some not), then a new idea that intelligence is just a collection of independent abilities. To explore that last bit of thinking, factor analysis was created and refined and has been used ever since in the development and verification of many different tests of intelligence.

Factor analysis can be used to determine which of a set of variables can be grouped together to form a conceptually sound unit (called a factor). For example, two individual tests that examine one’s ability (1) to assemble puzzles and (2) to build three-dimensional objects might both be reflections of a factor called spatial ability. If, indeed, these two tests share something in common across a large number of individuals and, say, do not share something in common with reading skills or other factors we believe are unrelated to them, voila! We may have a factor that can stand on its own (is independent), and we may even have an element of our theory of intelligence. Factor analysis essentially looks at the correlations among a bunch of variables (like test items) and sees whether some items group together, that is, whether they correlate more with each other than they do with other variables. If they do, those groups of highly correlated items might represent a factor. Factor analysis is used to validate theories of intelligence as well as to study many different phenomena in the social and behavioral sciences, and it is well worth knowing about.
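The core intuition, that some measures correlate more with each other than with the rest, can be shown with a small sketch. This is not a real factor analysis (which estimates factors and loadings from the full correlation matrix). It only computes the pairwise correlations and flags the pairs that hang together. The four subtest names, the scores, and the 0.7 cutoff are all made up for illustration.

```python
from itertools import combinations
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Invented scores for eight people on four hypothetical subtests.
scores = {
    "puzzles":    [3, 5, 6, 8, 9, 4, 7, 2],
    "blocks":     [4, 5, 7, 8, 9, 3, 6, 2],
    "vocabulary": [7, 3, 8, 4, 6, 9, 2, 5],
    "reading":    [8, 3, 7, 5, 6, 9, 2, 4],
}

# Correlate every pair of subtests.
correlations = {
    (a, b): pearson_r(scores[a], scores[b])
    for a, b in combinations(scores, 2)
}

# Flag pairs that correlate strongly (0.7 is an arbitrary cutoff).
strong_pairs = [pair for pair, r in correlations.items() if r >= 0.7]

for (a, b), r in correlations.items():
    print(f"{a:>10} with {b:<10} r = {r:+.2f}")
print("Strongly related pairs:", strong_pairs)
```

With these made-up numbers, puzzles and blocks go together (a candidate “spatial” factor), vocabulary and reading go together (a candidate “verbal” factor), and the correlations across the two groups are weak. A full factor analysis formalizes that pattern instead of relying on an eyeballed cutoff.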

“Book Smart” or “Street Smart”?

Do you ever hear people talk about their children (or, maybe, competitors on reality TV shows) as being street smart, but not book smart? Intelligence researchers sometimes think that way, too, as there seems to be one way of being smart that focuses on learning information and “knowing stuff” and another type of being smart that is all about solving puzzles and figuring out solutions to problems. There’s a theory of intelligence, called the Horn-Cattell Model, that breaks intelligence down into those two aspects of being intelligent. The book smart component of stored knowledge is called crystallized intelligence, and the street smart component of problem solving and being able to learn new skills is referred to as fluid intelligence. So much research supports this formulation that it is widely accepted by those who study intelligence. There is an expanded version of this model, called the Cattell-Horn-Carroll Theory, which specifies broad skills under those classifications. This broader framework, though, includes a general g that seems to underlie performance across all cognitive skills tests. It all comes back to g with most theories of intelligence. But not all, as you’ll see next!

The Three-Way Deal Another model of intelligence that postulates several “types” of intelligence or cognitive abilities was formulated by Robert Sternberg and is called the Triarchic Theory of Intelligence. The basis for his theory is that intelligence can best be explained through an understanding of how people think about and solve problems—the information-processing approach.


According to Sternberg, to understand what we call intelligence and how people differ from one another, we have to look at the interaction between componential intelligence, experiential intelligence, and contextual intelligence.

• Componential intelligence focuses on the structures that underlie intelligent behavior, including the acquisition of knowledge. It’s what you know.
• Experiential intelligence focuses on behavior based on experiences. It’s what you do.
• Contextual intelligence focuses on behavior within the context in which it occurs and involves adaptation to the environment, selection of better environments, and shaping of the present environment. It’s how you behave.

In developing this approach, Sternberg was attempting to ground his theory in the interaction between the individual’s skills and abilities and the demands placed on that individual by the environment.

Way More Than One Type of Intelligence: Howard Gardner’s Multiple Intelligences

As research on intelligence has become more and more sophisticated and has included findings from such fields as neuroscience and developmental psychology, these newer theories have increased significance for everyday behavior. Such is the case with Howard Gardner’s model of multiple intelligences. Gardner theorizes that an individual’s intelligence is not made up of one general factor but eight different types of intelligence that work independently of one another yet in concert. These eight are as follows:

1. Musical intelligence includes the ability to compose and perform music.
2. Bodily-kinesthetic intelligence includes the ability to control one’s bodily movements.
3. Logical-mathematical intelligence includes the ability to solve problems.
4. Linguistic intelligence includes the ability to use language.
5. Spatial intelligence includes the ability to manipulate and work with objects in three dimensions.
6. Interpersonal intelligence includes the ability to understand others’ behavior and interact with other people.
7. Intrapersonal intelligence includes the ability to understand ourselves.
8. Naturalist intelligence includes the ability to identify and understand patterns in nature.


For example, a great NBA basketball player probably excels in kinesthetic intelligence. Yo-Yo Ma, the famous cellist, is high in musical intelligence, and so on. Gardner’s idea is, indeed, a very interesting one. He assumes that everyone has each of these intelligences to some degree, that all of us are stronger in some intelligences than others, and that some of us (and perhaps most) excel in one. You may not be a naturalist and able to understand patterns of bird flight or a musician able to play Beethoven’s Fifth, but maybe you sure are good at foreign languages and are a real schmoozer (interpersonal intelligence). Gardner emphasizes the importance of individual differences in the assessment of these eight intelligences—shades of what early psychometricians talked about almost a century earlier!

Emotional Intelligence: An Idea That Feels Right Nowadays, intelligence can be defined in ways that don’t even involve our traditional views of what being “smart” means. Psychologist Daniel Goleman has popularized the idea of emotional intelligence, the ability to be emotionally sensitive to others and to manage and understand our own emotions and solve problems involving emotional issues. In his book, Emotional Intelligence, Goleman argues that traits and characteristics such as self-awareness and persistence are more important than IQ as we traditionally define intelligence (in terms of the skills needed to succeed in school, remember that’s what intelligence tests are for), and that, of course, children should be taught how to identify their own and others’ emotions. These are the skills that make real-world people successful, regardless of their traditional level of intellect. And like Gardner’s multiple intelligences, perhaps emotional intelligence is not meant to be evaluated in the more traditional way but instead to depend on such skills as observation and keeping journals and other, less objective but likely richer ways of assessing outcomes. There’s no debate about this: Intelligence is a very complex construct that is difficult to define and difficult to measure. And, for a while now, a troubling finding has been that there are often average score differences between racial and ethnic groups and, sometimes on certain subtests, differences between sexes. It’s unclear why these differences appear, but it seems likely that the explanation is not biological because there are no important biological differences in the physical brain at birth between races or genders. That pretty much leaves environment and culture as the most reasonable explanations. These differences may reflect the way the test is constructed and the way items appear on the test. 
After all, the content and format reflect the test developers’ orientation and definition of intelligence, and those things are culturally determined. And for centuries, our way of defining intelligence and measuring it, and even how we think of tests in general, has been shaped and developed by white males. So, the tests may be biased toward the culture and experiences and ways of thinking of the test designers in that way. We explore this thorny issue of test bias a lot more in Chapter 15.

FROM THE BEGINNING: (ALMOST) ALL ABOUT THE STANFORD–BINET INTELLIGENCE SCALE

Let’s look at one of the most popular tests of intelligence, the Stanford–Binet, to get an overall picture of how such a test is developed, how it is used, and how unquestioned assumptions within one’s culture can shape our view on what intelligence is and how it should be measured.

A Bit of History of the IQ

This is pretty interesting stuff, how the Stanford–Binet Intelligence Scale came to be. Around 1905, Parisian Theodore Simon was doing some work in the area of reading. In fact, he had close ties to the famous developmental psychologist Jean Piaget. Piaget’s experience with Simon led Piaget to speculate about the successes and failures of children, especially kids in elementary school, which led to Piaget’s famous work on cognitive development and epistemology (the nature of knowledge).

You may remember from your Psych 1 or introductory child development class that Jean Piaget was one of the foremost and most productive cognitive developmental psychologists. How interesting it is that his definition of intelligence grew from much of his work in France and his later work with his own children (the subjects for many of his early studies). His definition? Intention: those acts showing intention were intelligent acts. Very cool, because it is almost entirely removed from the realm of content, and this focus directly reflects his own theoretical views on human development and the development of intelligence.

Well, as Piaget was becoming increasingly intrigued by the nature of the failure to read among “normal” children, Simon was working with Alfred Binet to develop a test (for the schoolchildren of Paris) to predict and distinguish between those who would do well in school and those who would not. (We told you that was the purpose of intelligence tests; now do you believe us?) Binet also worked with the famous neurologist Jean-Martin Charcot, who also trained Sigmund Freud in the use of hypnotism as a clinical technique. Interesting connections, no? Anyway, the result of Simon and Binet’s efforts was the first 30-item intelligence scale (in 1905) that was used to identify children who were “mentally retarded” (or those who were thought not to be able to succeed in school).
Terms such as “normal” and “mentally retarded” and even “imbecile” and “idiot” were used all the time in those days to talk about intelligence; no wonder we associate IQ with self-worth! Today, we use less insulting terms such as typically developing or intellectual disability. The questions and tasks themselves were substantially chosen by classroom teachers (all of whom were white, just saying).

At first, the items were arranged by difficulty, and then later, the arrangement was by age level. So in addition to having a chronological age for the child, the test results also yielded a mental age (this coinage taking place about 1916). Having both a true, chronological age and a mental age allows for a handy ratio of these two values to give a very good idea whether someone is behind, even with, or ahead of their chronological age. (See why the term retarded was used to describe a person whose intelligence is much lower than typical?)

For example, someone who is chronologically 10 years old, or 120 months, with a mental age of 10.5 years (or 126 months) would have a “score” of 126/120, and 126 divided by 120 is 1.05. Multiply that awkward number with a decimal by 100 and you have the smooth, sleek “Intelligence Quotient,” or IQ! For our 10-year-old, then, the resulting IQ score from a mental age (MA) of 10.5 years and a chronological age (CA) of 10 years is

$$\text{IQ} = \frac{\text{MA}}{\text{CA}} \times 100 = \frac{126}{120} \times 100 = 105$$
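To make the arithmetic concrete, here is a minimal Python sketch of the ratio formula above. The function name `ratio_iq` and the optional `max_mental_age_months` cap are our own illustrative assumptions, not part of any published scoring procedure. The cap simply mimics a test whose items stop at a given age level, which anticipates the upper-limit problem the text turns to next.

```python
from typing import Optional


def ratio_iq(mental_age_months: float,
             chronological_age_months: float,
             max_mental_age_months: Optional[float] = None) -> float:
    """Ratio IQ = (mental age / chronological age) * 100.

    max_mental_age_months is a hypothetical parameter standing in for the
    highest age level covered by a test's items.
    """
    if chronological_age_months <= 0:
        raise ValueError("chronological age must be positive")
    if max_mental_age_months is not None:
        # A test cannot report a mental age beyond its hardest items.
        mental_age_months = min(mental_age_months, max_mental_age_months)
    return mental_age_months / chronological_age_months * 100


# The 10-year-old from the text: MA = 126 months, CA = 120 months.
print(round(ratio_iq(126, 120)))  # 105

# Equal mental and chronological ages give the original "average" of 100.
print(round(ratio_iq(120, 120)))  # 100

# A 30-year-old on a test whose items top out at age 21 cannot reach 100.
print(round(ratio_iq(30 * 12, 30 * 12, max_mental_age_months=21 * 12)))  # 70
```

Working in months, as the text does, avoids fractions of a year and keeps the ratio easy to check by hand.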

As you can see by examining this formula, if one’s mental and chronological age are equal, then one’s IQ score is 100, and that’s originally how the average IQ became 100. (Nowadays, intelligence tests are standardized so that the mean score will always be 100, even as humans’ abilities grow and change across generations.) You can imagine how cool it was to have one number, a simple ratio of mental age to chronological age, to express one’s theoretical intelligence quotient.

But nope: the reason why this approach was never really embraced was that the upper limit depends on the upper age limit for the test. For example, if the items go up only to an age level of 21, anyone with a chronological age of greater than 21 has to have an IQ of less than 100. Not such a good idea and why, today, we don’t think of intelligence as IQ and instead use more descriptive and informative terms for describing someone’s level of intelligence.

We’re now in the fifth edition of the Stanford–Binet (or SB-5, as the publisher terms it), which has undergone many different revisions over a 100-year time span. The SB-5 is used in a variety of ways, other than to assess intelligence or diagnose developmental disabilities. The publisher’s website (http://www.hmhco.com/hmh-assessments/other-clinical-assessments/stanford-binet#tab-sb5) lists such potential uses as

• clinical and neuropsychological assessment,
• early childhood assessment,
• evaluation for special education placements,
• adult social security and workers’ compensation evaluations, and
• research on abilities and aptitudes.

Today’s Stanford–Binet assesses (in people from about 2 to 85+ years of age!) five different factors (fluid reasoning, knowledge, quantitative reasoning, visual-spatial processing, and working memory), these areas being an outgrowth of the fluid, crystallized, and memory focal points mentioned earlier. And when each of these areas is assessed and scored separately in verbal and nonverbal form, you have 10 subtests (verbal fluid reasoning, nonverbal fluid reasoning, etc.). In the end, each individual test taker has a score for verbal and nonverbal performance and a full-scale IQ.

Newer editions of the Stanford–Binet, and most intelligence tests, have been developed with regard to avoiding cultural bias in task and item selection. There are several strategies for this, including using a representative sample of people (in terms of characteristics like race, ethnicity, gender, and so on) on which to norm the instrument and choose items, and using statistical analyses to identify and remove sources of bias. Chapter 15 talks about these statistical strategies, which use Item Response Theory methods (which we introduced you to in Chapter 6).

The only thing we have left to talk about regarding the Stanford–Binet (in our very brief overview) is how the test is administered. This is interesting because it provides a basis for how most types of intelligence tests are given.

What’s the Score? Administering the Stanford–Binet (and Other Tests of Intelligence)

Think about how you might assess someone’s level of intelligence, almost regardless of the different models of intelligence we have discussed so far. One strategy that makes sense is to challenge the person with more and more difficult tasks until they aren’t successful, push them a little bit more just to be sure you’ve found their level, and then assign a score based on how far they got. You can think of this as the Jeopardy strategy, like on that TV game show where questions are arranged in order of difficulty.

Well, this is just what most intelligence tests do. Most of these tests begin with some pretty easy questions and tasks to find a starting point and then gradually introduce more difficult items to see what the individual can do. Eventually, the items become so difficult that it would not be wise to test any further. This range of difficulty helps define the starting and ending points in the scoring of performances on such tests as the Stanford–Binet.

The basal age is the lowest point on the test where the test taker can pass two consecutive items that are of equal difficulty. The examiner knows that at this point, they can feel confident that the test taker is on firm ground and could pass all the items that are less difficult. Then there’s ceiling age, and this is the point where at least three out of four items are missed in succession. This is the point at which going any further would probably result in additional incorrect responses. In other words, it’s the test taker’s limit, and time to stop.

The number correct on the test is then used to compute a raw score, which is converted to a standardized score and then compared with other scores in the same age group. And because the Stanford–Binet is a standardized test and has had extensive norms established, such comparisons are a cinch to make and very useful.

Stephen Jay Gould, the late and great Harvard historian of science and paleontologist, wrote a classic book titled The Mismeasure of Man. His thesis is that IQ scores are a terrible tool for classifying people. One of his more shocking examples is how World War II Eastern European immigrants were given IQ tests as they arrived in this country seeking political asylum. They failed in miserable numbers, and a large proportion were denied their freedom, sent back to their home countries, and died at the hands of brutal bullies. Why did so many fail? This is an easy one: They didn’t speak English, the language of the tests! We’ll talk much more about this issue in Chapter 15, but for now, keep in mind that intelligence scores are often as much about the format of the test as they are about everyday human behavior. And if this example seems ridiculous, is it that much different than judging students’ intelligence simply on how well they read?
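The basal and ceiling rules described earlier in this section lend themselves to a small sketch. The Python below is only an illustration of the rules as worded in the text (two consecutive passes for the basal, at least three misses among four successive items for the ceiling). Actual administration manuals are more detailed, and the function name `administer`, the `would_pass` list, and the way the raw score is credited are all our own assumptions.

```python
from typing import Dict, Optional, Sequence


def administer(would_pass: Sequence[bool], start: int = 0) -> Dict[str, Optional[int]]:
    """Simulate a basal/ceiling administration.

    would_pass[i] says whether this (hypothetical) test taker would pass
    item i; items are assumed to be ordered from easiest to hardest.
    """
    n = len(would_pass)
    if n < 2 or not 0 <= start <= n - 2:
        raise ValueError("need at least two items and a valid starting item")

    # Basal: drop back toward easier items until two in a row are passed.
    basal = start
    while basal > 0 and not (would_pass[basal] and would_pass[basal + 1]):
        basal -= 1

    # Ceiling: move forward until at least 3 of the last 4 items are missed.
    ceiling: Optional[int] = None
    for i in range(basal + 3, n):
        misses = sum(1 for passed in would_pass[i - 3:i + 1] if not passed)
        if misses >= 3:
            ceiling = i
            break

    # Raw score (assumed rule): full credit for every item below the basal,
    # plus one point per item passed from the basal up to the stopping point.
    last_item = ceiling if ceiling is not None else n - 1
    raw_score = basal + sum(would_pass[basal:last_item + 1])
    return {"basal": basal, "ceiling": ceiling, "raw_score": raw_score}


responses = [True, True, True, True, False, True, False, False, False, True]
print(administer(responses, start=2))
# {'basal': 2, 'ceiling': 7, 'raw_score': 5}
```

Note that the last item (which this test taker would have passed) is never given, because testing stops at the ceiling. That is the trade-off behind the rule: a shorter session in exchange for occasionally missing a stray correct answer.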

AND THE FAB FIVE ARE . . .

There are many, many different types of intelligence tests, ranging from those you can find on the internet and take for fun (no training needed, and you get what you pay for) to those that require years of training to administer and score successfully. In Table 9.1, we show the range of the latter kind: those that take training to learn how to administer and score, and those that have a strong theoretical background and utility in our testing world. In other words, these are accepted as valid when used correctly.

VALIDITY AND RELIABILITY OF INTELLIGENCE TESTS

Validity is a unitary concept, and all types of validity arguments and evidence can come into play in determining validity of a measure, but certain validity approaches seem particularly important for different tests with different purposes. Intelligence tests, as their name indicates, are designed to measure that very complex and abstract construct we call intelligence. Consequently, discussions of validity for these tests often focus on that construct: whether it is defined correctly, whether the assumed dimensionality of the construct is accurate (is it one thing, general g, or different things, like a verbal piece and a performance piece?), and whether the items and subscales are consistent with those dimensions.

TABLE 9.1  Five Widely Used Intelligence Tests

Title/Acronym (or What It’s Often Called): Stanford–Binet Intelligence Scale, 5th Edition (SB-5)
Purpose and What It Tests: The Stanford–Binet Intelligence Scale is designed as “an instrument for measuring cognitive abilities that provides an analysis of the pattern as well as the overall level of an individual’s cognitive development.” It provides 20 scores: verbal reasoning (vocabulary, comprehension, absurdities, and verbal relations); abstract-visual reasoning (pattern analysis, copying, matrices, and paper folding and cutting); quantitative reasoning (quantitative, number series, and equation building); short-term memory (bead memory, memory for sentences, memory for digits, and memory for objects).
Ages Tested: Ages 2 years through adulthood
Conceptual Framework: This most popular of all intelligence tests focuses on both verbal and nonverbal tests of intelligence.
What’s Interesting to Note:
1. Uses adaptive testing procedure, where test takers answer only those items whose difficulty is appropriate for their performance level
2. Uses standard age scores for comparison between test takers
3. The vocabulary subtest is given first to help identify a starting place for other tests.

Title/Acronym (or What It’s Often Called): Wechsler Adult Intelligence Scale (WAIS). The 4th edition is the most recent. (There are also Wechsler scales for children and even for preschoolers.)
Purpose and What It Tests: The WAIS is designed to assess intellectual ability by testing verbal skills (vocabulary, similarities, arithmetic, digit span, information, comprehension, and letter-number sequencing) and performance (picture completion, coding, block design, matrix reasoning, visual puzzles, figure weights, cancellation, and symbol search). There is also a verbal comprehension index, a perceptual organization index, a working memory index, and a processing speed index.
Ages Tested: Ages 16 through 89 years
Conceptual Framework: Wechsler viewed intelligence as a merging of the general factor (g) that Spearman identified, coupled with distinct abilities. His practical emphasis was on the ability to act with a purpose in mind and to think in a logical manner.
What’s Interesting to Note:
1. The latest sample used for standardization matches 1995 census data with respect to gender, socioeconomic status, race and ethnicity, educational attainment, and geographical residence.
2. Oldest and most frequently used adult intelligence test
3. David Wechsler, the author of the original test, studied with Simon (of the Simon-Binet Scale) in Paris, and some 30 years later, Wechsler started the development of his own test.
4. Based on a commonsense view of intelligence, where intelligent behavior is behavior that is adaptive and useful.

Title/Acronym (or What It’s Often Called): Kaufman Assessment Battery for Children (K-ABC). The current edition is the KABC-II NU.
Purpose and What It Tests: The K-ABC assesses cognitive development in 16 subtests: Magic Window, Face Recognition, Hand Movements, Gestalt Closure, Number Recall, Triangles, Word Order, Matrix Analogies, Spatial Memory, Photo Series, Expressive Vocabulary, Faces and Places, Arithmetic, Riddles, Reading Decoding, and Reading Understanding.
Ages Tested: Ages 2.5 to 12.5 years
Conceptual Framework: Subtests are grouped into categories for those that require sequential processing of information and those that require simultaneous processing, with the test taker receiving separate global scores for each. This is basically an information-processing approach.
What’s Interesting to Note:
1. A nonverbal scale is available for hearing impaired, speech and language disordered, and non-English-speaking children ages 4 to 12.5 years. One of the first tests to provide for these populations of children
2. Lots of pictures and diagrams among the items
3. Sociocultural norms for race and parental education were developed and can be used along with other norms. They can also help in scoring and interpretation.

Title/Acronym (or What It’s Often Called): Slosson Full-Range Intelligence Test (S-FRIT). The most recent edition is the 4th, called the SIT-4.
Purpose and What It Tests: As a quick estimate of general cognitive ability, the Slosson Full-Range Intelligence Test offers verbal, performance, and memory subtests, with the performance subtest divided into abstract and quantitative sections.
Ages Tested: Ages 5 to 21 years
Conceptual Framework: Items reflect a concentration on assessment of both crystallized and fluid intelligence.
What’s Interesting to Note:
1. Uses a picture book for some questions
2. Items are arranged in order of difficulty so examiner can get a very quick idea as to test taker’s level of intelligence.
3. Often used as a brief intelligence assessment, then followed up by more extensive tests such as the WAIS.

Title/Acronym (or What It’s Often Called): McCarthy Scales of Children’s Abilities (MSCA)
Purpose and What It Tests: The McCarthy Scales were developed to “determine . . . general intellectual level as well as strengths and weaknesses in important abilities.” Scales tested (using 18 different subtests) are verbal, perceptual-performance, quantitative, composite, general cognitive, memory, and motor.
Ages Tested: The test is used with children from ages 2 years, 4 months to 8 years, 7 months.
Conceptual Framework: McCarthy thought it very important to emphasize a test (and the items on it) construction that can identify clinical and educational weaknesses in the child.
What’s Interesting to Note:
1. Test materials include games and toys that help get children’s interest and maintain their attention.
2. Good predictor of school achievement
3. The MSCA is sometimes used as a screening test for readiness to enter a specific school grade.


Because an intelligence test is an aptitude test (remember, intelligence tests are meant to predict academic performance), you might think that debates about its validity center on predictive-criterion validity: do IQ scores correlate with future grade point average in school, for example? Developers of these tests (many of which are owned by for-profit companies) do make an effort to demonstrate that relationships like this ARE found, but they want to make sure we know that intelligence is not the same thing as achievement, so they don’t want those correlations to be too big.

In terms of reliability, intelligence tests are high-stakes tests. That is, decisions about individuals are made based on a single score and those decisions affect the individuals. For a high-stakes test, it is important that there be very little randomness in scores. Because these tests are made up of many subtests and many items (they are usually given one-on-one and take about an hour and a half to administer) and these items tend to correlate with one another (there’s that general g again), intelligence tests demonstrate very high internal consistency reliability. Further, intelligence is considered a stable trait that doesn’t change much as children grow older, so test–retest reliability is critical as well. Your score in the third grade is probably about the same as how you would score in the eleventh grade and as a 40-year-old.
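As a rough illustration of the two reliability ideas just described, the sketch below computes coefficient (Cronbach’s) alpha as one common index of internal consistency and a Pearson correlation as a test–retest estimate. The passage above does not say which indices test publishers report, so the choice of alpha is our assumption, and every number in the example is made up purely for demonstration.

```python
import math
import statistics
from typing import Sequence


def cronbach_alpha(scores: Sequence[Sequence[float]]) -> float:
    """Coefficient alpha for a people-by-items table of item scores."""
    k = len(scores[0])                       # number of items
    item_columns = list(zip(*scores))        # one tuple of scores per item
    sum_item_variances = sum(statistics.variance(col) for col in item_columns)
    total_variance = statistics.variance([sum(row) for row in scores])
    return k / (k - 1) * (1 - sum_item_variances / total_variance)


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation between two sets of paired scores."""
    mean_x, mean_y = statistics.fmean(x), statistics.fmean(y)
    cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cross / math.sqrt(ss_x * ss_y)


# Hypothetical data: five test takers (rows) answering four items (columns).
item_scores = [
    [2, 3, 3, 2],
    [4, 4, 5, 4],
    [1, 2, 1, 1],
    [5, 5, 4, 5],
    [3, 3, 3, 4],
]
print(round(cronbach_alpha(item_scores), 2))  # 0.96

# Hypothetical scores for the same five people tested on two occasions.
first_testing = [98, 105, 112, 120, 91]
second_testing = [101, 103, 115, 118, 94]
print(round(pearson_r(first_testing, second_testing), 2))  # 0.98
```

Because the made-up items rise and fall together across people, alpha comes out high, which mirrors the point that correlated items produce strong internal consistency.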

Summary

Guess what? The questions about what intelligence is and how it should be interpreted, with which we opened the chapter, have surely not been resolved; intelligence is still defined within the context of how it is being measured and probably always will be. But the problem there is that tests are designed to assess outcomes, not to define the outcomes themselves. Are we back to the beginning? Not quite. Through extensive research and test development, we know that people who score higher on tests of intelligence are often more adaptive and generally can figure out how to solve problems and many other types of challenges that we face each day. That doesn’t mean that your basic neighborhood genius can figure out how to remove a virus from their laptop or that your basic neighborhood average fellow who does just okay on an intelligence test does not excel in his job. It just means that, in some ways, intelligence is a social construct into which tests provide some insight. And because it is related to academic and work success, identifying a discrepancy between a person’s IQ and what is typical is a useful activity for those who care about success in school and in life.

Time to Practice

1. Describe the behaviors of a person who you think is very intelligent. Now describe the behaviors of a person who you think is not very intelligent. (Stay away from any kind of physical characteristics, such as gender and head size.) Now, how do these two people differ from each other, and is this enough information to start thinking about what you consider to be a definition of intelligence?
2. Given your definition of intelligence from the previous question, design what you think would be a terrific five-item test of intelligence, and provide a rationale for the items you designed.
3. Using library resources and the internet, find five different definitions of intelligence. Once you find them, determine what they might have in common with one another. What makes them different? Write one paragraph about how these similarities or differences would lead to different ways of measuring intelligence.
4. When Howard Gardner first presented his theory of multiple intelligences, there were only seven intelligences; the eighth (naturalist intelligence) was added later. Okay, now it’s your turn. See if you can come up with a ninth, and provide a rationale for why it is different from the other eight.
5. Discuss your answer to the following question with your classmates: How has the change in how intelligence has been measured (from Spearman to Gardner) reflected the role we think intelligence plays in our everyday lives?
6. Why do you think some intelligence tests instruct administrators to stop administering a subtest once they have determined the ceiling age?
7. Search online or in your physical library for five articles from the popular press (Time, Newsweek, or any major newspaper such as the New York Times or Washington Post) on the general topic of intelligence. What aspects of intelligence do they cover? Which is the most interesting to you and why? What does the content of these articles reveal about how the measurement of intelligence might have changed over the years?

Want to Know More?

Further Readings

• Gould, S. J. (1993). The mismeasure of man. New York: Norton.

If you are stuck on an island and have only one book to read (yikes!), this is one that you should consider picking. It reviews the measurement of intelligence and how definitions and the measurement have been used in a social context and how that has affected certain select groups of people.

• Doidge, N. (2007). The brain that changes itself: Stories of personal triumph from the frontiers of brain science. London: Penguin Books.

We used to assume that your individual level of intelligence doesn’t change, but can it? Maybe!

And on Some Interesting Websites

• The Consortium for Research on Emotional Intelligence in Organizations shows the application of this model at http://www.eiconsortium.org/.

• Find out just about everything you ever wanted to know (and more) about human intelligence at the Human Intelligence: Historical Influences, Current Controversies, Teaching Resources website at http://www.intelltheory.com/.

And in the Real Testing World

Real World 1

The whole issue of intelligence across cultures is amazingly interesting and just as complex, and the economic powerhouse of China finds itself developing and using more intelligence tests than ever. This article summarizes the development and use of intelligence tests in China, including the historical development of testing in that country, and reviews research using intelligence tests in China, including difference studies, correlational studies, and studies of special groups.

Want to know more? Higgins, L. T., & Xiang, G. (2009). The development and use of intelligence tests in China. Psychology & Developing Societies, 21, 257–275.

Real World 2

Tons of data from the six most popular intelligence tests were carefully analyzed to understand general g. Perhaps the most interesting discovery was that “fluid intelligence” is so indistinguishable statistically from general g that they may be the same thing!

Caemmerer, J. M., Keith, T. Z., & Reynolds, M. R. (2020). Beyond individual intelligence tests: Application of Cattell-Horn-Carroll theory. Intelligence, 79.

Real World 3

More on race and ethnicity, with some gender and educational factors tossed in as well. Not only are these variables filled with interesting asides, but studying them together also makes the whole set of issues increasingly complex. This study investigated the validity of the Wide Range Intelligence Test with participants ranging in age from 5 through 85 years and varied by the demographic variables of gender, race/ethnicity (White, African American, Hispanic), and education level (less than a high school degree, high school degree, some postsecondary training, college and beyond). Most of the comparisons revealed no statistically significant between-group differences. The majority of statistically significant differences that were found have little practical influence because they were so tiny.

Want to know more? Shields, J., Konold, T. R., & Glutting, J. (2004). Validity of the Wide Range Intelligence Test: Differential effects across race/ethnicity, gender, and education level. Journal of Psychoeducational Assessment, 22, 287–303.

10 PERSONALITY AND NEUROPSYCHOLOGY TESTS
It’s Not You, It’s Me

Difficulty Index ☺ ☺ ☺ ☺ (moderately easy)

LEARNING OBJECTIVES

After reading this chapter, you should be able to

• Describe the purpose and approach of personality tests.
• List the common steps for developing personality tests.
• Explain the basic statistical strategy of factor analysis.
• Compare and contrast several major personality tests.
• Describe the purpose and approach of neuropsychological tests.
• Explain why a battery of tests is usually used in neuropsychological testing.
• Demonstrate understanding of the practice of forensic assessment.
• Apply the concepts of validity and reliability to personality and neuropsychology tests.


This is the most interesting thing about personality: Everyone has one, and by definition, it tends to be pretty stable across the entire life span (and that’s a really long time). Some of us are happier than others, some more reflective, some angrier, and some of us would rather meet on Zoom than go into the office. No wonder it’s such a rich area for psychologists and other social and behavioral scientists to explore. Some of us are very shy and not interested in meeting others, and some of us like to rock ’n’ roll at large gatherings, meeting all kinds of new people and being very adventurous. And what’s also interesting is that in spite of the fact that personality testing is, in many ways, not as “accurate” as other forms of testing (such as achievement), it is used in hundreds and hundreds of different settings, from businesses to prisons to hospitals. For example, the top Fortune 100 companies regularly use tests such as the Myers-Briggs, trying to find out how well employees might work together. You can find these tests used in the military, as criteria for admission into educational and training programs of various sorts, and in other areas where a good “reading” on what someone is like, and perhaps even why they are that way, takes on importance. And, of course, personality tests are frequently used to assess psychological well-being and diagnose brain disorders.

WHAT PERSONALITY TESTS ARE AND HOW THEY WORK

Personality tests measure those enduring traits and characteristics of an individual that aren’t physical and aren’t abilities: such things as attitudes, values, interpretations, and style. The measurement field tends to use the term personality very broadly to describe any test that assesses feelings, emotions, attitudes, and ways of thinking, not just what are technically personality traits (of which there may be just five; stay tuned!), but this chapter only covers big-deal official personality instruments and neuropsychological (brain) tests.

Although everyone has their own personality, we can get a bit more specific and talk about personality traits and personality types. A personality trait is an enduring quality, like being shy or outgoing. A personality type is a constellation of those traits and characteristics, like being a narcissist (your coauthors are staring nervously at each other).

Objective or Projective: You Tell Me

There are basically two types of personality tests: those that are objective and those that are projective.


Objective personality tests have very clear and unambiguous questions, stimuli, or techniques for measuring personality traits. For example, a structured item might be when a test taker is asked to respond yes or no to the statement, I get along with others. These are test items where there is no doubt about how the test taker can respond: yes or no, agree or disagree. And the interpretation is fairly straightforward (there are scoring rules and it is predetermined what a response “means”). The logic behind personality tests that are constructed like this is that the more items that are agreed with or checked off or somehow selected, the more of that trait or characteristic the test taker has. For example, here’s an item from the 192-item NEO 4, a widely used personality test published by Psychological Assessment Resources (and this item is used with their permission): I usually prefer to do things alone. Test takers indicate whether they strongly agree, agree, are neutral, disagree, or strongly disagree with the item. (By the way, you might look at this item and think “used by permission?! What is so clever about this item that someone claims to own it?” But remember the time and money it took to develop this item and do the research to conclude that it is valid and works well with other items to add to the reliability of the test. It’s worth being proud of!) Projective personality tests have ambiguous or unclear stimuli, and test takers are asked to interpret or impose onto these stimuli their own meaning. The most famous example of this you’ve seen in old movies or the New Yorker cartoons, the Rorschach inkblot test, where test takers are shown what appears to be a blot of ink on a card and asked to tell the test examiner what they see. Figure 10.1 shows an example of a Rorschach-like inkblot. 
The idea behind projective tests is that individuals can impose their own sense of structure on an unstructured event (the inkblot), and in doing so, they reveal important information about their view of the world and the characteristics that are associated with that view. Obviously, the inkblot is not a very structured stimulus, and such tests take a great deal of education and practice to interpret adequately. Projective tests arose from Freudian psychology and the idea that we have a subconscious that has feelings we might not be consciously aware of but that still affect our behavior. By interpreting a picture, the thinking goes, our subconscious thoughts are “projected” onto the image. It’s a similar concept to the phenomenon that you might feel anger toward your therapist, but you are really projecting the anger you have toward your narcissistic coauthor (just to make up a hypothetical example).


FIGURE 10.1  Inkblot

Source: Wikimedia Commons. https://commons.wikimedia.org/wiki/File:Normalized_Rorschach_blot_01.jpg

As you might imagine, the validity of projective tests (and also their reliability, because different test givers can disagree on interpretations) has been questioned. There’s an old joke about the Rorschach (and yes, that character from the Watchmen movie and comic book is named for this test because of the inkblot on his mask):

A psychiatrist is showing inkblots to a patient who keeps describing the most disturbing and sexual interpretations of the images. Finally, the psychiatrist can’t withhold his disgust and yells at the patient, “What is wrong with you, you creep!” “Me?!” the patient replies. “You’re the one showing me these dirty pictures!”

Another very common projective personality test is the Thematic Apperception Test, where the test taker is shown black-and-white pictures of “classical human situations” (there are 31 of them, with one card being blank). The person who is taking the test is asked to describe the events that led up to the scene in the picture, what is going on, and what will happen next. Test takers are also encouraged to talk about the people in the pictures and what they might be experiencing. You can tell from this description how much of a challenge it is to analyze such test-taker responses and then interpret what they mean regarding an individual’s personality.

How Do You Feel?

The earliest forms of personality tests were projective in that physicians interviewed patients, asking them questions about their emotions, feelings, and experiences. No particular response could have been expected, and it was the test giver’s skill at interpreting the patient’s statements that allowed some conclusion to be reached about the test taker’s well-being.


DEVELOPING PERSONALITY TESTS

There are many different ways to develop items for a personality test. The two most popular are, first, to base the items on content and theory and, second, to use a criterion group. Let’s look briefly at both.

Using Content and Theory

Let’s say you want to develop a test that measures someone’s willingness to take risks as a personality characteristic. You’d probably say that this is a personality trait among people who like to do things such as climb mountains, go caving, drive fast, and participate in activities that many of us think could be dangerous. We non–risk takers are just as happy to be at home with a good book. In particular, theories of risk taking (and all the research literature on this topic) lead us to believe that we can predict what kinds of behaviors risk takers will exhibit. We can then formulate a set of questions that allows us to take an inventory of one’s willingness to take risks through simple self-report questions such as

I would rather climb a mountain than play a noncompetitive sport.

or

I like to spend most of my spare time at home relaxing.

If we follow a theory that says people with certain personality traits would rather do one than the other, we have some idea of what type of content can assess what types of traits. We would probably ask people if they agreed or disagreed with each item. The more items pointing in a particular direction that the test taker agreed with, the more evidence we would have that that trait or characteristic is present. And within this same way of constructing personality test items, we could also ask one of our experts on this particular trait we are calling risk taking whether they think that each item fairly assesses the trait. This is sort of like the content validity we talked about in Chapter 4 (are the items on the test a fair representation of the items that could be on the test?).

The strategy we just described is the way the Woodworth Personal Data Sheet (one of the earliest formal personality tests, created 100 years ago) was developed. Woodworth went around and talked with psychiatrists about what they saw in their practice, and he also consulted lots and lots of literature on symptoms of neuroses. He then created items from the long list of comments he got and information he gleaned from what others had written.
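The agree/disagree scoring idea described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the two items are the sample statements from the text, the decision to treat the second item as reverse keyed (agreeing with it counts against risk taking) is our reading of the example, and the function name `risk_score` is made up for this sketch.

```python
# Each item is (statement, reverse_keyed). Reverse-keyed items count toward
# the trait when the test taker DISAGREES. The keying here is an assumption.
ITEMS = [
    ("I would rather climb a mountain than play a noncompetitive sport.", False),
    ("I like to spend most of my spare time at home relaxing.", True),
]

def risk_score(answers, items=ITEMS):
    """Count how many answers point toward risk taking.

    answers: list of booleans, True meaning "agree", one per item.
    """
    score = 0
    for agreed, (_statement, reverse_keyed) in zip(answers, items):
        points_toward_trait = (not agreed) if reverse_keyed else agreed
        score += int(points_toward_trait)
    return score

# A test taker who agrees with the mountain item and disagrees with the
# stay-at-home item gets the highest possible score on this tiny inventory.
print(risk_score([True, False]))
```

A real inventory would have many more items, and the keying of each item would come from theory and expert review rather than from a guess.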


Content validity plays just as important a role in the development of a personality test as it does in the development of an achievement test. The question is whether what you are testing for is evident in the items you are creating.

Using a Criterion Group

The use of a criterion group is even more interesting. Here, the test developer looks for a group of potential test takers that differs from another group of potential test takers. And on what do they differ? Those personality traits and characteristics that are being studied! This is a process of discriminating between different sets of people who have already been classified as belonging to a certain group, behaving in a particular fashion, or having a particular occupation. For example, let’s say we are interested in developing a set of personality items that taps into whether people are depressed. Well, one way to create and test such items might be as follows:

1. Explore the previous scholarly literature on depression and get some idea of what distinguishes people who are depressed from those who are not depressed.

2. Generate a bunch of test items from this review of literature. For example, the literature you read may indicate that those who are depressed have trouble sleeping or have low motivational energy.

3. Review items and make sure they make sense. Do they tap into the trait or characteristic you want to test?

4. Use the good items (those that are clearly represented in your literature review and have good psychometric qualities) in the first draft of the test.

5. Most important, administer the set of items that has been created to two groups: those people who you think well represent the criterion group (perhaps those who have been officially diagnosed by doctors using the professional gold standard Diagnostic and Statistical Manual of Mental Disorders) and a random sample of people found at a park.

6. Now, in theory, you have these two groups, and you should be able to decide fairly well which of the items you created for the test discriminate best. Which items score higher in the depression group than in the other group? These become your items of choice. There’s a lot more work ahead, but this is a good start.
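Step 6 above, picking the items that discriminate best, can be illustrated with a small Python sketch. Everything specific here is hypothetical: the tiny response tables are made up, the function names are ours, and the cutoff `min_gap` of 0.25 is an arbitrary choice for the example, not a published standard.

```python
def endorsement_rate(responses, item):
    """Proportion of people in a group who endorsed (answered 1 to) an item."""
    return sum(person[item] for person in responses) / len(responses)

def discriminating_items(criterion, comparison, min_gap=0.25):
    """Return indexes of items endorsed more often by the criterion group.

    criterion, comparison: lists of people, each a list of 0/1 item answers.
    min_gap: hypothetical minimum difference in endorsement rates.
    """
    kept = []
    for item in range(len(criterion[0])):
        gap = endorsement_rate(criterion, item) - endorsement_rate(comparison, item)
        if gap >= min_gap:
            kept.append(item)
    return kept

# Made-up answers to three draft items (1 = agrees with the statement).
criterion_group = [[1, 1, 0], [1, 0, 1], [1, 1, 0], [1, 0, 1]]
comparison_group = [[0, 1, 0], [0, 0, 1], [1, 1, 0], [0, 0, 1]]

# Only the first item separates the two groups in this toy data set.
print(discriminating_items(criterion_group, comparison_group))
```

In practice a test developer would use far larger samples and a statistical test of the difference rather than a fixed gap.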
A terrific historical example of this method is the development of the 567-item Minnesota Multiphasic Personality Inventory (MMPI), first published in 1943.


The authors started at the beginning, where they generated as well as collected items from a variety of different sources, including their own case studies, books, journals, and anything else they thought might work.

1. All the items were reviewed, and they settled on 1,000 statements, which were then reduced to 504 after all similar types of questions and such were eliminated.

2. Criterion groups were selected: what the authors defined as a “normal” group (undergraduates at the University of Minnesota where the test was invented; that’s right, college students like you are considered normal!) and a clinical group consisting of patients in the psychiatric clinic at the university’s hospital.

3. Each of the 504 items was given to all participants.

4. The items that discriminated best were kept, and the ones that did not were either revised or discarded.

5. These best items were used to create the test.

6. The final scales themselves were created.

One would expect the participants in the clinical sample to score differently from the participants in the “normal” group, and they indeed did. In fact, today’s MMPI contains 10 different scales that represent the areas in which the groups distinguished themselves from one another. These areas are as follows:

Scale 1. Hypochondria (concerns about illness)

Scale 2. Depression (clinical depression)

Scale 3. Hysteria (converting mental conflicts into physical attributes)

Scale 4. Psychopathic deviate (history of antisocial or delinquent behavior)

Scale 5. Masculinity-femininity (masculine and feminine characteristics)

Scale 6. Paranoia (concerns about being watched or persecuted)

Scale 7. Psychasthenia (obsessive behaviors)

Scale 8. Schizophrenia (distortions of reality)

Scale 9. Hypomania (excessive mood swings)

Scale 10. Social introversion (high degree of introversion or extraversion)

And these 10 scales became the factors or dimensions along which each taker of the MMPI is rated.
Administering (and understanding the results of) personality tests is not for mere mortals like you and me. We could give them, sure, but understanding the results is a whole different story. Almost all users of these tests have undergone years of extensive training; they usually have an advanced degree in clinical psychology, counseling, or school psychology; and they are approved by the American Psychological Association. These roles are definitely not for the armchair psychologist. Want more information about becoming one of these talented folks? Visit https://www.apa.org/education-career#careers and read about a career as a psychologist.

THE MANY DIMENSIONS OF FACTOR ANALYSIS

Factor analysis surely is a mouthful of words. It sounds (and is a little) scary, but this statistical method for validating personality tests and other measures that assess complex constructs reveals the hidden dimensions of what is being measured.

First, a bit about the technique. Factor analysis is a way of looking at relationships between variables and helping identify which of those variables relate to one another. And if some variables do relate to one another more than to other variables in the set, we might call this group of variables a factor. For example, if you found that Variables A, B, and C in a set of 10 are related to one another (there is a large correlation between each pair of variables), and then examined them closely for what they measure (let’s say that Variable A is reading comprehension, B is reading fluency, and C is correct use of reading strategies), we might collectively call this shared underlying variable a factor and name it reading ability. When we use factor analysis, we look for patterns of commonality. This method can be used to find correlational patterns among items as well, and it is used by test developers in the real world all the time.

The strategy for developing a test based on factor analysis is to administer a whole bunch of tasks, which may be objective or projective, and then correlate scores on one task with all other tasks. Next, you look for similarities about sets of tasks (those items that correlate together) and give them a factor name, such as shyness or confidence. And you conclude that all those items represent that variable, as they “load” on that factor.

The Sixteen Personality Factor Questionnaire (also known as the 16PF, now in its fifth edition) is one such test that was developed based on the research of Raymond Cattell (the same fellow who helped create the Horn-Cattell intelligence model presented in Chapter 9).
After the collection of many items, administration, and factor analysis, the following 16 factors (on all of which people can and do differ) form the structure of the test.


• Warmth (reserved vs. warm)

• Apprehension (self-assured vs. apprehensive)

• Reasoning (concrete vs. abstract)

• Dominance (deferential vs. dominant)

• Emotional stability (reactive vs. emotionally stable)

• Liveliness (serious vs. lively)

• Social boldness (shy vs. socially bold)

• Rule consciousness (expedient vs. rule conscious)

• Sensitivity (utilitarian vs. sensitive)

• Openness to change (traditional vs. open to change)

• Vigilance (trusting vs. vigilant)

• Self-reliance (group oriented vs. self-reliant)

• Abstractedness (grounded vs. abstracted)

• Perfectionism (tolerates disorder vs. perfectionistic)

• Privateness (forthright vs. private)

• Tension (relaxed vs. tense)
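To make the idea of a factor concrete (variables that correlate more strongly with one another than with the rest of the set), here is a minimal Python sketch using NumPy. It is a simplified stand-in, not a full factor analysis: it simulates made-up scores driven by two hidden traits, groups variables whose correlations exceed an arbitrary threshold, and counts eigenvalues of the correlation matrix greater than 1 (a common rule of thumb for how many factors to keep). All names, sample sizes, and thresholds are assumptions chosen for illustration.

```python
import numpy as np

def simulate_scores(n_people=500, seed=0):
    """Simulate six test scores driven by two hidden traits (made-up data)."""
    rng = np.random.default_rng(seed)
    reading = rng.normal(size=n_people)     # hidden trait behind columns 0-2
    arithmetic = rng.normal(size=n_people)  # hidden trait behind columns 3-5

    def noisy(trait):
        # Each observed score is the hidden trait plus measurement noise.
        return trait + rng.normal(scale=0.5, size=n_people)

    return np.column_stack([noisy(reading), noisy(reading), noisy(reading),
                            noisy(arithmetic), noisy(arithmetic), noisy(arithmetic)])

def group_by_correlation(corr, threshold=0.5):
    """Greedily group variables that correlate above the threshold."""
    n = corr.shape[0]
    assigned, groups = set(), []
    for i in range(n):
        if i in assigned:
            continue
        group = [j for j in range(n)
                 if j not in assigned and (j == i or corr[i, j] > threshold)]
        assigned.update(group)
        groups.append(group)
    return groups

def count_large_eigenvalues(corr):
    """Count eigenvalues of the correlation matrix that are greater than 1."""
    return int(np.sum(np.linalg.eigvalsh(corr) > 1.0))

scores = simulate_scores()
corr = np.corrcoef(scores, rowvar=False)  # 6 x 6 correlation matrix
groups = group_by_correlation(corr)
n_factors = count_large_eigenvalues(corr)
print(groups, n_factors)
```

With this simulated data, the first three variables should land in one group and the last three in another, which is the kind of pattern a test developer would then name (for example, reading ability). Dedicated factor analysis software goes further by estimating how strongly each variable loads on each factor.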

The NEO 4 (and you saw a sample item earlier) is also a very well-known personality test that was constructed based on a factor-analytic framework. The scores from many different personality tests and items were factor analyzed and it was found that there were five, only five, personality traits that seemed to explain all the scores, even though the tests were believed to measure dozens of different traits! So the test is based on a five-factor model, as follows:

• Neuroticism

• Extraversion

• Openness

• Agreeableness

• Conscientiousness

Examples of items that tap into these domains are

17. I often crave excitement.

50. I have an active fantasy life.

127. Sometimes I trick people into doing what I want.

In the development of personality tests using the factor analysis approach, identification of and adherence to a particular personality theory are the most important elements.


One of these tests is the (Allen) Edwards Personal Preference Schedule, based on personality research by Henry Murray and first presented in 1938. Edwards created an inventory of items based on the characteristics or needs that Murray identified. The current form of the Edwards scale consists of 210 pairs of statements and 15 different scales such that the test takers choose which of the two statements most characterizes themselves. The scales are as follows:

• Achievement

• Dominance

• Deference

• Abasement

• Order

• Nurturance

• Exhibition

• Change

• Autonomy

• Endurance

• Affiliation

• Heterosexuality

• Intraception

• Aggression

• Succorance

Those Edwards scales have some interesting names, don’t they? Most are self-explanatory, but here’s some help on those that may not be. Intraception is the need to analyze other people’s behaviors, and succorance is the need to receive support and attention from others. And just in case you’re really curious, abasement is the need to accept blame for problems and confess errors to others. (Interestingly, all three of these terms appear in Bruce’s LinkedIn bio.)

The Dating Game

You’ve heard all about dating services such as Match.com and eHarmony. Well, as it turns out, many of these are based on some scientific rationale that sounds a lot like the discussions we are having about personality tests in this chapter. For example, let’s take eHarmony. It uses a model that is based on clinical research that in theory matches people based on certain characteristics that are associated with successful relationships. In other words, it uses what the literature has revealed works, as far as people being compatible, successful, and so forth, and then tries to match clients on those. How well does it work? Every one of these services claims success, but there are few empirical data that address the efficacy of the matches once they are made, let alone the long-term outcomes. Look to the future.

WHAT THEY ARE: A SAMPLING OF PERSONALITY TESTS AND WHAT THEY DO

Table 10.1 shows you five personality tests and some information about what they do and how they do it. Notice the diversity, especially in their format, from an inkblot (the Rorschach) to a test where the test taker draws a person. So many ways to show that you’ve got personality!

TABLE 10.1  Some Widely Used Personality Tests

Title/Acronym (or What It’s Often Called): Minnesota Multiphasic Personality Inventory (MMPI-2)
Purpose and What It Tests: The assessment of a number of the major patterns of personality and emotional disorders
Grade Levels/Ages Tested: Ages 18 and over
Conceptual Framework: Based on the distinction between normal and pathological groups
What’s Interesting to Note:
• Rejected by several publishers in the 1940s, but it went on to become one of the most popular tests.
• Widely used as a research tool
• The original Minnesota normal sample consisted of 724 relatives and other visitors at University of Minnesota hospitals.

Title/Acronym (or What It’s Often Called): NEO Personality Inventory. The current version is the NEO-PI-Revised.
Purpose and What It Tests: The measurement of five major dimensions or domains of normal adult personality
Grade Levels/Ages Tested: Ages 17 and over
Conceptual Framework: Five-factor model of personality: Openness to experience, conscientiousness, extraversion, agreeableness, and neuroticism
What’s Interesting to Note:
• Based on original NEO personality test
• Normative group consisted of volunteers.

Title/Acronym (or What It’s Often Called): Rorschach inkblot
Purpose and What It Tests: A tool for clinical assessment and diagnosis
Grade Levels/Ages Tested: Ages 5 and over
Conceptual Framework: Psychoanalytic conceptual basis (based on Freud’s theory that human behavior is conflicted)
What’s Interesting to Note:
• Hermann Rorschach experimented with inkblots for 4 years; he died at age 37, within months of his book’s publication, and never saw the further development of his test.
• This test has always been controversial in its use, but also quite popular.

Title/Acronym (or What It’s Often Called): 16PF Adolescent Personality Questionnaire (APQ)
Purpose and What It Tests: The measurement of normal personality of adolescents and identification of personality problems
Grade Levels/Ages Tested: Ages 11 to 22
Conceptual Framework: Based on the earlier 16PF, which assesses 16 bipolar dimensions:
• Warmth (reserved vs. warm)
• Reasoning (concrete vs. abstract)
• Emotional stability (reactive vs. emotionally stable)
• Dominance (deferential vs. dominant)
• Liveliness (serious vs. lively)
• Rule consciousness (expedient vs. rule conscious)
• Social boldness (shy vs. socially bold)
• Sensitivity (utilitarian vs. sensitive)
• Vigilance (trusting vs. vigilant)
• Abstractedness (grounded vs. abstracted)
• Privateness (forthright vs. private)
• Apprehension (self-assured vs. apprehensive)
• Openness to change (traditional vs. open to change)
• Self-reliance (group oriented vs. self-reliant)
• Perfectionism (tolerates disorder vs. perfectionistic)
• Tension (relaxed vs. tense)
What’s Interesting to Note:
• Assesses normal personality
• Assesses adolescents’ reasoning ability and work activity preferences
• Can provide career counseling information

Title/Acronym (or What It’s Often Called): Draw a Person (DAP)
Purpose and What It Tests: Originally developed to assess the personality-emotional characteristics of sexually abused children
Grade Levels/Ages Tested: Ages 7 to 12
Conceptual Framework: Examines five constructs:
• Preoccupation with sexually relevant concepts
• Aggression and hostility
• Withdrawal and guarded accessibility
• Alertness for danger/suspiciousness
• Lack of trust
What’s Interesting to Note:
• Very difficult to establish the reliability and validity for such tests. Quantifiable scoring procedures have recently been developed.
• Author cautions that allegations about sexual abuse should not be based on only one test.
• A very unique approach to testing for personality and emotional difficulties that may be related to sexual abuse


WHAT NEUROPSYCHOLOGICAL TESTS ARE AND HOW THEY ARE USED

Neuropsychology is the study of the relationship between the brain and behavior. Neuropsychological tests assess cognitive skills based on the performance of certain tasks, such as remembering a list of words or arranging a group of shapes to match a particular design. These tasks provide insight for highly trained professionals (such as neuropsychologists, rehabilitation psychologists, and psychiatrists) to better understand how the brain affects behavior and, when things go wrong (infection, disease, accidents), which prescriptive steps might be taken to help.

“Neuropsych” tests are usually administered in batteries (groups of related tests) and assess such areas as intelligence, memory, language skills, and spatial skills, among others. Based on the results of such batteries, neuropsychologists, physicians, psychiatrists, rehabilitation specialists, and others draw conclusions regarding the status of the person being tested, often attributing test outcomes to physical changes in neurological status, such as brain damage, stroke, dementia, and other neurological diseases.

Neuropsychological testing or evaluation is the assessment of how the brain is working and how it is related to behavior. Phillip Harvey from the University of Miami School of Medicine identified six different uses for neuropsychological assessment:

1. Diagnosing certain types of neurological disorders, such as dementia or stroke

2. Distinguishing between conditions, such as different types of memory loss as opposed to dementia

3. Assessing functional potential or what the individual may be potentially able to do

4. Assessing the course of any degenerative conditions

5. Measuring the speed of recovery

6. Measuring the speed of response to treatment

Some of the conditions for which neuropsychological testing is used (given the six general purposes listed earlier) are

• attention-deficit hyperactivity disorder;

• epilepsy;

• autism spectrum, including Asperger’s syndrome;

• learning disabilities;

• dyslexia;

• concussion;

• brain injury;

• dementia; and

• seizures.

What’s different about neuropsychological testing compared with other kinds of tests that evaluate behavior? First, neuropsychological tests are always standardized, which means they are given in the same way to all people being tested and are scored in the same way as well. Second, the subject’s performance is compared with that of healthy individuals from the same social, economic, and demographic background (including race and gender). Performance can also be compared across time: How did the subject do at age 50 compared to now? Third, no one score from one test in the battery of tests is the definitive marker for neuropsychological health. Rather, the psychologist uses many different tests and scores to reach a conclusion. The “profile,” multiple scores measuring different things, is interpreted. And special statistical methods can be used to identify common patterns in these profiles that might be associated with certain types of injuries or illnesses.
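One common way to compare a person’s scores with those of a healthy comparison group is to convert each score to a z score (the number of standard deviations above or below the comparison group’s mean) and then look at the whole profile. The Python sketch below illustrates that idea only. The area names, normative means and standard deviations, the person’s scores, and the cutoff of -1.5 are all invented for this example and do not come from any real test manual.

```python
def z_score(score, norm_mean, norm_sd):
    """Standard deviations above (+) or below (-) the comparison group's mean."""
    return (score - norm_mean) / norm_sd

def build_profile(scores, norms):
    """Convert every score in a battery to a z score, rounded to 2 decimals."""
    return {area: round(z_score(scores[area], *norms[area]), 2) for area in scores}

def flag_low_areas(profile, cutoff=-1.5):
    """List areas at or below a (hypothetical) cutoff for closer attention."""
    return [area for area, z in profile.items() if z <= cutoff]

# Hypothetical norms as (mean, standard deviation) for a matched healthy group.
norms = {"memory": (50.0, 10.0), "language": (100.0, 15.0), "processing_speed": (30.0, 5.0)}
# Hypothetical scores for one person on the same three measures.
scores = {"memory": 35.0, "language": 103.0, "processing_speed": 20.0}

profile = build_profile(scores, norms)
print(profile)
print(flag_low_areas(profile))
```

Note that the sketch flags a pattern across several scores rather than relying on any single number, which mirrors the point that no one score is the definitive marker.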

NOT JUST ONE: THE FOCUS OF NEUROPSYCHOLOGICAL TESTING

Neuropsychological testing takes place one test at a time, but the clinician or the person interpreting the scores considers each group of tests to be a battery, and it’s through careful consideration of the battery of scores that they can reach a conclusion. Here is a bit more detail about each of these categories and some of the tests that are used.

Intelligence

While in Chapter 9 you learned about the assessment of intelligence within educational and other settings, for the neuropsychologist, the assessment of intelligence is for comparison’s sake to see how an individual with, for example, a traumatic brain injury might score differently from what they did in the past before their injury. Some tests of intelligence that a neuropsychologist might use are the entire set of Wechsler tests, such as the Wechsler Adult Intelligence Scale (WAIS) and the Wechsler Test of Adult Reading (WTAR), and the National Adult Reading Test (NART). For example, the WAIS has an extensive history dating back to the early exploration of the definition of intelligence and its assessment in 1939. The most current version, published in 2008, provides four index scores representing measures of verbal comprehension, perceptual reasoning, working memory, and processing speed. From performance on these, two general indexes of intelligence can be derived, including a full-scale intelligence quotient and a general ability index. And, as you might expect, working memory and processing speed especially are good indicators of how well the brain is functioning.

Why So Many Tests?

It’s really important to respect how complex the human brain is and how multifaceted and multidimensional its operation. The only way scientists can get a good reading as to what’s going on and obtain the kind of information essential to understanding neuropsychological deficits or conditions is through assessing many different dimensions of functioning. Some of these dimensions overlap and may be correlated with one another, but all yield important information that contributes to a better understanding of such debilitating conditions as traumatic brain injury, Alzheimer’s disease and dementia, and aphasia, a language disorder that is often the result of a stroke.

Memory

While there are many different types of memory (such as long-term memory and short-term memory), memory is viewed as a broad collection of cognitive operations. Some tests of memory that a neuropsychologist might use are the Memory Assessment Scales, the California Verbal Learning Test, and the Test of Memory and Learning. For example, part of the California Verbal Learning Test has the person administering the test read a list of 16 nouns in a fixed order at 1-second intervals, and after each trial, the person being tested is asked to recall as many words as possible with no regard to order. The words that are used on the test are from one of four categories: tools, fruits, clothing, and spices and herbs. Then what is called an interference list is presented, and recall of the original list is tested again after a short delay and after a long delay.
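Scoring recall “with no regard to order” amounts to checking which recalled words belong to the target list, wherever they appear. Here is a minimal Python sketch of that idea. It is not the scoring procedure of any published test: the four-word list is a shortened, invented stand-in (one word per category named above), and the function name and the way repeats and off-list words are handled are our own assumptions.

```python
def score_free_recall(target_words, recalled_words):
    """Score free recall without regard to order.

    Returns (number of distinct target words recalled, list of off-list words).
    Repeating a correct word does not earn extra credit.
    """
    targets = {word.lower() for word in target_words}
    correct = {word.lower() for word in recalled_words} & targets
    off_list = [word for word in recalled_words if word.lower() not in targets]
    return len(correct), off_list

# Invented stand-in list: one tool, one fruit, one clothing item, one spice.
target_list = ["hammer", "grapes", "jacket", "paprika"]
# The person recalls words in a different order, repeats one, and adds one
# word that was never on the list.
recalled = ["paprika", "hammer", "hammer", "spoon"]

print(score_free_recall(target_list, recalled))
```

Using sets makes the order of recall irrelevant, which is the property the description emphasizes.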

Language

This is such a basic function of human behavior and a critical one to evaluate when considering any damage to the brain and related neurological systems. Some tests of language that a neuropsychologist might use are the Multilingual Aphasia Test and the Boston Naming Test. For example, the Boston Naming Test, first published in 1983, consists of 60 items, each a picture. The examinee (the person being tested) is given about 20 seconds to identify each picture.

Executive Function

Executive function is a general term that includes problem solving, organizational skills, planning, and other cognitive processes. For example, the Figural Fluency Test, the Wisconsin Card Sorting Test (or WCST), and the Stroop Test all assess this broad category. The Stroop Test was introduced as the Stroop task or Stroop effect in 1935 by John Stroop (makes sense!). While there are many different versions of the Stroop Test, they tend to have key things in common. A written color name differs from the color ink it is printed in, and the participant must say the written word. In the second trial, the participant must name the ink color instead. This book is not printed in a sufficient number of colors for you to see what the test looks like, but you can see it online at https://faculty.washington.edu/chudler/java/ready.html.

Visuospatial Ability

Visuospatial ability tests combine the areas of visual perception and visual integration. Some tests of visuospatial ability are the Clock-Drawing Test, the Hooper Visual Organization Test (VOT), and the Rey–Osterrieth Complex Figure Test (ROCFT). For example, the ROCFT is based on the 1941 work of the Swiss psychologist André Rey. The person taking the test is asked to reproduce a complicated line drawing, first by copying it freehand (recognition) and then by drawing from memory (recall).

FORENSIC ASSESSMENT: THE TRUTH, THE WHOLE TRUTH, AND NOTHING BUT THE TRUTH

Forensic psychology is a branch of psychology that explores the intersection of neuropsychology and the justice system (and much of it revolves around testing in the area of neuropsychological assessment, as discussed earlier). Once again, the focus is on the brain and behavior, but the circumstances and the setting are different. Forensic assessment is the assessment of this relationship between the mind and criminal behavior and is used to examine important variables in everything from understanding the role of expert testimony to the declaration of someone not being competent to stand trial to child custody issues. And in particular (and particularly relevant to this chapter), forensic neuropsychologists (sometimes called rehabilitative psychologists, who also depend a great deal on measures of personality and brain function) often appear as expert witnesses in court to discuss cases that involve brain damage.

What Forensic Assessment Does

Forensic assessment takes into account a very wide and diverse set of tasks and topics. For example, here’s a partial list of what a forensic assessment expert might deal with in the course of their daily work:

• Assessing the status of an individual to determine if they are competent to participate in their own defense in a criminal trial

• Creating a written agreement with the court on the purpose and scope of testing to take place

• Assisting the court in deciding whether a claim (such as the result of injury from an accident) is valid

• Educating members of the court and legal system, including attorneys, judges, and police officers, as to the conditions under which individuals might need therapeutic help

• Evaluating an individual who is being considered for institutionalization

• Examining the circumstances surrounding child custody issues and making recommendations to the court

So you have some idea what the forensic psychologist might do in their role in measuring and assessing outcomes. But how might that differ from the therapeutic or counseling side of assessment?

When an individual consults with a psychologist to treat conditions such as a personality disorder or depression, we are talking about the psychologist’s role in therapeutic assessment. Here, assessment focuses on what the problems might be and on recommending a course of treatment in an effort to address those problems. On the other hand, forensic assessment happens when an attorney or the court requests it, and forensic assessment techniques are used to determine the facts surrounding the case. In fact, within forensic assessment the client is often the court or the legal system. So in effect, one (the therapeutic assessment) is initiated by the client or patient with their agreement while the other (the forensic assessment) is initiated by the legal system, with or without the individual’s agreement.

VALIDITY AND RELIABILITY OF PERSONALITY TESTS

Validity is a unitary concept, and all types of validity arguments and evidence can come into play in determining validity of a measure, but certain validity approaches seem particularly important for different tests with different purposes. Validity and reliability requirements for personality tests depend on whether they are objective tests or projective tests. Objective tests are similar to achievement tests and require good evidence of content-based validity, and sometimes that is enough. For instance, tests that are made up of items that distinguish between groups often don’t have a strong theoretical framework from which questions are derived. They just use whatever works. If a love of sailboats is more common among schizophrenics, then a sailboat question is on the test. Usually though, as personality traits are often well-defined constructs, attention is paid to demonstrating construct-based validity.

Validity arguments for projective tests almost always are connected to the hypothesized mechanisms by which they supposedly work: there is an active, dynamic subconscious that affects our conscious thoughts and behaviors and can be revealed by interpreting images or doing word association. If this mechanism isn’t a real part of human thinking, then projective tests cannot be valid. Reliability is another concern for projective tests, because the scoring is subjective. Subjective scoring requires human judgment, and humans can disagree, make unpredictable errors, and act inconsistently. That tends to result in low interrater reliability.
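Interrater reliability can be given a number. Two simple options are percent agreement (how often two scorers give the same rating) and Cohen’s kappa (agreement corrected for the agreement expected by chance). The Python sketch below shows both under stated assumptions: the two raters and their ratings of eight responses are invented, and the function names are ours. It is meant as an illustration of the arithmetic, not a recommendation of one index over another.

```python
from collections import Counter

def percent_agreement(rater_1, rater_2):
    """Proportion of responses that both raters scored the same way."""
    return sum(a == b for a, b in zip(rater_1, rater_2)) / len(rater_1)

def cohens_kappa(rater_1, rater_2):
    """Agreement between two raters, corrected for chance agreement."""
    n = len(rater_1)
    observed = percent_agreement(rater_1, rater_2)
    counts_1, counts_2 = Counter(rater_1), Counter(rater_2)
    # Chance agreement: probability both raters pick the same category
    # if each rated independently at their own base rates.
    expected = sum(counts_1[c] * counts_2[c]
                   for c in set(rater_1) | set(rater_2)) / (n * n)
    if expected == 1:
        return 1.0  # both raters used a single, identical category
    return (observed - expected) / (1 - expected)

# Invented ratings of eight inkblot responses by two hypothetical scorers.
rater_a = ["anxious", "calm", "anxious", "calm", "anxious", "calm", "anxious", "calm"]
rater_b = ["anxious", "calm", "calm", "calm", "anxious", "anxious", "anxious", "calm"]

print(percent_agreement(rater_a, rater_b))
print(cohens_kappa(rater_a, rater_b))
```

In this toy example the raters agree on 6 of 8 responses, but because half of that agreement would be expected by chance, kappa comes out lower than the raw percentage.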

Summary

Personality is a fascinating dimension of human development. The assessment tools that test developers have created over the past 100 years reflect the ideas and theories that underlie various explanations of individual differences in personality. And even though personality testing is not as precise as we might like (interpretation goes a very long way), these kinds of tests are being used more often than ever in our society for everything from placement in a new job situation to finding a lifelong partner.

Time to Practice

1. Why do you think that personality is such a fascinating concept to study and assessing it is so very difficult?

2. Name one way to develop a personality test, and provide an example of how you would do it.

3. What is the primary difference between an objective and a projective personality test, and why would you use one over the other?

4. Go to the library and find a journal article that reports an empirical study on some aspect of personality. What test did the authors use? What variables were they interested in studying? Were the reliability and validity data about the personality test they used reported? Do you think that the results reflected an accurate picture of what role personality plays in what was being studied?

5. Have some fun. Write two to three sentences for each of five friends, describing their personalities. Try to do them as independently as you can, without a description of one friend interfering with a description of another friend. Now see if you can draw some similarities and contrasts in your descriptions. What do they have in common? How are they different? Good! You’ve started creating a taxonomy of important variables to consider in describing personality, which is the first step on the way to creating a test of personality.

6. Develop three categories you would like to see added to the 16PF, being sure to include the title and the two “poles.”

7. What personality test might you administer (once you get the proper training and credentials!) to a 6-year-old presenting with multiple concerns?

8. The developers of several personality tests recommend avoiding making definitive interpretations based on one test alone. Why do you think the developers make this recommendation? What other information could you use to add clarity to your conclusions from a personality test?

9. What is one primary purpose of neuropsychological testing?

10. What’s one major and significant difference between the administration of neuropsychological tests and other tests?

11. What’s a primary difference between the use of forensic testing and other types of testing?

12. Ripped from the headlines! Online, in print, or even through a visit to your local courthouse, find a case where an expert testified at a trial. See if you can answer the following questions:

a. What is the expert’s area of expertise?

b. Why was the expert testifying?

c. In your opinion, how did the expert’s testimony impact the outcome of the trial or hearing?

d. If available, what assessment tools were used to evaluate the individuals involved in the court action?

Want to Know More?

Further Readings

• Panayiotou, G., Kokkinos, C. M., & Spanoudis, G. (2004). Searching for the “Big Five” in a Greek context: The NEO FFI under the microscope. Personality and Individual Differences, 36, 1841–1854.

This study evaluates the psychometric properties and factor structure of the Greek FFI (based on the NEO we talked about in this chapter) and provides normative information for its use with Greek populations. It’s an interesting application of a test developed in one culture and then used within another.

• Varela, J. G., Boccaccini, M. T., Scogin, F., Stump, J., & Caputo, A. (2004). Personality testing in law enforcement employment settings: A meta-analytic review. Criminal Justice & Behavior, 31, 649–675.

How well can personality measures predict the performance of law enforcement officers? That’s one purpose of this study, which used meta-analysis as an analytic tool, and the researchers found a statistically significant relationship between personality test scores and officer performance.

And on Some Interesting Websites

• The Clifton StrengthsFinder™ at http://strengths.gallup.com/110440/AboutStrengthsFinder-20.aspx is a web-based assessment tool published by the Gallup organization (yep, the poll people) to help people better understand their talents and strengths by measuring the presence of distinct themes of talent. You might want to take it and explore these themes. On this site, you can watch a video all about it.

• Just for fun (really), complete the Carl Jung and Isabel Briggs Myers typology at http://www.humanmetrics.com/cgi-win/JTypes2.asp. Though the Myers-Briggs assessment has questionable validity for uses other than its narrow intended purposes, it may provide insight into your ways of thinking.

And in the Real Testing World

Real World 1

Nowadays, you can take a personality test on your iPad. Good idea? Maybe, maybe not. But the comparison between electronic and paper-and-pencil tests is inevitable, and here’s a study that explores that very question. The results suggest that there is comparability of scores for many personality constructs. This study was done in the 2000s, and now computer-based assessments are more common than paper and pencil!

Want to know more? Meade, A. W., Michels, L. C., & Lautenschlager, G. J. (2007). Are internet and paper-and-pencil personality tests truly comparable? An experimental design measurement invariance study. Organizational Research Methods, 10(2), 322–345.

Real World 2

A perfect example of using neuropsychological testing to distinguish between conditions concerns frontotemporal dementia (FTD), which is one of the most common forms of early-onset dementia. It is often mistaken for psychiatric or neurological diseases, causing a delay in correct diagnosis. While genetic components are established risk factors for FTD, the influence of lifestyle, comorbidity, and environmental factors is still unclear, hence the need for comprehensive use of tests of neuropsychological abilities.

Want to know more? Rosness, T. A., Engedal, K., & Chemali, Z. (2016). Frontotemporal dementia: An updated clinician’s guide. Journal of Geriatric Psychiatry and Neurology, 29, 271–280.

Real World 3

This study examined a number of neuropsychological tests for use with sub-Saharan African school-age children. The tests assessed skills similar to those measured by assessments of cognitive development published for use in Western contexts. Culturally appropriate adaptations and their results were found to be moderately reliable, with scores on individual tests related to various background factors at the level of the child, household, and neighborhood. School experience was the most consistent predictor of outcomes, and the researchers conclude that salient background characteristics should be taken into account when measuring the discrete effects of disease exposure in similar sociocultural and economic settings.

Want to know more? Kitsao-Wekulo, P., Holding, P., Taylor, H. G., Abubakar, A., & Connolly, K. (2013). Evaluating contributions to variability in test performance. Assessment, 20, 776–784.

11 CAREER CHOICES
Have We Got a Job for You!

Difficulty Index ☺ ☺ ☺ ☺ (moderately easy)

LEARNING OBJECTIVES

After reading this chapter, you should be able to

• Explain the purpose and approach of career development tests.
• Describe the format and history of the Strong Interest Inventory.
• Describe “Holland codes” and how they work.
• Discuss the limitations of career development tests.
• Compare and contrast several major career development tests.
• Apply the concepts of validity and reliability to career development tests.

What’s the one question that grown-ups ask all the kids at every family gathering?

So, what do you want to be when you grow up?

Some of these fine young people have an idea; most of them don’t. (Bruce once asked his 4-year-old cousin this question, and he answered, “10.”) But even those who do will be changing their minds many times before they settle on their first of several jobs or professions. And most interesting, it’s not that much different for adults. Although many have a career for which they have trained years and years (such as physician or lawyer), they, too, may change their employment focus and select another vocation depending on a variety of factors, including compensation, benefits, location, and the associated lifestyle that comes with this or that job. Counseling psychologists, in fact, will tell you that most people change their careers several times in a lifetime and happenstance or chance is the main reason why.

So any way we look at it, from anyone’s vantage point, young or old, male or female, white-collar worker or blue-collar worker, many, many more people change jobs throughout their employment career than keep the same one. And with increased opportunities for technical and higher education, this is happening more than ever before.

This chapter is about how tests of vocational choice or career development can help people make decisions about jobs that are more compatible with their likes and dislikes and their aspirations and goals. The end result may very well be more people who are working in an area they like and enjoy and are therefore happier. Who could argue with that?

A Rose by Any Other Name. All the information in this chapter falls under a very general category of vocational testing or assessment, but you might also see it called career development or employment evaluation. Whatever term is used, the theme is the same: helping people better understand how their interests and personalities might best be matched up with a vocation or job.

WHAT CAREER DEVELOPMENT TESTS DO

It’s no surprise that career development is as much about what a person is as it is about what that person wants to do. Years of research in this area have verified that one’s experience, values, and personality characteristics have a great deal to do with the choices one makes regarding a career, and those factors have to be taken into account. In fact, what vocational psychologists or career development psychologists would like to think is that they examine not only what people’s interests are but also what would provide individuals with a high level of job and (perhaps) life satisfaction.

So it is not just that an individual takes the Strong Interest Inventory or the Campbell Interest and Skill Survey, and the test results come back and indicate that a person is best suited to be a mechanic, a florist, or a physicist. Rather, the test scores reveal something about that person’s interests and where they might best fit. But in conjunction with those test scores, there’s another very important set of factors (including personality, values, and desires) that enters into the equation, and that’s where a terrific career counselor can help. To quote a leading vocational psychologist, Tom Kreishok, “The last thing about career counseling is the career; it’s all about the person.” Well said.

LET’S GET STARTED: THE STRONG INTEREST INVENTORY

One of the best examples of a vocational interest or career planning test is the Strong Interest Inventory (SII), which is a direct outgrowth of the Strong Vocational Interest Blank (SVIB). Edward Strong was very clever in his design and development of the original SVIB in 1928, and it all started with his work in trying to place military personnel in suitable jobs. He thought that to get a good reflection of one’s vocational interests, questions should be asked about how much the test taker likes or dislikes various occupations, areas of study, personality types, and leisure-time activities. Then he found out how people who are currently in certain occupational positions would have answered these very same questions. By empirically keying those answers (from folks already working) with the responses of the test takers (the folks who may work in this or that field), he could pretty well look for a fit between people and occupations. (This sounds like the “distinguishing between groups” strategy for developing personality tests we talked about in Chapter 10, doesn’t it?) Is it perfect? No. Does it work pretty well? Yes.

Strong’s early work is based on the assumption that the most successful people in an occupation have a certain profile of characteristics that reflects what it takes to be successful. If a test taker identifies with that profile, the likelihood is pretty good that they will succeed in that profession as well.

Last revised in 2012, the current edition of the SII consists of 291 items that align on 30 basic interest scales organized into six different sections. The first five sections of the test ask test takers to indicate whether they like, dislike, or are indifferent to certain topics, occupations, or human characteristics. So this part of the test might look something like this:

                    Like    Dislike    Indifferent
Assertive people
Mathematics
Athletes
Psychology
Shy people
Fixing things
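To make the idea of empirical keying concrete, here is a deliberately simplified sketch. It builds a "key" from the most common answer that people already working in an occupation give to each item, and then counts how often a test taker gives the keyed answer. The occupation, the answers, and the function names are all hypothetical, and the actual scoring of the Strong is more elaborate than this simple match count.

```python
from collections import Counter


def build_occupation_key(incumbent_answers):
    """Most common answer to each item among people already in the occupation."""
    key = {}
    for item in incumbent_answers[0]:
        tally = Counter(person[item] for person in incumbent_answers)
        key[item] = tally.most_common(1)[0][0]
    return key


def match_score(test_taker_answers, key):
    """Proportion of items on which the test taker gives the keyed answer."""
    matches = sum(
        1 for item, keyed in key.items() if test_taker_answers.get(item) == keyed
    )
    return matches / len(key)


# Hypothetical answers (Like, Dislike, Indifferent) from three working carpenters.
carpenters = [
    {"Fixing things": "Like", "Mathematics": "Indifferent", "Shy people": "Indifferent"},
    {"Fixing things": "Like", "Mathematics": "Like", "Shy people": "Indifferent"},
    {"Fixing things": "Like", "Mathematics": "Indifferent", "Shy people": "Dislike"},
]
carpenter_key = build_occupation_key(carpenters)

# A hypothetical test taker who may or may not fit the carpenter profile.
test_taker = {"Fixing things": "Like", "Mathematics": "Dislike", "Shy people": "Indifferent"}
print(carpenter_key)
print(match_score(test_taker, carpenter_key))
```

The higher the proportion, the more the test taker answers like people already in that occupation, which is the "fit" that empirical keying looks for.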


Once the test is scored (and this is usually done by the publisher, although a newer product, the iStartStrong Report, allows scoring and interpretation by the test taker, which is nice!), a long and detailed printout of the test taker’s score and profile is available. Scores are assigned to each of six general occupational themes, and much to the credit of the test publisher (Consulting Psychologists Press), these themes have kept up with the times and the latest revision includes many more items that reflect technology and changes in the business world. The six general occupational themes are as follows:

• Realistic. These people prefer activities that are practical and hands-on and that result in tangible results, such as building, repairing, or fixing. Example vocations could be carpenter, roofer, builder, and carpet installer.
• Investigative. These people prefer to solve problems that involve science and engineering, and they enjoy challenging situations that demand intellectual activity. Example vocations could be chemist, biologist, engineer, and forensic scientist.
• Artistic. These people prefer activities that include the self-expression of ideas. Examples could include a musician or a filmmaker and activities such as creating or enjoying art.
• Social. These people like to help others and find themselves attracted to situations where social causes are important. Example vocations could include teachers, social workers, and psychologists.
• Enterprising. These people prefer activities that include self-management and leadership, in addition to selling. Example vocations could be sales, management, and advertising.
• Conventional. These people like activities where structure and order are a part of the work world. Example vocations could include data analysis, secretaries, and accountants.

The results of the test identify your highs and lows on the six descriptors, some of the suggested professions in your preferred areas, and the knowledge and skills required, as well as the tasks relevant to those professions. For example, part of a report output would contain something like this:

Occupation: Teacher
Knowledge or Skills: Child development, planning activities, methods of designing curricula, empathy
Tasks: Teaching students, communicating with parents, working collaboratively with other teachers


The test taker’s scores suggest both the environment in which a person might work and the sorts of things they might enjoy doing. Then it’s just a matter of the scoring system matching up the test taker’s profile with the extensive database of profiles for different occupations the publisher has on hand. As you might expect, these profiles change with each revision of the test. After all, 30 years ago, computer programming was not as prominent a profession as it is today. And there were more VCR repair specialists (whatever those were) than “influencers” (whatever they are).

JOHN HOLLAND AND THE SELF-DIRECTED SEARCH

The six occupational themes used by the Strong Interest Inventory were developed by John Holland, when he revised the test in the 1970s, and they turn up again and again in the literature on vocational testing, and for good reason. They are a very clearly thought out, empirically validated foundation for assessing important clues to career development preferences. Holland’s model is reflected in another career aptitude measure, the Self-Directed Search (or SDS), which shows how you can even do this exploration all on your own, as it is self-administered, self-scored, and self-interpreted, really taking this important area of assessment directly to you! And you can now take the test online and even through a mobile application. Yippee! (And another reason there are now fewer “guidance counselor” jobs than, say, Uber drivers in this changing world.) While waiting at the college admissions office, you can take the SDS on your phone and know what you want to do by the time you get to the front of the line.

In Figure 11.1, you can see how the different vocational themes can be organized to form a hexagon. These themes (also called work personalities) are not arranged in a willy-nilly fashion, with any theme going anywhere. Rather, they are organized so that adjacent themes are similar to each other and opposite themes are dissimilar to each other. For example, in Holland’s model, social and artistic occupations seem to have several things in common; that’s why they are adjacent to each other. People who select a particular vocational track within either of these themes (such as filmmaker, in the artistic theme) might be comfortable in a vocation pointing toward the other as well (such as teacher, in the social theme). Well, once Holland developed these themes, he could find individuals in particular vocations, administer his test, and find out the pattern of prominent themes for that vocation.
FIGURE 11.1 Holland’s Six Vocational Themes and How They Work Together
[Figure: a hexagon whose six corners are labeled Realistic, Investigative, Artistic, Social, Enterprising, and Conventional.]

FIGURE 11.2 Holland Code Career Model
[Figure: a six-sided layout with career areas at the corners, namely Technical (Realistic), Science (Investigative), Arts (Artistic), Social Service (Social), Business Contact (Enterprising), and Business Operations (Conventional), arranged around the dimensions Data, Ideas, People, and Things.]

What the test produces is a three-character code (called a Holland code) that characterizes an individual’s strengths and interests relative to the descriptions within Holland’s six different themes. And you can place yourself within an imaginary six-sided space defined by those themes like the one shown in Figure 11.2. For example, with a code of RES, you would most resemble the realistic type, less resemble the enterprising type, and even less resemble the social type. If your code does not include a particular type, then it means you resemble this type of person (who would be happy in a particular profession) least of all. Holland (and his coauthors) compiled a dictionary of these codes, the Dictionary of Holland Occupational Codes, with close to 13,000 occupations, from helicopter pilots (RIS) to blood donor recruiters (ESC) represented. Here’s just a sample of the selection of codes and the accompanying, suggested occupations.

Occupation                            First Theme     Second Theme    Third Theme
Director of a day care center (ESC)   Enterprising    Social          Conventional
Medical assistant (SCR)               Social          Conventional    Realistic
Director of nursing (SCE)             Social          Conventional    Enterprising
Tax preparer (CES)                    Conventional    Enterprising    Social
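The step from six theme scores to a three-letter code can be sketched in a few lines. This is a minimal illustration that assumes the code is simply the three highest-scoring themes in descending order; the scores and the function name are invented for the example, and the published SDS scoring instructions should be treated as the authority on details such as ties.

```python
THEME_NAMES = {
    "R": "Realistic",
    "I": "Investigative",
    "A": "Artistic",
    "S": "Social",
    "E": "Enterprising",
    "C": "Conventional",
}


def holland_code(theme_scores, length=3):
    """Letters of the highest-scoring themes, strongest first."""
    ranked = sorted(theme_scores, key=lambda letter: theme_scores[letter], reverse=True)
    return "".join(ranked[:length])


# Hypothetical summary scores for one test taker on the six themes.
scores = {"R": 30, "I": 12, "A": 8, "S": 20, "E": 25, "C": 10}
code = holland_code(scores)
print(code, [THEME_NAMES[letter] for letter in code])
```

With these made-up scores, the realistic theme is highest, followed by enterprising and then social, which gives the RES code used as the example in the text.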

A good way to test Holland’s model would be to see what the correlations (see Chapter 3) are between adjacent and opposite occupational themes, right? Consider it done. Lenore Harmon and her colleagues did just that and found that, indeed, the data support the theory. For example, there is a correlation between scores on the adjacent social and enterprising occupation themes (r = .42) and the adjacent themes of realistic and investigative (r = .52), but practically none between the scores on the opposite occupational themes of artistic and conventional (r = –.04) or the themes of realistic and social (r = .06).

Want to find out all there is to know about jobs, what they pay, who’s needed, and more? Visit the U.S. Department of Labor’s website at http://www.dol.gov/ and view the info about occupations at the Bureau of Labor Statistics at https://www.bls.gov/oes//, where you can find out such neat things as the prospects for welders, which are just average (there will be a 3% growth in welding positions), with the current median salary at about $44,000 per year. If you’re interested in being a physician, your salary can be high (about $200,000 per year on average), and the demand is high as well (especially for family medicine and primary care). There is also an extensive research library on this site, as well as links to state and other federal agencies that deal with labor and work issues.
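The adjacent-versus-opposite logic of the hexagon can be checked against the four correlations reported above. The sketch assumes the themes sit around the hexagon in the customary R-I-A-S-E-C order, which is consistent with the adjacent and opposite pairs named in the text; the function name is made up for this illustration.

```python
HEXAGON_ORDER = "RIASEC"  # assumed order of the themes around the hexagon


def hexagon_distance(theme_a, theme_b):
    """Steps between two themes around the hexagon: 1 = adjacent, 3 = opposite."""
    gap = abs(HEXAGON_ORDER.index(theme_a) - HEXAGON_ORDER.index(theme_b))
    return min(gap, len(HEXAGON_ORDER) - gap)


# Correlations between theme scores as reported in the text.
reported_correlations = {
    ("S", "E"): 0.42,
    ("R", "I"): 0.52,
    ("A", "C"): -0.04,
    ("R", "S"): 0.06,
}

for (theme_a, theme_b), r in reported_correlations.items():
    steps = hexagon_distance(theme_a, theme_b)
    print(f"{theme_a}-{theme_b}: {steps} step(s) apart, r = {r:+.2f}")
```

If the model holds, pairs that are one step apart should show clearly higher correlations than pairs that are three steps apart, and that is the pattern in the reported values.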

SOME MAJOR CAVEATS: CAREER COUNSELING 101

Here comes the buyer-beware commercial. Richard Nelson Bolles wrote the universe’s best guide to career choices, how to make them, and so on. It’s called What Color Is Your Parachute? (published by Ten Speed Press), and it is in its bazillionth edition. You should get a copy right away, no matter where you are in your career path. This fellow is the guru of vocational counseling and has five interesting and important rules about taking any vocational test.

First, not all tests are right for all people. Some people, in fact, hate tests altogether. So the test itself has to be the kind in which the test taker is willing to invest the time and energy for the results to be useful.

Second, no test is the test for everyone. Some tests, because the format is better liked, perhaps, may provide different results than other tests (even when the same person takes both tests). It’s a matter of everything, including the type of test, the items, the format, and more. Career development isn’t so finite a process that one test, no matter how good it is, will provide all the answers. So should you take more than one vocational test? Sure. If you have the time and money, what could it hurt? High school students often take these tests and get to go over the results with school counselors, and that’s a great added resource for figuring things out.

Third, Bolles believes what the great psychoanalyst Sigmund Freud said as well: When it comes to important decisions, trust your emotions. In Bolles’s case, he says to trust your intuition. No test will provide a definitive direction in which to go, only suggestions that you can incorporate into your own wishes, desires, and life circumstances. Those same researchers who say that chance plays the biggest role in what you’ll end up doing also suggest that it may be the environment and culture around an occupation, not the actual tasks, that is most satisfying for those happy in their jobs, like the seven gold miners in Snow White who “whistle while they work.” (Though finding gold is certainly a satisfying task, I would think.)

Fourth, you’re the (only) one, and there’s no one like you. Even if you are an identical twin, your fingerprints are unique (bet you didn’t know that). So when it comes to test scores, you may not be the only one to get that score, but that’s the only score you get. Treat your unique qualities as important factors in consideration of the test profile that results from the testing.

Finally, test-schmest: it’s only a score. There’s a lot more to deciding on a career path or choosing one track over another than one score. The more you know about yourself, the more useful the test outcomes will be.
Just the thought of what profession is right for you probably creates some excitement but some anxiety as well. Putting your trust in a test to help you determine that might be hard to do, but remember, as Bolles says, that a test score is only one of many data points. Use all your resources to help you investigate and explore career opportunities.

FIVE CAREER TESTS

Career and vocational tests really can help people identify themes in their work aspirations. These tests are not magic; they cannot identify exactly what people want to do professionally, but they can help. Look at the five tests in Table 11.1 and see how inclusive they are of so many different work themes and ideas that may be related to vocations. And look at how different the audiences are and the purposes of the tests, from the very often used Strong to the less often used but interesting Armed Services-Civilian Vocational Interest Survey (often given to high school students).

TABLE 11.1 Five Widely Used Career Development Tests

Title/Acronym (or What It’s Often Called): Strong Interest Inventory (Strong or SII)
Purpose and What It Tests: The Strong’s purpose is to “identify general areas of interest as well as specific activities and occupations,” and it provides scores on six general occupational themes (realistic, investigative, artistic, social, enterprising, and conventional) plus interest, occupational, and personal style scales.
Grade Levels/Ages Tested: Ages 16 through adult
Conceptual Framework: Based on Holland’s occupational themes
What’s Interesting to Note:
1. Men’s form published and used since 1927; a women’s form followed in 1933.
2. Uses a Personal Style Scale to get some indication of leadership style and ability to work with people

Title/Acronym (or What It’s Often Called): Self-Directed Search (SDS)
Purpose and What It Tests: The SDS is a vocational inventory designed to identify “a person’s particular activities, competencies, and self-estimates compared with various occupational groups.”
Grade Levels/Ages Tested: Ages 12 through adult
Conceptual Framework: The SDS is based on Holland’s personality typology as it applies to vocational preferences. Here, people’s vocational interests can be classified into the six categories of realistic, investigative, artistic, social, enterprising, and conventional.
What’s Interesting to Note:
1. Considers personality traits in decisions about how vocational choices are made
2. Loads of translations, including Chinese, Finnish, French, and Greek
3. Braille editions available

Title/Acronym (or What It’s Often Called): Kuder Occupational Interest Survey (KOIS)
Purpose and What It Tests: The KOIS assesses “promising occupations and college majors in rank order, based on examinee’s interest pattern” and yields information on four scales: dependability, vocational interest estimates, occupations, and college majors.
Grade Levels/Ages Tested: Grade 10 through adult
Conceptual Framework: The KOIS measures interest in careers that require advanced technical or college training and is based on Holland’s model of six personality factors. The belief is that personality factors are related to job interest.
What’s Interesting to Note:
1. Test takers are presented with 100 triads of activities and asked to select the activity they like the most and the activity they like the least.
2. An audiotape is available to help users interpret results of the KOIS.
3. There are (yikes!) 119 occupational scales, 48 college major scales, plus the 10 traditional Kuder generic interest scales.

Title/Acronym (or What It’s Often Called): Armed Services-Civilian Vocational Interest Survey (ASVIS)
Purpose and What It Tests: The test taker organizes interests and makes career decisions about military and civilian jobs in the following eight occupational groups: administrative-clerical-personnel, communications, computer and data-processing, construction-engineering-craft, mechanical-repairer-machining, service and transportation, health and health care, scientific-technical-electronic.
Grade Levels/Ages Tested: High school through adult
Conceptual Framework: The rationale for the test design is that the objectives are based on the predictive validity of interests in future job satisfaction in those jobs.
What’s Interesting to Note:
1. Self-administering, self-assessing, and self-scoring
2. Used to help explore the overlap between civilian and military jobs

Title/Acronym (or What It’s Often Called): Campbell Interest and Skill Survey (CISS)
Purpose and What It Tests: The CISS “measures self-reported interests and skills” yielding 99 scores in the categories of orientation (such as organizing and helping), basic (such as leadership and sales), occupational (such as hotel manager and ski instructor), and special and procedural scales.
Grade Levels/Ages Tested: Ages 15 years through adult
Conceptual Framework: Primarily, the CISS focuses on the importance of self-reported interests and self-reported skills, both thought to be crucial in an individual’s career planning.
What’s Interesting to Note:
1. Contains 200 interest items (test takers rate their interest) and 120 skill items (test takers rate their skill)
2. The CISS special scales include interest and skill scores for academic focus, extraversion, and the variety of the test taker’s interests and skills.
3. Test results make up a CISS profile, a comprehensive, computer-generated report.


VALIDITY AND RELIABILITY OF CAREER DEVELOPMENT TESTS

Validity is a unitary concept, and all types of validity arguments and evidence can come into play in determining validity of a measure, but certain validity approaches seem particularly important for different tests with different purposes. Most of the major career choice tests are based on conceptual frameworks, such as research-based distinct “types” of jobs, so construct-based validity arguments are important. And, if you think about it, these tests are aptitude tests, right? They are meant to predict happiness or success in a profession, so one would think that predictive criterion-based validity evidence should be available, such as correlations between the test’s advice and the career a person ends up happy in. As for reliability, these instruments, like intelligence tests, supposedly measure a stable set of preferences or skills that shouldn’t change over time. So, test–retest reliability should be high.
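Test–retest reliability is usually estimated as the correlation between scores from two administrations of the same test to the same people. Here is a small sketch of that computation using the Pearson correlation (see Chapter 3); the scores are invented for illustration and the function name is only a label for this example.

```python
from math import sqrt


def pearson_r(first, second):
    """Pearson correlation between two equally long lists of scores."""
    if len(first) != len(second) or len(first) < 2:
        raise ValueError("Need two lists of the same length with at least two scores.")
    n = len(first)
    mean_first = sum(first) / n
    mean_second = sum(second) / n
    covariation = sum((x - mean_first) * (y - mean_second) for x, y in zip(first, second))
    spread_first = sum((x - mean_first) ** 2 for x in first)
    spread_second = sum((y - mean_second) ** 2 for y in second)
    return covariation / sqrt(spread_first * spread_second)


# Hypothetical "social" theme scores for six people, tested twice a year apart.
time_1 = [22, 35, 18, 40, 27, 31]
time_2 = [24, 33, 20, 38, 25, 34]
print(round(pearson_r(time_1, time_2), 2))
```

A coefficient close to 1 would support the claim that the interests being measured are stable over time; a low coefficient would suggest either unstable interests or an unreliable instrument.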

Summary

If you think that human behavior is complicated, you’re right. Just try to figure out what the different factors are that steer one person onto one career path and another person onto a completely different one. It’s tough at best, but very clever vocational psychologists and career counselors recognize that career directions depend on personality, values, interests, and other factors all working in tandem. And these same people have developed a darn good set of tools that provides a reliable and valid assessment (a terrific starting point) of the possibilities one should consider.

Time to Practice

1. This is gonna cost you. Instead of ordering that double latte every day this week, take the $25 you will save and go to http://www.hollandcodes.com/self_directed_search.html#internet and take the Self-Directed Search (online!). It takes only 15 minutes, and you’ll quickly get back an extensive report. And, in all seriousness, we bet your instructor only wants you to do this if you can afford it. College is expensive enough!

2. What is the general rationale a developer of a vocational assessment tool might use, and why do you think it would work?

3. Check off which of the six Holland occupation themes characterizes each of these occupations (choose one or more than one). For more information, you can go online (https://www.hollandcodes.com/dictionary_of_holland_occupational_codes.html) or avail yourself of the physical book in the library, or you may just want to go back and look at our descriptions of the categories, which were presented earlier in this chapter.

Check Off the Occupational Theme

Occupation                         Realistic  Investigative  Artistic  Social  Enterprising  Conventional
Educational programming director
Golf club manager
Paralegal
Lathe operator
Airplane flight attendant

4. Now that you have done Question 3 and know the correct answers, write a one- or two-sentence description as to why a lathe operator (who runs a machine that shapes wood or metal) should be an RCE and a paralegal (a specially trained administrative assistant who works with lawyers) an SEC.

5. Identify 10 people you know pretty well, be they friends, parents, or other relatives, and identify what they do for a living (try to select people other than fellow students). How well do you think their professional activity reflects their personal characteristics? In other words, intuitively, how well do you think Holland’s system works?

For Questions 6 through 8, visit O*NET Online at http://online.onetcenter.org/.

6. Choose “Career Cluster” on the page, and explore three different career clusters. Write down the five jobs that are most appealing to you.

7. For each of the five jobs you wrote down in Question 6, write three tasks, three knowledge points, and three work activities associated with the job.

8. For each of the five jobs you wrote down in Question 6, write down the interest code and work styles of the job. Based on this information, rank each job according to how good a fit you believe it would be for you.

Want to Know More?

Further Readings

• Gottfredson, G. D., & Holland, J. L. (1996). Dictionary of Holland occupational codes (3rd ed.). Lutz, FL: Psychological Assessment Resources.

This is the mom and dad of occupational codes, with thousands and thousands of occupations listed with their corresponding codes. Even if you don’t need to consult it as part of your professional work, it’s fun to glance around and see which themes characterize what professions. Anyone want to be a park naturalist (check the SRI codes)?

• Harmon, L. W., Hansen, J. C., Borgen, F. H., & Hammer, A. L. (1994). Strong Interest Inventory: Applications and technical guide. Stanford, CA: Stanford University Press.

This is all you need to learn all about the Strong Interest Inventory, and a very good place to start if you are interested in career or vocational assessment and counseling.

And on Some Interesting Websites

This site lets you figure out your Holland code yourself for free. Give it a shot! Go to https://careerchoicer.com/holland-codes-riasec-choose-right-career.html.



The National Career Development Association, at http://www.ncda.org/, promotes career development across the life span and does so by providing lots of professional materials, information, and assistance to career development professionals.



The Princeton Review Career Quiz (at http://www.princetonreview.com/career-search) is one of many online tools that can be used to explore career choices.

And in the Real Testing World Real World 1 It’s always interesting to look at very early measures of a particular behavior and relate them to career decisions and choices. These authors examined several variables, among them attachment (a classic construct in developmental psychology) to parental figures early on. The model they constructed was a good fit with the data they collected, but not all aspects of the attachment outcomes were good predictors. Want to know more? Downing, H. M., & Nauta, M. M. (2010). Separation-individuation, exploration, and identity diffusion as mediators of the relationship between attachment and career indecision. Journal of Career Development, 36, 207–227.

Real World 2 It’s really fascinating how new ideas work their way through different disciplines. Neuroscience is one of the biggest ideas to hit the behavioral and social sciences in years. This article presents a neurobiologically based approach system where vocational interests are reviewed and integrated into a social neuroscientific model of career development and career interests. Want to know more? Hansen, J.-I., Sullivan, B., & Luciana, M. (2011). A social neuroscientific model of vocational behavior. Journal of Career Assessment, 19(3), 216–227.

Real World 3 Those who have experienced trauma have particular career counseling needs and may require individualized support for finding an occupation or transitioning from their current job. This recent paper provides solid guidelines for how counselors can help. Want to know more? Barrow, J., Wasik, S. Z., Corry, L. B., & Gobble, C. A. (2019). Trauma-informed career counseling: Identifying and advocating for the vocational needs of human services clients and professionals. Journal of Human Services, 39(1).

PART III

CLASSROOM ASSESSMENT

Now that you have the introductory material under your belt, are familiar with the basic “science-y” concepts, and know the main types of tests made and used by measurement professionals, it’s time to focus on the tests and measurement that are most common and with which you are actually most familiar: the tests you get in school! In Chapters 12 and 13, we will explore the two classroom assessment approaches (and there really are basically only two formats): selection items, where the correct answer is provided on the test but must be selected, and supply items, where the students must construct or supply the answer.

Selection items, like multiple-choice and matching items, are objectively scored. No judgment is required for these sorts of items. Supply items, like essay questions and student portfolios, are subjectively scored. The teacher must use judgment or their expertise to evaluate the quality of what is produced. Scoring is often more complex for these formats and can require the use of a rubric or scoring guide. Here’s an example of a good, but pretty easy supply item:

1. Name three reasons this book is just about the best thing ever. (Use extra paper if needed.)

You can see that there are many (many, many, many) correct answers. That’s what’s subjective about supply items.

Another way to understand the two approaches we are looking at in Chapters 12 and 13 is that they tend to measure two different things: knowledge and skills. For basic knowledge, and lower-level understanding, selection items tend to work best. For skills, and deep understanding, supply items tend to work best, especially the kind of performance-based assessment and complex supply items we cover in Chapter 13.

In working through the chapters in this part of the book, you should remember that your goal is to learn how to distinguish among the different types of test items (such as true–false and matching) and to understand how they are created and when they are best used. For example, assessing achievement might be better done using multiple-choice items (see Chapter 7), rather than using a writing sample (see Chapter 13), which would be best for assessing progress in a composition class.

The most important thing to remember about the chapters in Part III is that part of your tour consists of many different ways to create, and use, test items. Learn about them, and even practice creating them, and you’ll be ready to better understand their value when the time comes, be you a test developer, test giver, or test taker.

12 PICKING THE RIGHT ANSWER

Choose Your Own Destiny

Difficulty Index ☺ ☺ ☺ ☺ (really pretty easy)

LEARNING OBJECTIVES

After reading this chapter, you should be able to
• Describe the four most popular objectively scored classroom test item formats.
• Write a good multiple-choice item.
• Write a good matching section.
• Write a good true–false item.
• Write a good short-answer or fill-in-the-blank item.
• Argue a position on the validity and reliability of selection items.

Guess what? The title of this chapter says it all. Selection items allow you to pick the answer from among known possibilities. This also means you can guess and have some chance of being right. So, how can a format that involves chance be so popular? Because, counterintuitively, this format is often the most reliable (!) and often the most valid approach to assessment. Intrigued? Read on.

YOUR OLD FRIENDS

Let’s look at all those questions you either liked or hated (there don’t seem to be many people who fall in between), where the answer is either right or wrong and there’s just one right answer. (Yes, we know, some selection formats allow for more than one correct answer option, but just play along in order to keep things simple!) The most common objectively scored item formats are
• Multiple choice
• Matching
• True–false
• Fill in the blank and short answer

How Is Fill in the Blank a Selection Item Format?

Good question, smart aleck! And you are right: it’s not. ☺ But here’s our dilemma as we organize a book like this: The important difference from a measurement perspective between selection and supply items is that the scoring is objective for selection items and subjective for supply items. This affects validity and reliability. So (in our minds) this chapter and the next are really about objective scoring versus subjective scoring. And good fill-in-the-blank and short-answer items have only one correct answer, so they are objectively scored (even if technically the answer is supplied, not selected).

MULTIPLE-CHOICE ITEMS

Multiple-choice items are the ones you see all the time in school. They are so easy to score, so easy to analyze, and so easily tied to learning outcomes that multiple-choice items are super common. But beyond all the other great things about multiple-choice items, they are hugely flexible. And by that we mean it is very easy to create an item that exactly matches a teacher’s learning objective or very specific chunk of knowledge. And, like all objectively scored item formats, teachers can avoid arguments with students about how many points they should get. (Usually!) So they like this format for that reason, too.

In Chapter 7, we talked about the importance of taxonomies (such as Benjamin Bloom’s levels of understanding) and how these hierarchical systems can be used to help you define the level at which a question should be written. Well, while multiple-choice items allow you to write a question at any one of these levels, they work best at the lowest level of understanding, the memorized knowledge level. You can write them for higher levels, but it is somewhat more difficult. Why? Because multiple-choice items, by providing the possible answers, usually simply require recognition of the right answer and don’t force a student to create or organize a response. It is harder to measure critical thinking skills and creativity, which require deeper understanding, when “correct” answers need only be selected (and unique or innovative responses are not allowed). But want to assess tons of facts and introductory information? You can do that with multiple-choice questions easily, quickly, and efficiently.

Multiple-Choice Anatomy 101

Multiple-choice items consist of three distinct parts: a stem (the “question” or instruction that is being responded to), a correct answer (called the keyed answer), and several wrong answers (called distractors). Let’s look at the following item from a fifth-grade history lesson. We’ve labeled the different parts.

1. Who was the United States’ first vice president? [Stem]
A. Washington [Distractor]
B. Adams [Correct Answer]
C. Jefferson [Distractor]
D. Madison [Distractor]

The key to a great multiple-choice question is a set of terrific distractors: those answer options that are plausible but are not correct. The idea with a distractor is that it looks appealing to students who know a little about the topic, but maybe haven’t done all the reading or homework they were supposed to. They shouldn’t be designed to fool someone who actually has the knowledge or has met the learning objective.
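If it helps to see the anatomy as a data structure, here is a minimal Python sketch of the item above. The class name `MultipleChoiceItem` and its fields are our own illustrative choices, not part of any standard testing library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MultipleChoiceItem:
    """One multiple-choice item: a stem, answer options, and a keyed answer."""
    stem: str
    options: dict[str, str]  # option letter -> option text
    key: str                 # letter of the keyed (correct) answer

    def distractors(self) -> dict[str, str]:
        """Return every option that is not the keyed answer."""
        return {k: v for k, v in self.options.items() if k != self.key}

    def is_correct(self, response: str) -> bool:
        """Objective scoring: the response matches the key or it does not."""
        return response.strip().upper() == self.key


# The fifth-grade history item from the text.
item = MultipleChoiceItem(
    stem="Who was the United States' first vice president?",
    options={"A": "Washington", "B": "Adams", "C": "Jefferson", "D": "Madison"},
    key="B",
)
print(sorted(item.distractors()))  # ['A', 'C', 'D']
print(item.is_correct("b"))        # True
```

Notice that scoring needs nothing but the key. Whether the distractors are any good is a separate question the code cannot answer.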

How to Write Multiple-Choice Items: The Rules

Here is what research, theory, and (mostly) the experts say are the most important guidelines when writing multiple-choice items. There are many rules about designing these items, but we will just list the biggies. As the great and powerful Oz said, the best place to start is at the beginning.

1. Use 4 or 5 answer options and make sure all the distractors are plausible. The idea is that you want to make guessing as hard as possible: the more choices, the less luck will play a role. And the less chance involved, the more reliable the test is, right? Could you use more than 5 answer options? Sure, theoretically, but it gets hard to write so many plausible distractors.

2. Don’t include negatives in the stem or answer options. For example, don’t ask, “Which person was not a vice president?” Some test takers may not see the “not”; plus, we don’t store information as things that are not true. We tend to store knowledge as positive true facts; that’s much more efficient. We’re not sure even our super complex brains have enough room to store everything that is NOT true. (Though we know plenty about Bigfoot, and it’s pretty likely that they’re not real.)

3. Stems should be complete sentences. This means they should end with a question mark or a period (or, we suppose, an exclamation mark). They shouldn’t simply stop in the middle of a sentence, as in, “John Adams was . . .” This sort of structure is exhausting and really works the brain because you have to read all the answer options to even know what is being asked. It’s best to make the question completely clear in the stem itself.

4. Avoid grammar hints and other clues in the wording that might give away the answer. This may sound like a no-brainer, but many multiple-choice items are poorly constructed in such a way that the test taker can easily figure out what the correct answer is or at least can eliminate some of the distractors. Some of us got through school by being test-wise this way and taking advantage of questions written sloppily. For example (and for those of you who are aspiring cooks),

A mirepoix is a
A. mixture of onions, carrots, and celery.
B. ingredients for fondant icing.
C. entrée served on Remulak.
D. alternative to flour used in baking.

The only reasonable answer based on grammar alone is A. All the other alternatives are grammatically incorrect because the article a would be followed by a vowel (i, e, or a), and B is also plural when the stem calls for a singular answer. See, you don’t even have to know a whisk from crème anglaise to answer the above question correctly. By the way, this item breaks another rule, too, right? The stem is not a complete sentence!

5. Items need to be independent of one another. Multiple-choice items need to stand alone, and the answer on one item should not inform the test taker as to what the correct answer might be on another item. For example, an item early in a test may provide a clue or even an answer to an item that comes later.

These rules, by the way, apply to other selection formats, like matching items, which we talk about next.
And there are many more rules we could cover (and there’s a lot more to know about multiple-choice items), but we will stop here. In this chapter, we are concentrating on only one type of multiple-choice item (the one where you are only allowed to select one answer option), but there are several other, more complex types of multiple-choice items that you may want to consider. Some multiple-choice items are context dependent, where the questions can be answered only within the context in which they are asked, such as when the test taker is asked to read a passage and then answer a multiple-choice item about that passage. Then there’s the best answer type of multiple-choice item, where there may be more than one correct answer but only one that is best. Both of these types of multiple-choice items may work quite well, but they should be used only with additional training and experience. If you’re just starting, stick to the basic type of multiple-choice items, where there is only one correct answer.
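To put a number on the first rule (more answer options means less luck), here is a small Python sketch. It assumes blind guessing, where every option is equally likely to be picked, which is a simplification: real students usually rule out a weak distractor or two. The function names are ours.

```python
def chance_of_guessing(n_options: int) -> float:
    """Probability of picking the keyed answer by blind guessing."""
    if n_options < 2:
        raise ValueError("an item needs at least two answer options")
    return 1 / n_options


def expected_chance_score(n_items: int, n_options: int) -> float:
    """Expected number of items answered correctly by guessing alone."""
    return n_items * chance_of_guessing(n_options)


# Compare 2, 4, and 5 answer options on a hypothetical 20-item test.
for k in (2, 4, 5):
    print(k, chance_of_guessing(k), expected_chance_score(20, k))
```

On that 20-item test, a pure guesser expects about 10 right with two options per item, 5 with four options, and 4 with five. Going from four options to five buys only a little, which fits the advice that 4 or 5 plausible options is plenty.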

MATCHMAKER, MATCHMAKER, MAKE ME A MATCH Matching items are usually set up such that there are two columns—one column of what we could think of as short little stems and another column of many answer options. The directions are to match each stem with one of the answer options. (Sometimes stems are called premises in matching sections.) Sometimes the directions are general: “For each item in column A, indicate its match from column B.” Sometimes the directions are more specific: “For each U.S. state on the left, choose its capital city from the list on the right.” Whatever the instructions, the goal is to match ’em up. Basically, matching items are multiple-choice questions that share the same long set of answer options. A matching section on a test is used to assess a content area—be it history, biology, statistics, or the regulations governing NASCAR races. Like multiple-choice items and true–false items (which we will talk about next), matching sections involve selection, where the test taker needs to select one answer from a set of possibilities. With good matching items, all the answer options are within one area or topic and grammatically similar so that all the wrong answers work as good distractors. This makes guessing very, very hard! Here’s what a matching section might look like in an introductory statistics course. Directions: Column A contains brief descriptions of different measures of central tendency or measures of variability. Next to each item in Column A, put the letter of the item in Column B that matches the description. Answers may be used more than once or not at all.

Column A

___ 1. Sum of all scores divided by the number of scores
___ 2. The most frequently occurring score
___ 3. The distance between the highest and lowest score
___ 4. The score in the exact middle of a distribution
___ 5. The average reported when scores are at the nominal level
___ 6. The average distance of each score from the mean

Column B

A. The mean
B. The median
C. The standard deviation
D. The range
E. The mode
F. The deviation score
G. The variance

(The answer key is 1A, 2E, 3D, 4B, 5E, and 6C. How’d you do?)

How to Write Matching Items: The Rules

In addition to most of the guidelines we have already talked about with multiple-choice items (except that stems need not be complete sentences), there are a few suggestions specific to matching sections.

1. Have more answer options than stems. This will make it harder for students to guess the right answer correctly just by chance. It will also help clarify the distinctions between similar terms or facts or concepts within a topic.

2. Allow answer options to be used more than once or not at all. This helps in the same way as rule number 1. It makes it more likely that even test-wise students still have to actually know the answer to get it right.

3. Answer options should be shorter than the stems. This is actually a rule for multiple-choice items, as well. The idea is the stem gives enough structure for students to know the type of answer that is expected and they can scan answer options more quickly this way.

When creating and organizing matching items, try to make sure that they appear in homogeneous groups, that is, groups that are similar in content and level of difficulty. For example, if the test contains matching items on physics and meteorology (a subset of physics), don’t mix questions from the two topics in the same set of possible answers.
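Why do rules 1 and 2 frustrate the test-wise student? Here is a rough Python sketch of the process-of-elimination trick. It assumes a student who knows every stem but one and guesses evenly among whatever options are left. The function name and that simplified student model are our own assumptions.

```python
def chance_on_last_stem(n_stems: int, n_options: int, reuse_allowed: bool) -> float:
    """Chance of guessing the one unknown stem when all other stems are known.

    If each option can be used only once, every known answer eliminates one
    option. If options may be reused (or left unused), nothing is eliminated.
    """
    if n_stems < 1 or n_options < 1:
        raise ValueError("need at least one stem and one answer option")
    if reuse_allowed:
        remaining = n_options
    else:
        if n_options < n_stems:
            raise ValueError("without reuse, options must be at least as many as stems")
        remaining = n_options - (n_stems - 1)
    return 1 / remaining


print(chance_on_last_stem(6, 6, reuse_allowed=False))  # 1.0: the last match is free
print(chance_on_last_stem(6, 7, reuse_allowed=False))  # 0.5: one extra option helps
print(chance_on_last_stem(6, 7, reuse_allowed=True))   # 1/7: elimination no longer works
```

With six stems and six single-use options, the last answer is a giveaway. Add a seventh option and allow reuse, as in the statistics example earlier, and the student is back to actually needing to know it.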

ARE YOU LYING NOW, OR WERE YOU LYING THEN? Of all the types of items we cover in this part of Tests & Measurement for People Who (Think They) Hate Tests & Measurement, true–false items might seem like they are the easiest to write (because they are the shortest) . . . but beware. Short does not mean easy. Short can mean difficult to write, because you have little space to be very precise and say exactly what you mean. In other words, as your mother told you, “Say what you mean and mean what you say,” and this is perhaps more the case with true–false items than with any other type of item. True–false items are most often used for achievement-type tests when there is a clear distinction between the two alternatives, true and false. For example, the item A minute is 60 seconds. can be either true or false, but it cannot be both.

One of the best criteria for judging the value of a true–false item is whether the correct answer (be it true or false) is, unequivocally, the right one, the only one, and the correct one (get the picture?). But here’s the difficulty with writing these sorts of items—super smart (read annoying) students can often find arguments to make about how a true statement might be false or a false statement true. Even with a “clearly” true statement like this one. If you’re studying time warps then perhaps your minute does not have 60 seconds or if you mean a minute as in “Mom, I’ll be down in a minute” (which could be a half hour), then it might not be clearly true. This is one reason that we might even suggest you never use this format—but true–false items are out there, and some measurement folks argue there are good reasons to include them in your arsenal, so let’s figure out how to use them well!

How to Write ’Em: The Guidelines

As with the other objectively scored formats in this chapter, we’re going to give you the basics of how to write a true–false item. We’ve already emphasized that it is best to write a statement (not a question) that is absolutely true or false and not wishy-washy, so we won’t list that again here.

1. If you’re gonna use this format, really commit, and have lots of them! Theoretically, people have a 50% chance of guessing the correct answer, because there are only two possible answers: T or F. But across a bunch of these easy-to-guess items, good (or bad) luck will cancel itself out, so a total score of these items put together is fairly reliable. (Remember from Chapter 3 that reliability has to do with how much randomness is involved in the scoring.)

2. Have a roughly equal number of true and false items. It might be easier for you to write false items (or true items) and that pattern might be noticeable to the test taker. You don’t want to give any unintentional clues when you design an achievement test.

3. Avoid having the true statements longer than the false statements. To avoid arguments, teachers sometimes add extra words to true statements to help defend their “trueness.” And students might (consciously or unconsciously) just assume that sentences with more words in them are probably true. For instance, overly wordy true statements might look like this:

True or False: If you don’t count the earlier meetings of the Continental Congress, where there were chairmen who “presided” over the assemblies, George Washington was the first president of the United States.

4. Focus on one idea or fact in each statement. It is true for true–false items, and most other objectively scored formats, that one should avoid “double-barreled” items, questions that measure two things, instead of just one.

These items are confusing and it is unclear which learning objective is being measured. Both these issues affect validity (which you will recall from Chapter 4 has to do with whether an item or test measures what it is supposed to). Lee Cronbach, one of the most famous measurement specialists in the history of the discipline, did a study in 1950 where he found that when students do not have any clues to what the right choice is on a true–false item, they will more often than not select “true.” That coupled with the fact that he also found most items on a true–false test are true (because these items are easier to write) leads one to believe that unless you intentionally create an equal number of true and false items for any test, you will not get a true or fair picture of performance using true–false items. Pretty cool observation, no?
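The first guideline says that luck cancels out across lots of items. Here is a minimal Python sketch of that idea, assuming every item is an independent coin flip for a student who is purely guessing. The 70% cutoff is just an illustrative passing score we picked, not one from the chapter.

```python
from math import comb


def prob_at_least(n_items: int, min_correct: int, p: float = 0.5) -> float:
    """P(at least min_correct right out of n_items) when each guess succeeds with probability p."""
    return sum(
        comb(n_items, k) * p**k * (1 - p) ** (n_items - k)
        for k in range(min_correct, n_items + 1)
    )


for n in (10, 50, 100):
    cutoff = (7 * n + 9) // 10  # smallest whole number of items that is at least 70%
    print(n, cutoff, round(prob_at_least(n, cutoff), 4))
```

On a 10-item true–false quiz, a pure guesser reaches 70% about 17% of the time. Under the same assumptions, that chance shrinks sharply at 50 items and again at 100, which is why a long true–false test can be fairly reliable even though each single item is easy to guess.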

SUPPLY ITEMS THAT SCORE LIKE SELECTION ITEMS ARE FILL IN THE ________? Short-answer and completion items (fill-in-the-blank items are examples of completion items) are used almost exclusively to assess lower-level thinking skills such as memorization and basic knowledge. If you want a test taker to demonstrate knowledge of the chemical symbol for hydrogen, a completion item such as The chemical symbol for hydrogen is ____. is the perfect item (with H being the correct answer). And a short-answer version of the item is essentially the same, it just is formatted differently: What is the chemical symbol for hydrogen? These questions are not selection items, technically (and nontechnically); they are supply items, because the answer is not provided on the test, but must be supplied by the student. We include them in this chapter, though, as they are scored objectively. There is one keyed answer that is acceptable. These are harder than selection items, though, right? One can’t really have any hope of guessing the right answer—you pretty much either know it or you don’t.

Selected Rules for These Objectively Scored Supply Items Here are some guidelines especially relevant for designing short-answer and completion items.

1. If you have a choice, create a short-answer question rather than a completion question. Instead of saying,

The precursors of T cells leave the bone marrow and mature in the _____.



word this item as a question:



In what gland do T cells mature?



Why word it as a question? Questions are clearer and more straightforward, and they leave less room for ambiguity. For example, the answer to the completion or “fill-in-the-blank” item above could be “human body” and would be one of many correct answers. But the more specific and absolutely correct answer is “the thymus gland”—and the short-answer question much more clearly requires that type of response.

2. Avoid grammatical clues to the correct answer. Be sure that the structure and construction of the question make sense and do not inadvertently provide help. For example, take a look at the following item, which really clues the test taker in to what may or may not be the correct answer.

Hippocrates, the author of the Hippocratic oath, was trained as a _____.



Better would be



Hippocrates, the author of the Hippocratic oath, was trained as a(n) _____.



The correct answer is “physician.” The first item cues the test taker that the answer has to begin with a consonant (such as P for physician). The second item allows for an answer that begins with either a consonant or a vowel (such as E for elevator operator), decreasing the value of a grammatical cue and the likelihood of guessing.

3. Do not copy short-answer or completion items straight from the study material that test takers are using to prepare. This strongly encourages memorization beyond what is necessary (entire sentences and phrases rather than just facts). Although most short-answer and completion items focus on memorization, you want just the important information to be memorized, not everything verbatim.

4. For fill-in-the-blank items, use only one blank and put it at the end. Avoid items with lots of holes in them; they don’t provide enough structure to guide respondents as to what sort of answer is expected. For instance,

Hippocrates, the author of the _______________ _____, was trained as a(n) ____.

has as many holes as Swiss cheese (and assessment specialists call these items Swiss cheese items) and most students wouldn’t know where to begin to provide the correct info.
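Two of the guidelines above (avoid grammatical clues, and use only one blank) are mechanical enough to check automatically. Here is a rough Python sketch of such a check. The function names are hypothetical, and the patterns assume blanks are typed as runs of two or more underscores, as in the examples in this chapter.

```python
import re


def has_article_cue(stem: str) -> bool:
    """Flag stems whose blank directly follows a bare 'a' or 'an'.

    'a ____' hints that the answer starts with a consonant sound and
    'an ____' hints at a vowel sound; writing 'a(n)' avoids the hint.
    """
    return re.search(r"\b(a|an)\s+_+", stem, flags=re.IGNORECASE) is not None


def count_blanks(stem: str) -> int:
    """Count runs of two or more underscores, treating each run as one blank."""
    return len(re.findall(r"_{2,}", stem))


cued = "Hippocrates, the author of the Hippocratic oath, was trained as a _____."
fixed = "Hippocrates, the author of the Hippocratic oath, was trained as a(n) _____."
swiss = "Hippocrates, the author of the _______________ _____, was trained as a(n) ____."

print(has_article_cue(cued))   # True
print(has_article_cue(fixed))  # False
print(count_blanks(swiss))     # 3
```

A check like this only catches the obvious slips. It cannot tell you whether the item copies the textbook verbatim or whether the blank has more than one defensible answer.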

VALIDITY AND RELIABILITY OF OBJECTIVELY SCORED ITEMS It’s funny. Depending on how you look at it, objectively scored items, like multiple-choice questions, would appear to have high validity and low reliability OR low validity and high reliability! So which is it? Well, it is both, or it depends, or neither, or whatever ambiguous answer you’ve learned to expect from us college teacher types. Here’s what we mean. Let’s start with validity. Because there is only one correct answer, if a student gives the right answer, that’s at least some evidence that they have the knowledge. A computer could score it, no judgment required, so the item clearly measures what it is supposed to. That suggests high validity. But this is true only if what it is supposed to measure is low-level knowledge, the memorized sort of information that does not reflect deep understanding of a concept. If the test developer believes they are measuring deep understanding, they probably are not. So, in that case, the test does not work for its intended purpose and that suggests low validity. When it comes to reliability, selection items allow for students to guess the right answer. In the case of true–false questions, they could guess half of them correctly just by luck. With 4-answer-option multiple-choice questions, they can get 25% of them right just by answering randomly. Scoring with lots of randomness to it (like being able to guess) has low reliability, so, for that reason, one could be concerned that selection items have low reliability. On the other hand, reliability is not just affected by how students respond but also by how the items are scored. There can be randomness in the scoring, too. Subjectively scored tests, like the performance-based assessments we examine next, in Chapter 13, can lead to inconsistency in judging how many points a response should get and evaluating the quality of performance. 
Different teachers can interpret scoring rules differently or apply them differently or have different expectations and even the same teacher on different days can score the same response differently. (The scoring can even depend, unfortunately, on whether the teacher likes the student or not.) So, that inconsistency adds randomness to the scoring as well. Objective scoring, though, has standardized automatic scoring (answers either match the key or they do not), so there is no randomness in the scoring! In that sense, we could expect reliability to be high for selection items. Whew! Selection items can be high or low in validity and reliability at the same time. How’s that for the sort of big-time discussion that can make your head hurt? You are becoming an expert in tests and measurement, my friend.
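To see what “answers either match the key or they do not” looks like in practice, here is a minimal Python sketch of objective scoring. It reuses the answer key from the matching section earlier in this chapter. The function name and the made-up student responses are ours.

```python
def score_objectively(responses: dict[int, str], key: dict[int, str]) -> int:
    """Count responses that match the keyed answer, ignoring case and stray spaces."""
    return sum(
        1
        for item, keyed in key.items()
        if responses.get(item, "").strip().upper() == keyed.upper()
    )


# Answer key for the matching section shown earlier in this chapter.
matching_key = {1: "A", 2: "E", 3: "D", 4: "B", 5: "E", 6: "C"}

# A hypothetical student's answers: items 3 and 6 are wrong.
student = {1: "A", 2: "E", 3: "C", 4: "B", 5: "E", 6: "d"}

print(score_objectively(student, matching_key))  # 4
```

Run it today, tomorrow, or in a bad mood, and the same responses always earn the same score. That is the sense in which objective scoring adds no randomness of its own, even though guessing by students still can.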

Summary When you see classroom tests in the movies and on TV shows, they almost always are multiple-choice tests or similar formats where students are confused and have to somehow read the mind of the teachers. This is consistent with the whole “tests are bad and unfair” attitude toward classroom assessment. Well, that might be true, there are certainly bad tests out there, but when done well, these objectively scored types of items are super reliable and work really well for their intended purpose—assessing basic knowledge of facts and information.

Time to Practice

1. What are the three parts of a multiple-choice question, and what purpose does each serve?

2. What is wrong with this multiple-choice item?

River otters are not closely related to
a. Rats
b. Skunks
c. Sea otters
d. Wolverines

3. What are the advantages of matching items over multiple-choice items?

4. Okay, your job is to assess middle-school students’ knowledge of basic biology, and one of your colleagues is arguing that true–false items are too easy and can’t fairly assess knowledge-level information. What’s your best defense for the use of true–false items?

5. We included short-answer and fill-in-the-blank items with the other item formats in this chapter. Those other formats are selection items, while fill-in-the-blank and short-answer questions are supply items. So, what do they have in common?

6. What is wrong with this fill-in-the-blank item?

A triangle with equal lengths for all three sides is called an ______ triangle.

Want to Know More?

Further Readings

Gierl, M. J., Bulut, O., Guo, Q., & Zhang, X. (2017). Developing, analyzing, and using distractors for multiple-choice tests in education: A comprehensive review. Review of Educational Research, 87(6), 1082–1116.

There is, in fact, some science behind the art of writing multiple-choice questions and this reviews just about all we know about creating good distractors.

Vidwans, A., Gururani, S., Wu, C. W., Subramanian, V., Swaminathan, R. V., & Lerch, A. (2017, June). Objective descriptors for the assessment of student music performances. Paper presented at the Audio Engineering Society International Conference on Semantic Audio, Erlangen, Germany.

Can subjective topics be measured objectively? These authors study music and argue that even something as creative and artsy-fartsy as the performance of music can be assessed in an objective way.

And on Some Interesting Websites

So much of our assessment world is online or on computers these days. Here’s a simple site for kids that is fun. At http://www.abcya.com/skeletal_system.htm, see how a matching format can be applied within an electronic context.



Not every one of the tools at https://elearningindustry.com/free-testing-tools-for-onlineeducation is terrific, but many can make your test development and scoring much easier.

And in the Real Testing World

Real World 1

The type of item used in a test can actually reflect, quite accurately, what is learned. This study compared the effect of multiple-choice items against that of constructed-response items. Students took two forms of essentially the “same” test representing the two formats. A direct comparison of scores from the two tests showed that only 26% of people scored about the same, suggesting that the two forms measured different things.

Want to know more? Currie, M., & Thanyapa Chiramanee, T. (2010). The effect of the multiple-choice item format on the measurement of knowledge of language structure. Language Testing, 27, 471–491.

Real World 2

Does taking tests actually help you learn more? Not usually with objectively scored formats, but this study looked at whether taking a type of true–false test where students were asked to make false statements true would help people remember the learned information. It turned out that it did, especially compared to simple note taking where students rewrite sentences (e.g., from the textbook).

Want to know more? Uner, O., Tekin, E., & Roediger, H. L., III. (2022). True–false tests enhance retention relative to rereading. Journal of Experimental Psychology: Applied, 28(1), 114–129.

Real World 3

At last, multiple-choice questions are given a break after lots of criticism that using the correct answer as one of the alternatives can tend to encourage recognition rather than recall. These researchers tested whether multiple-choice tests, compared with cued-recall tests, could trigger retrieval using alternatives that were plausible enough to enable test takers to retrieve both why the correct alternatives were correct and why the incorrect alternatives were incorrect. Both testing formats helped retention of previously tested information, but multiple-choice tests also facilitated recall of information pertaining to incorrect alternatives, whereas cued-recall tests did not. Their conclusion? Multiple-choice tests can be constructed so they exercise the very retrieval processes they have been accused of bypassing.

Want to know more? Little, J. L., Bjork, E. L., Bjork, R. A., & Angello, G. L. (2012). Multiple-choice tests exonerated, at least of some charges: Fostering test-induced learning and avoiding test-induced forgetting. Psychological Science, 23, 1337–1344.


13

BUILDING THE RIGHT ANSWER

Construction Work Ahead

Difficulty Index ☺ ☺ ☺ (Designing these items involves high-level thinking!)

LEARNING OBJECTIVES

After reading this chapter, you should be able to

• List two major classroom assessment formats that are performance based.
• Design good constructed-response items.
• Design a good essay item.
• Create a really good rubric.
• Explain the strengths of portfolios.
• Argue a position on the validity and reliability of performance-based assessment.

Traditional “paper-and-pencil” items, like multiple-choice questions, work well when measuring knowledge and learned information. And we spent all of Chapter 12 talking about that approach. But what if you want to measure student skills and abilities? It’s not an unimportant question, because nowadays teachers probably focus on these learning objectives more than basic memorized knowledge (though multiple-choice questions aren’t going away any time soon). Well, to measure skills and abilities, which are invisible, teachers find ways to make them visible. They do this by asking students to perform (like give a speech or sing a song or conduct an experiment) or make a product (like a birdhouse, or a painting, or a written essay). This approach is often labeled performance-based assessment.

PERFORMANCE ANXIETY: MEASURING SKILL AND ABILITY

Much of the assessment that happens in the modern classroom doesn’t look like assessment. If you looked in the window of a fifth-grade classroom while assessment was going on, you’d be more likely to see students actually doing something (writing, designing, drawing, working together to build a robot, dancing, acting, debating) than sitting quietly trying to match stems with the right answer options. You will see students constructing a response to a test question or assignment or performing some skill. These constructed responses and performances are unique. There is not a keyed correct answer. Instead, responses vary in quality and teachers assign scores based on the level of quality. In fact, a special tool is used to score performance-based assessments: a scoring guide called a rubric.

We will focus on two types of performance-based assessments because they are representative of the whole gamut of possibilities. We don’t have the room to discuss all the performance-based assessment that happens in biology classes (ever dissected a frog?) and gym (can you serve a volleyball?) and band (who will be first flute?) and math (prove this theorem!) and theater (thanks for your audition, but you might be more useful running lights for us). Instead, let’s talk about constructed-response items and essays.

PUTTING IT TOGETHER: CONSTRUCTED-RESPONSE ITEMS

Constructed-response items are assessment tasks that ask students to create a complex product or written answer. We are thinking of products such as maps, graphs, reports, mathematical proofs, posters, diagrams, crafts, sculptures, and so on! Even written assignments like essays, research papers, short stories, and book reports, because they are created by students and have no one right answer, can be counted here. Constructed-response items are a type of supply item, right, not a selection item, because students supply the answer rather than selecting it. Teachers can use constructed-response items to assess knowledge and basic understanding, but this powerful assessment approach probably shouldn’t be wasted on simpler stuff. Selection items do a great job of that. Constructed-response items are best used for measuring complex student skills and abilities. And, though we talk about them in this section of the book that focuses on classroom assessment, large-scale tests like many “state tests” required in the schools use constructed-response formats all the time and testing companies even have huge buildings full of people who are trained to score them.

What Constructed-Response Items Look Like

By definition, a constructed-response item includes a described task, the stimulus or instructions or prompt that tells the student what they are supposed to do or make, and a response, the student’s “answer.” When used correctly, the response has more than one piece (which makes it complex) and requires several choices during construction (which reflects the student’s level of skill). These items almost always require a rubric, or scoring guide, in order to be scored, and human judgment is necessary to score them meaningfully. Check out this constructed-response item:

Here are some data. Make a chart that displays these data.

Student ID    Test Score
1             7
2             8
3             8
4             10
5             10

The answers are likely to look different from each other and there are many responses that are high-quality answers. Figure 13.1 shows three different responses. We don’t know the criteria used by the teacher for what makes for a quality response, but all three of these charts might get the same score. Or they might not. It’s almost as if we need a valid and reliable way of assigning points to constructed-response items! (Good thing we learn about rubrics later in this chapter.) This is a good constructed-response item because there are many components to score (which increases reliability) and the choices that students make in constructing the charts reflect skill and ability (which increases validity).

Can Computers Make Judgments?

Part of our discussion about constructed-response items emphasizes that humans have to score them (unlike a multiple-choice test, which can be scored by a computer). But is that true? And if it’s not true, what does that mean in terms of validity? Well, for written essays, like the analytical essays required by tests like the SAT, it is intriguing to discover that software exists that can score essays and assign points to them in almost exactly the same way as trained human judges do! The explanation given for this is that many of the criteria used for high-level writing, like complex sentences, a large vocabulary, essay length, and the use of words that suggest critical thinking, can be recognized by programmed algorithms. Though this isn’t the same as evaluating an underlying ability to write analytically, it correlates strongly. In other words, a good writer uses these elements, so we’d expect a lot of them in a sample of good writing. Does it matter what is actually reflected in a score, if it is ultimately the correct score? That’s for you and other philosophers to decide.
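To make the idea of algorithm-recognizable criteria concrete, here is a minimal Python sketch that counts the kinds of surface features mentioned above: essay length, vocabulary size, sentence length, and words that hint at critical thinking. It is only an illustration under stated assumptions. The function name `surface_features` and the `SIGNAL_WORDS` list are made up for this example, and real essay-scoring software is far more sophisticated than this.

```python
import re

# Hypothetical list of words that might signal critical thinking.
# This list is an assumption for illustration, not taken from any real scoring engine.
SIGNAL_WORDS = {"because", "therefore", "however", "although", "consequently", "whereas"}


def surface_features(essay: str) -> dict:
    """Count a few simple surface features of an essay."""
    # Words: runs of letters or apostrophes, lowercased so "However" matches "however".
    words = re.findall(r"[A-Za-z']+", essay.lower())
    # Sentences: split on end punctuation and drop empty pieces.
    sentences = [s for s in re.split(r"[.!?]+", essay) if s.strip()]
    return {
        "essay_length": len(words),
        "vocabulary_size": len(set(words)),
        "avg_sentence_length": len(words) / len(sentences) if sentences else 0.0,
        "signal_word_count": sum(1 for w in words if w in SIGNAL_WORDS),
    }


sample = (
    "Tests measure knowledge. However, essays reveal reasoning "
    "because writers must organize ideas."
)
features = surface_features(sample)
print(features)
```

Notice that nothing in this sketch judges whether the argument is any good. It only counts things that tend to show up in good writing, which is exactly the validity question raised above.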

[FIGURE 13.1 Possible Responses to Our Example Item. Three charts of the same data: a pie chart titled “Student Performance” (score of 7, 20%; score of 8, 40%; score of 10, 40%), a bar chart titled “Test Scores” counting the students at each score, and a bar chart titled “Test Scores” showing the score for each of Students 1 through 5.]

Designing Good Constructed-Response Items

Here are a few important guidelines to follow when constructing this type of performance-based item.

1. Be clear on what it takes to get all the points. What does a good answer look like? What is quality? Many teachers decide on the elements of the quality necessary for a well-built product even before they teach the skills required. This way they can identify instructional objectives and plan classroom activities based on those criteria.

2. Share your scoring criteria with the test takers. This may be one time when it is okay to “teach to the test”! We talk later about the importance of using scoring guides, or rubrics, to score constructed-response items. There shouldn’t be anything secret about what a high-scoring “answer” looks like, right? The cool teachers share their rubrics with their students.

3. Give good instructions. Because constructed-response items are supply items and super open-ended, it might not be clear what even the general nature of the response should be. If a diorama representing the play Romeo and Juliet should include a balcony, then say so. It might not be obvious. If a drawing of the opening chapter of the book The Hunger Games should show Katniss, then make that clear.

DOING THINGS THE WRITE WAY

Essay questions are perhaps the most unrestricted type of written assessment item we cover in Tests & Measurement for People Who (Think They) Hate Tests & Measurement. What you want to know is how well the test taker can organize information and express their ideas in writing. That’s why the really, really tough exams in one’s academic career are usually of the essay type, such as the written exams you often get in grad school for master’s and doctoral training (and even standardized tests sometimes, such as AP subject exams). These types of items tap complex abilities and skills, so teachers use them frequently as part of a performance-based assessment strategy. Essay items are the item of choice if you want an unrestricted response and want to assess critical thinking, such as the relationship between ideas or the pros and cons of a particular argument.

Essay questions come in two basic flavors: open-ended (also called unrestricted or extended) questions and closed-ended (also called restricted) questions. An open-ended (or unrestricted response) essay question is one where there are no restrictions on the response, including the amount of time allowed to finish, the number of pages written, or the material included. Now, it is a bit impractical to allow test takers to have 25 hours to answer one essay question or to write hundreds of pages about anything they like. So of course there are practical limits. It’s just that the limits do not define the scope of the response. For example, here’s an open-ended essay question.

• Discuss the various theories of human development that have been addressed this semester. In your discussion, you may compare and contrast the basic assumptions of the theories, present criticisms of each, or use any strategy you’d like to demonstrate your understanding of these theories. Your paper should be 10 pages or less and is due in two weeks.


Now take a look at this closed-ended (or restricted response) essay question.

• Compare and contrast the three basic theories of human development that have been discussed this semester. Your paper should be 9 or 10 pages and is due in two weeks.

These two types of questions reflect different types of experiences. The first question, which is much less restrictive, gives the test taker a lot more flexibility (among other things) and allows for a more creative approach. The more restricted closed-ended question places definite limits on the content as well as the format. As worded, that second essay question could fairly easily be replaced by a nice matching section. You might think that almost everyone would like to have as much flexibility as possible, but that’s just not the case; many people like a very well-structured and clearly defined task assigned to them. So neither is better than the other; it just depends on whether the learning objective is memorized information or development of analytical writing skills.

How to Write Essay Items: The Guidelines

Here are just a few guidelines that will be helpful when it comes time to write an essay question.

1. Allow adequate time to answer the question. By their very design, essay questions can take a considerable amount of time to answer. Regardless of whether an essay question is closed- or open-ended (remember, we have to be practical), you need to tell the test taker how much time they have to complete the question. And how much time should that be? Whether the essay question is part of a classroom test or a take-home assignment, keep in mind that essay questions require test takers to think about and then write the response. One strategy is to encourage test takers when they are practicing to plan their response by spending 30% of their time outlining or “sketching” the response, 60% of their time writing the response, and then the last 10% or so rereading what they have written and making any necessary changes (this is the tried-and-true “check your work” advice).

2. Be sure the question is complete and clear. This one sounds simple and it may indeed be, but sometimes essay questions are not very clear in their presentation. Want to know why? Because it’s not clear what the person writing the question wants to know. For example, here’s an unclear essay question.

Discuss the impact of the Civil War on the economy of the postwar South.


It’s not like this question is that poorly designed, but it sure does not reflect a clear notion of what was learned or what is being assessed. This is the kind of topic some historian could write seven volumes about! Look how much clearer the following question is.

Discuss the impact of the Civil War on the economy of the postwar South, taking into account the following factors: reduction in the workforce, international considerations, and the changing role of agriculture.

3. Essay questions should be used only to evaluate higher-order outcomes, such as when comparisons, evaluations, analyses, and interpretations are required. Want to know what 64³ is? (262,144) The infant mortality rate in the United States in 2001? (6.9 per 1,000 live births) What the French called tomatoes? (Pomme d’amour for “apple of love”) If so, an essay question is not what you are looking for; you want the kind of item that tackles lower-level (but not necessarily low) thinking skills, such as knowledge or memorization. Performance-based formats are difficult to score with high reliability (see the Validity and Reliability of Constructed-Response and Performance-Based Items section toward the end of this chapter), so you should only use them when your goal is better validity (i.e., when you need to really get at those complex skills and deeper understanding).
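The 30/60/10 planning split suggested in the first guideline is easy to turn into actual minutes. The short Python sketch below does the arithmetic for any time limit. The function name `plan_essay_time` and the 50-minute class period are assumptions chosen only for illustration.

```python
def plan_essay_time(total_minutes: float) -> dict:
    """Split an essay time limit into outlining, writing, and rereading."""
    # Percentages from the guideline: 30% outline, 60% write, 10% reread and revise.
    shares = {"outline": 30, "write": 60, "review": 10}
    return {step: total_minutes * percent / 100 for step, percent in shares.items()}


# Hypothetical example: a 50-minute class period.
plan = plan_essay_time(50)
print(plan)
```

For a 50-minute period, that works out to 15 minutes of outlining, 30 minutes of writing, and 5 minutes of checking your work.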

QUALITY IS JOB ONE: RUBRICS

Imagine scoring answers when they aren’t either right or wrong, but instead differ only in quality. That sounds a lot tougher than scoring multiple-choice tests, right? It is, and there is a technology designed especially for scoring constructed-response formats and all those performance-based assessments that measure skill development. That cool tool is a rubric. A rubric is a scoring guide designed to assess the quality of student-made products and performances that allows for the rating of multiple pieces using multiple criteria.

What Does a Good Rubric Look Like?

A rubric is a written set of scoring rules, often in the form of a table. It identifies the criteria and required parts and pieces for a quality answer or a quality product. The relative weighting of the criteria and pieces is shown and the possible range of points for each component is given. Table 13.1 shows a typical rubric, this one for a seventh-grade art class, drawing and coloring a self-portrait. (I don’t know the author of this real-world rubric; it was taken from a great free website, Rubistar, at rubistar.4teachers.org, which has thousands of rubrics uploaded by real teachers around the world.)

TABLE 13.1 Scoring Rubric for Drawing a Self-Portrait

Category: Followed directions
  3 points: The student completely followed all directions as announced by the teacher.
  2 points: The student followed most of the directions.
  1 point: The student followed very few directions.
  0 points: The student did not follow directions or made no attempt to follow directions.

Category: Technique and skill
  3 points: The proportions of the face features are correct and the portrait was drawn well.
  2 points: There are few mistakes with the proportions and the portrait was drawn satisfactorily.
  1 point: There are mistakes with the proportions and the portrait was drawn okay.
  0 points: The portrait was unacceptable.

Category: Craftsmanship
  3 points: The project was complete with no scribbling. The paper is neat and clean with no wrinkles in the paper.
  2 points: The project was fairly complete with some scribbling and marks. The paper might appear to be wrinkled.
  1 point: The project was poorly completed and the coloring looks messy. The paper appears dirty and wrinkled.
  0 points: Little or no attempt at completion. The coloring is sloppy and the paper is messy and wrinkled.

Category: Effort and behavior
  3 points: Great effort was shown and the behavior was respectful.
  2 points: Good effort was shown and the behavior was good.
  1 point: Some effort was shown with fair to poor behavior.
  0 points: No effort was shown and the behavior was unacceptable.

Notice the following nice things about this rubric that make it pretty good:

• There are four different areas that the teacher evaluates: following directions, technique and skill, craftsmanship, and effort and behavior. As the expert, the teacher has decided that these are the key indicators of quality when it comes to making a self-portrait.

• A range of scores is possible (0 to 3) for each of these areas, which allows for awarding points for a range of levels of quality. The multiple indicators (which are like multiple items), along with a range of points possible, allow for greater validity in the scoring.

• There are descriptors for each possible score that make expectations as concrete as possible. The more objective the scoring can be for performance-based tasks, the greater the reliability. Remember interrater reliability from Chapter 3? Subjective scoring leads to randomness, which lowers reliability. So, defining what each score means in a range of possibilities helps make scoring more precise.
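If it helps to see the rubric as a scoring procedure, here is a minimal Python sketch that totals one student’s ratings on the four criteria from Table 13.1. The criteria names and the 0-to-3 range come from the table. Everything else is an assumption for illustration: the function name `score_portrait`, the equal weighting of the four criteria (a real rubric might weight them differently), and the example ratings, which are made up.

```python
# Criteria from Table 13.1; each one is rated from 0 to 3 points.
CRITERIA = [
    "Followed directions",
    "Technique and skill",
    "Craftsmanship",
    "Effort and behavior",
]
MAX_POINTS = 3


def score_portrait(ratings: dict) -> dict:
    """Total a set of rubric ratings, assuming every criterion counts equally."""
    for criterion in CRITERIA:
        if criterion not in ratings:
            raise ValueError(f"Missing rating for: {criterion}")
        if not 0 <= ratings[criterion] <= MAX_POINTS:
            raise ValueError(f"Rating out of range for: {criterion}")
    total = sum(ratings[criterion] for criterion in CRITERIA)
    possible = MAX_POINTS * len(CRITERIA)
    return {"total": total, "possible": possible, "percent": 100 * total / possible}


# Hypothetical ratings for one student's self-portrait (not from the book).
example = {
    "Followed directions": 3,
    "Technique and skill": 2,
    "Craftsmanship": 2,
    "Effort and behavior": 3,
}
result = score_portrait(example)
print(result)
```

This student earns 10 of 12 possible points. The checks at the top of the function mirror what a careful scorer does by hand: every criterion gets a rating, and every rating stays inside the range the rubric defines.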


What’s So Great About Rubrics?

Rubrics help make up for the potential weakness of performance-based assessment: the subjective nature of the task means judgment is necessary when assigning scores. In addition to improving the reliability of performance-based assessment, though, rubrics help teachers in other ways.

1. Rubrics allow for quick scoring and quick feedback. Rubrics can be used to provide frequent formative feedback, so students can monitor and adjust their own learning. This means teachers spend less time grading and more time teaching.

2. Rubrics improve teaching. The process of rubric development helps teachers to analyze and identify what is most important by explicitly focusing on those characteristics that define high-quality performance.

3. Rubrics encourage the growth of student meta-cognitive and critical thinking skills. Students who participate in rubric creation and application start to think about their own learning and develop the ability to judge for themselves the quality of their own work.

4. Rubrics allow for meaningful sharing of student growth. Teachers break learning goals down into specific skills, components, or criteria of quality when conferencing with parents.

MORE THAN A NUMBER: PORTFOLIOS

Tests (you know, you come into a room and take one) are not necessarily always the best way to evaluate an individual’s performance. And assessing a student’s ability can never really be done well by boiling everything down to a single score. What is sometimes needed is a three-dimensional, high-definition, 5G collection of many examples of a student’s work. That, our friends, is a portfolio. Historically, portfolios have always been used to evaluate ability in the arts, both performing and creative, but for a generation now, they have been used in the classroom, as well. They benefit teachers and students and parents and administrators. That’s pretty much everyone, which explains their popularity.

A portfolio is a collection of work that shows effort, progress, and accomplishment in one or more areas. For example, a student’s lab manual, plans for experiments, results of experiments, data, ideas about future research, and lab reports might make up a portfolio in a chemistry class. Or a student’s poetry, journal, essays, and impressions of other students’ work might make up a portfolio in a creative writing class. Want to evaluate how well a student in social work interviews a client? Why not have them tape the interview and then include the taped session in their portfolio with other interviews, written reports, and maybe traditional tests, and then you can evaluate their overall set of competencies? The student learns so much more about what they are doing right and wrong. Think you could design a single test that would accurately evaluate student performance as a social worker? Probably not.

As a means for assessment, portfolios are clearly different from other, more traditional methods such as achievement tests, which examine one’s knowledge of an area. Where a traditional achievement test is more or less bound by time and content (such as introductory earth science), a portfolio allows the student and the teacher to expand the format of the material being evaluated. But, of course, along with that comes a price: the effort needed to create a portfolio and the effort needed to evaluate it. Portfolios take a great deal of time and energy to design and then create, and it takes an equally long time to fairly judge each element in the portfolio. Table 13.2 contains a summary of the advantages and disadvantages of using portfolios.

TABLE 13.2 Advantages and Disadvantages of Using Portfolios

Advantages of Using Portfolios
• They are flexible.
• They are highly personalized for both the student and the teacher.
• They are an attractive alternative to traditional methods of assessment when other tools are either too limiting or inappropriate.

Disadvantages of Using Portfolios
• They are time-consuming to evaluate.
• They do not cover all subjects well, nor can they be used with all curriculum types.
• The scoring of portfolios is relatively subjective.

Educational researchers who spend a lot of time thinking about classroom assessment tend to fawn all over portfolios because they might be fairer for all students compared to, say, a multiple-choice test or a single essay assignment, and they involve students in the assessment process, which can actually increase learning. Here are some of the characteristics of portfolios that make them “evidence based,” as scientists like to say:

• A good portfolio is both formative and summative in nature. It is formative in that the evaluation is continuous, with efforts and accomplishments recognized as the portfolio is being created (say, every 3 weeks or every four elements of the portfolio), and summative in that there is a final evaluation.

• A portfolio is a product that reflects the multidimensional nature of both the task and the content area. For example, the art student who is applying to a college-level art school has to assemble a portfolio that reflects their interests and abilities. The task at hand is to exhibit one’s ideas, so the student should think broadly in creating that portfolio and include lots of images of different media (drawings, clay creations, paintings) and not limit the content to, say, functional pottery (how about some sculpture?) or abstract drawings. The portfolio invites the student to be expressive and think both differently and big (in size and ideas). That’s the beauty of the portfolio as an evaluation tool.

• Portfolios allow students to participate directly in their own growth and learning. While being closely monitored (by a teacher or supervisor), the student can participate (with feedback) in the process of creating each element and gets to think and consider the direction in which their work is going and make adjustments as they go. This can even be done with young children, research shows.

• Finally, portfolios allow teachers to become increasingly involved in the process of designing and implementing curriculum. In many educational settings, teachers are told what they need to teach and even how. With use of portfolios, the what becomes much more a part of the teacher’s everyday activities, and the how results in a close integration of classroom activities and each teacher’s own philosophy.

That is what a good portfolio looks like. But it is very important to remember that a good portfolio is not just a collection of an individual’s work. It’s a systematically organized and documented group of elements and goals that meets predefined objectives. And it includes a final phase where students and their teachers reflect on what has been accomplished and, ideally, can see progress and growth over time!

VALIDITY AND RELIABILITY OF CONSTRUCTED-RESPONSE AND PERFORMANCE-BASED ITEMS

Open-ended supply items, especially ones as complex as performance-based items, are known for high validity. And they certainly tend to be more valid than a multiple-choice question. This thinking, though, is based on the assumption that one wants to measure skills and abilities and deep understanding of processes. Making skills visible would seem to require that people do something to display those skills. And asking students to perform (like give a speech or play the piano) or build a product (like writing a persuasive essay) is a good way to get them to do something. The use of a scoring rubric that has made concrete (as much as possible) what to look for while scoring student responses is an even better way to validly judge skill and ability.


Those same rubrics that help with validity, though, still rely on subjective scoring. And when a measurement person sees “subjective,” they think “random.” The subjective use of even well-written rubrics is a problem when it comes to reliability. We talked in Chapter 12 about how selection items like matching or multiple-choice formats do great when it comes to reliability, because there is no randomness (or very little, at least) in the actual scoring part. There is randomness in how students respond, which might hurt reliability, but generally speaking, selection items have greater reliability than supply items, especially subjectively scored items like constructed-response items and other performance-based formats.
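One way to put a number on how much randomness subjective scoring adds is to have two people score the same work with the same rubric and see how often they agree. The Python sketch below computes simple percent agreement and Cohen’s kappa, a standard statistic that adjusts agreement for chance. The chapter itself does not name kappa, so treat its use here as an assumption. The function names and the two teachers’ scores are hypothetical, made up for illustration.

```python
from collections import Counter


def percent_agreement(rater_a: list, rater_b: list) -> float:
    """Proportion of products on which two raters gave the same score."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("Both raters must score the same non-empty set of products")
    matches = sum(1 for a, b in zip(rater_a, rater_b) if a == b)
    return matches / len(rater_a)


def cohens_kappa(rater_a: list, rater_b: list) -> float:
    """Agreement corrected for chance (undefined when chance agreement is 1)."""
    n = len(rater_a)
    observed = percent_agreement(rater_a, rater_b)
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    # Chance agreement: probability both raters pick the same score by accident.
    expected = sum(counts_a[score] * counts_b[score] for score in counts_a) / (n * n)
    return (observed - expected) / (1 - expected)


# Hypothetical rubric scores (0 to 3) from two teachers on the same ten portraits.
teacher_1 = [3, 2, 2, 1, 0, 3, 2, 1, 3, 2]
teacher_2 = [3, 2, 1, 1, 0, 3, 2, 2, 3, 2]

agreement = percent_agreement(teacher_1, teacher_2)
kappa = cohens_kappa(teacher_1, teacher_2)
print(f"Percent agreement: {agreement:.2f}")
print(f"Cohen's kappa: {kappa:.2f}")
```

With these made-up scores, the teachers agree on 8 of 10 portraits (0.80), and kappa comes out to about 0.71 once chance agreement is taken into account. Clearer score descriptors in the rubric should push both numbers up.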

Summary

If you want to assess deep understanding of a concept or skills and abilities, it is almost impossible to do with a bunch of multiple-choice questions. Right? You have to ask people to make something or do something you can observe. Constructed-response items and performance-based assessment allow you to make invisible skills visible. And the good old essay assignment works well to assess writing ability, as well as depth of understanding. Teachers can even “see” development and improvement and change with portfolios that students collect.

Time to Practice

1. What is the general value of using essay questions as part of (or as) an entire test?

2. In your area of interest, write one stunningly terrific essay question. Then exchange with a classmate and evaluate their question according to the guidelines presented in this chapter.

3. What fields have used portfolios for a long time before general education caught on to their value?

4. What are some advantages of portfolios?

5. Why are constructed-response items called constructed-response items?

6. Constructed-response items work well when teachers want to assess skill or ability. Why are they better than even well-written multiple-choice questions for that purpose?

7. What are important characteristics of well-designed rubrics?

8. The subjectivity of rubrics makes for relatively low reliability. What are some ways to increase reliability for rubrics?


Want to Know More?

Further Readings

• Wolf, T. J., Dahl, A., Auen, C., & Doherty, M. (2017). The reliability and validity of the Complex Task Performance Assessment: A performance-based assessment of executive function. Neuropsychological Rehabilitation, 27(5), 707–721.

Executive functioning is the ability of adults to make decisions and regulate their daily lives. When people have strokes, their executive functioning is often disrupted and recovery depends on valid and reliable data on their ability to perform these tasks. This study analyzed the validity and reliability of an assessment designed for that purpose. Because it is a subjectively scored test, interrater reliability is particularly crucial and this test was found to have extremely high interrater reliability.

• Jhangiani, R. S. (2016). The impact of participating in a peer assessment activity on subsequent academic performance. Teaching of Psychology, 43, 180–186.

The purpose of this study was to examine the impact of participation in a peer assessment activity on academic performance. But for our purposes, the really interesting aspect was that participants were asked to take a short-answer test as well as to write two short essays. Just another example of how essay questions can supplement other assessment methods, in addition to standing on their own.

And on Some Interesting Websites

• What could be better than a sample essay selected from the best essays on writing ever written (really?). Take a look at https://www.flavorwire.com/429532/10-of-the-greatest-essays-on-writing-ever-written.

• Computers can grade essay questions? What does this high school newspaper think of that! https://ihsjournalism.online/1815/features/what-happens-when-computer-grades-our-essays/.

• Ever hear of the IELTS? It’s the International English Language Testing System, which is designed to assess the language ability of those who need to study or work where English is the primary language. It’s widely used in the United Kingdom (and in the United States) as an admissions test for college. How does it work as a performance-based assessment? Find out at https://www.ielts.org/en-us.

And in the Real Testing World

Real World 1

Portfolio assessment is a type of classroom assessment that is called formative assessment because feedback is given while learning is still happening. As such, this type of assessment can actually help students learn and grow, not just assess the learning and growth after it has happened. A physics teacher in Indonesia wondered if portfolio assessment could improve students’ attitudes toward science. The results showed that attitudes like curiosity, respect for the scientific process, and so on, all became more positive for the students who were involved in portfolio assessment, but not so much for those who only took traditional assessments.

Want to know more? Wartawan, P. G. (2017). The effectiveness of the use of portfolio assessment by controlling prior knowledge to enhance scientific attitude among senior high school students. International Journal of Physical Sciences and Engineering, 1(3), 9–18.


Real World 2

One characteristic of constructed-response items that we said was a good thing was that students had to actually do something. With computer-based assessments, the number of actions actually taken while constructing a response (clicking, moving the mouse, making changes) can be counted, and researchers wondered if that might give some insight into level of skill. What they found was that students who do more while responding to a performance-based online assessment are likely more engaged and that might affect performance.

Want to know more? Ivanova, M., Michaelides, M., & Eklöf, H. (2020). How does the number of actions on constructed-response items relate to test-taking effort and performance? Educational Research and Evaluation, 26(5–6), 252–274.

Real World 3

The majority of high-stakes tests from elementary school through postsecondary education include an essay as a measure of writing performance, and for a student with a writing-related disability, this can present a significant barrier. This study investigated the influence of handwritten, typed, and typed/edited formats of an essay and the contribution of spelling, handwriting, fluency, and vocabulary complexity to the quality scores. Results showed that vocabulary complexity, verbosity, spelling, and handwriting accounted for more variance in essay quality scores for writers with dyslexia than for their peers without dyslexia.

Want to know more? Gregg, N., Coleman, C., Davis, M., & Chalk, J. C. (2007). Timed essay writing: Implications for high-stakes tests. Journal of Learning Disabilities, 40, 306–318.

PART IV

RESEARCHER-MADE INSTRUMENTS

You just finished learning about classroom assessment, a situation where regular people (like teachers, to be specific) design and build their own tests. Well, another situation where regular people make their own measures, instead of using a professionally produced, standardized instrument, is in the world of social science research.

Real-world researchers routinely measure their variables of interest using methods they designed themselves. This often takes the form of surveys to ask people questions about some fuzzy, hard-to-see construct, like an attitude or a personality trait or a perception. But, sometimes, researchers make their own achievement tests (like in Chapter 7) to measure knowledge or skill and other homemade tests similar to those described in Part II. It is likely you are reading these words because you are training to do social science research or, at least, engage in social science data collection and interpretation. At this point (if you have read the chapters roughly in order), you have the skills to make your own measures. And if you need a test that measures learning, aptitude, cognitive skills, and so on, we have already introduced the methods for how you could make your own instrument that produces scores with validity and reliability. And we have even talked about the ways you could demonstrate validity and reliability for your own measurement approach. But we haven’t, yet, talked about surveys, a tool used by researchers all the time. Chapter 14 covers this important topic.


14

SURVEYS AND SCALE DEVELOPMENT

What Are They Thinking?

Difficulty Index ☺ ☺ ☺ (some art and some science)

LEARNING OBJECTIVES

After reading this chapter, you should be able to

• List the steps of survey construction and scale development.
• Write really good survey questions.
• Compare and contrast different attitude measurement strategies.
• Apply best practice for ensuring a good survey response rate.

We give a lot of thought to what should be covered in an introductory textbook on the world of tests and measurement. And in earlier editions of this book, we were fairly content with the wide variety of topics and test types and measurement approaches we were able to include in the limited space we are working with. (And publishers care about length; we still haven’t found anyone to distribute our six-volume Strengths and Weaknesses of Using 5 Multiple-Choice Answer Options Instead of 4.) But we realized that we were missing a key example of how measurement is used by researchers and, other than classroom tests, probably the most common type of social science measurement with which you are familiar: the survey or questionnaire. So, our apologies, and here now, for your reading pleasure, a brand spanking new chapter just for you on survey research!

SURVEYING THE LANDSCAPE

Questionnaires and surveys (and let’s treat these two words as meaning essentially the same thing) are organized collections of questions used by researchers to get lots of answers and lots of data quickly and easily. The word survey in English is used in many different contexts: to observe someone carefully, to map something (like a piece of land), and to get a general view of something. And questionnaires basically do all those things.

From a scientific standpoint, these measurement instruments collect data in two ways. They may use a set of unrelated questions to gather demographics and biographical information about people in order to describe them. This is basically a collection of facts. Or they may be interested in measuring some abstract concept or construct (see Chapter 4 about construct-based validity) like attitudes and feelings and personality traits. When these abstract fuzzy concepts are being measured, instead of unrelated questions, surveys tend to use scales, groups of questions that all tap into the same thing. Scales combine the responses to the many questions (or the scores that represent them) to create a single total score that represents the concept.

Scales produce scores that tend to be more valid (because they each measure different “slices” or aspects of the construct) and tend to be more reliable (because any randomness in responses tends to be minimized) than asking a single question. That makes sense, right? Concrete concepts, like factual information (like where were you born?), can easily be measured with a single question, but getting at abstract constructs (like your happiness) would benefit from many observations combined.

The steps to developing a valid and reliable scale are similar to the steps we discussed in Part II for the various types of tests, like achievement tests and personality assessments. Here is a summary of that process. This particular series of steps is from the work of Robert DeVellis and his fine series of books on scale development.

Steps in Scale Development

1. Determine clearly what it is you want to measure. Define the construct. Describe someone who is high in this characteristic and someone who is low in this characteristic.

   a. For research variables that are abstract constructs, use theory as an aid to clarity. When you are measuring elusive phenomena that cannot be directly observed, you will benefit from using an existing conceptual framework developed over time by other researchers. No need to “reinvent the wheel” if other researchers have defined the construct for you already (and it matches the definition you wish to use).

   b. To help in writing items, use specificity as an aid to clarity. Do you wish to measure a broad, general construct or a context-specific construct?

2. Generate an item pool. Write lots of questions that seem to get at your construct. What can you ask to make the invisible visible? You’ll be revising this pool later and getting rid of items that don’t work well, so there is no need to worry about whether a question belongs on the scale for sure or not.

3. Determine the format for measurement. We will talk about some popular options later, but will the questions literally be questions, or, for example, will they be statements that people agree or disagree with? Or will they be tasks that a person must perform?

4. Have item pool reviewed. Are there experts in the topic, such as researchers who know a lot about what your research variable is and how it should be measured? Have them evaluate your questions. If you are hoping to measure social anxiety, consider emailing a researcher in that area and asking if they will review your pool of potential items. Or get some representatives from the population you hope to give your scale to and hold a focus group where they take your survey and you check in with them about how they interpreted your wording and what they were thinking about when they answered. (With a focus group like this, you don’t actually care what their answers were, just how they came up with them.)

5. Consider inclusion of validation items. In Step 6, you will try out your scale. For that step, you might include an existing measure that assesses something similar to your construct. Later, you can correlate scores from your scale with that other measure and if the correlations are high, that counts as validity evidence that you are probably measuring what you think you are. There are lots of ways to validate scores from a measure, but a correlation with a similar measure is pretty persuasive. For many, it is the strongest evidence possible (though measurement scientists could suggest better arguments that might be made).

6. Administer items to pilot sample. Give your survey to a sample of people similar to those that you are actually researching. If the sample is big enough, you can do reliability analyses and, maybe, produce correlational “validity coefficients” to see if you are measuring what you think you are.

7. Evaluate items. Can you increase the internal consistency reliability by removing a bad item or two? Or do you need to add some items to better measure your construct? Did your survey take so long that many respondents quit before the end? Did everything work technically with your online data collection? This is your chance to make changes to improve your scales and the entire survey instrument.

8. Produce your final scale. After you evaluate the data from your pilot administration, you can refine your scale, maybe tweak some wording here and there, or conduct a massive overhaul (which might force a second pilot stage). But eventually, you’ll be done and ready to use your very own handmade instrument for a research study!
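Step 7 asks whether dropping a bad item would raise internal consistency reliability. A minimal sketch of that check follows, assuming coefficient (Cronbach's) alpha is the reliability estimate being used; the function names and the pilot data are made up for illustration and are not from the chapter.

```python
from statistics import variance


def cronbach_alpha(responses):
    """Coefficient alpha for a list of respondents, each a list of item scores."""
    k = len(responses[0])  # number of items on the scale
    # Sample variance of each item across respondents
    item_vars = [variance([r[i] for r in responses]) for i in range(k)]
    # Sample variance of the total (summed) scale score
    total_var = variance([sum(r) for r in responses])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)


def alpha_if_item_deleted(responses):
    """Alpha recomputed with each item left out in turn."""
    k = len(responses[0])
    return [
        cronbach_alpha([[x for j, x in enumerate(r) if j != i] for r in responses])
        for i in range(k)
    ]


# Hypothetical pilot data: 6 respondents x 4 items, each answered 1-5.
# The fourth item was written to be unrelated to the other three.
pilot = [
    [5, 4, 5, 2],
    [4, 4, 4, 5],
    [2, 1, 2, 3],
    [3, 3, 3, 1],
    [1, 2, 1, 4],
    [4, 5, 4, 2],
]

print(round(cronbach_alpha(pilot), 2))
print([round(a, 2) for a in alpha_if_item_deleted(pilot)])
```

If alpha goes up noticeably when one item is left out, as it does for the fourth item in this invented data set, that item is a candidate for removal or rewording. A real pilot sample would need to be far larger than six people.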

HOW DO I ASK IT IN THE FORM OF A QUESTION (FOR $200, ALEX)?

There is some science and theory behind smart ways to format and compose questions for your survey. Your goal is to get accurate answers (to questions that ask about factual information) and honest answers (to questions that ask about feelings or behaviors). Here are some important suggestions for getting valid and reliable responses in your research.

Questions About Facts: Where Were You on the Night of the 7th?!

• Nonthreatening Questions

All questions either feel threatening (you better be careful what you say) or nonthreatening (no one cares what you answer). If you are gathering factual information and there is no reason that people might lie, research shows that it helps to

  ◦ Make the topic salient. If you are asked about something that was or is important to you, you are going to remember and retrieve the information more vividly. For instance, What were you doing during the morning of the day that your first child was born? is a question that is likely to get accurate answers from both moms and dads.

  ◦ Give reminders of the context of the question. Instead of Tell me about the concert you went to last year, you would probably get better info by saying something like, Tell me about the folk-rock festival you went to last year. Remember it rained that morning and the opening band was Uncle Dirtytoes?

  ◦ If you want to know about typical behavior, ask about recent behavior. Ask, How often did you eat breakfast this week? instead of How often do you eat breakfast?

Defining Saliency. Accuracy is helped by saliency.

sa·lien·cy noun \sey-lyən(t)-sē, sey-lē-ən(t)-sē\ the characteristic of being outstanding, striking, memorable


• Threatening Questions

Those responding to survey questions, especially about behavior, may be paranoid that their answer will be “bad.” That is, they may feel that there is a socially desirable response, an answer that other people think is morally correct or normal. So, they may be tempted to fib about their behaviors if they judge that they might be viewed negatively. Many strategies have been developed to ask about behaviors that people might not report truthfully.

  ◦ Use long questions. By reading through some detailed preliminary text describing the behavior in question, respondents can come to feel that their own behavior is acceptable (because, after all, the researcher is treating the question like there might be a variety of okay responses). Instead of, In the last 30 days, how often have you spanked your child? consider the effect of asking it this way: Parents do many different things to discipline their children and not all parents would agree on what the best disciplinary practices are. Some parents feel spanking is a useful disciplinary practice; other parents do not. In the last 30 days, how often have you spanked your child?

  ◦ Use “loaded” questions. With threatening questions, this is the one time when it’s okay to lead the witness. Questions worded in a way to influence a certain response are called loaded questions. That is, give them every encouragement to “admit” that they engage in the behavior. It is unlikely that someone who has never engaged in the behavior of interest will lie and say they have, whereas it is possible that, without encouragement, those who have done something that they fear is wrong will lie. (You know you have!) Here are some strategies for writing leading questions:

    • Claim that everybody does it. Most parents occasionally spank their children. How many times during the last week did you spank your child?

    • Use nonspecific authorities to justify the behavior. Most experts believe it is useful in some circumstances for parents to spank their children. How many times during the last week did you spank your child?

    • Provide good reasons for the behavior. Some college professors don’t seem to realize how busy students are these days and continue to require homework that is not much more than “busywork.” Have you ever found it necessary to ask a friend for a copy of their completed homework when the homework assignment was really just busy work?


Safely Telling the Truth

A very clever technique has been devised that allows people to be honest in answering a threatening question because absolutely no one will even know what question they answered! Here’s how random responding works. You are given two questions:

1. Were you born in January?
2. Have you ever shoplifted from a store?

The first question is nonthreatening, right? No one would lie when answering that question. The second question, though, is threatening and has a clearly socially desirable answer (no). So, you give these instructions:

Flip a coin. If it comes up heads, answer question number 1. If it comes up tails, answer question number 2.

In a room full of strangers, after following the directions, you could even give an answer out loud (we have done this in our classes), and you would certainly feel safe giving the answer on an anonymous survey. Because even if someone sees your answer, they won’t know which question you were answering.

But how does the researcher know which “Yesses” apply to which question? Here’s how the math works. Suppose 100 people take your survey and you get 34 Yesses. How many of those 34 Yesses were from people who were answering the shoplifting question? You know that about half of your respondents answered Question 1 about the month in which they were born. So about 50 people answered Question 1. By chance, we would expect 1 out of 12 people to be born in any given month (except for February, but never mind that). So, 1 out of 12 is about 8% and out of 50 people, we should have about 50 × .08 = 4 Yesses. Still with us? So, that means that of our 34 Yesses, about 4 of them were answering the first question. And, it follows then, that 30 of them were answering the second question! Remember that about 50 people answered Question 2. (And we are almost done . . .) Thus, 30 out of 50 people said they have shoplifted. That is 60%.

We have a fairly accurate estimate of shoplifting rates (though we made up this example) without anyone ever feeling at risk for being found out. A little complicated, but such a smart trick!
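The arithmetic in the coin-flip example can be sketched in a few lines of Python. This is only an illustration of the calculation described above; the function name and parameter names are our own, and it assumes a fair coin and a 1-in-12 chance of a January birthday.

```python
def estimate_sensitive_rate(total_yes, n_respondents,
                            p_innocuous=0.5, innocuous_yes_rate=1 / 12):
    """Estimate the proportion who would say Yes to the threatening question.

    p_innocuous: chance a respondent is sent to the nonthreatening question
                 (0.5 for a fair coin).
    innocuous_yes_rate: assumed Yes rate for the nonthreatening question
                        (born in January, about 1 in 12).
    """
    # Yesses we expect to have come from the nonthreatening question
    expected_innocuous_yes = n_respondents * p_innocuous * innocuous_yes_rate
    # The remaining Yesses are attributed to the threatening question
    sensitive_yes = total_yes - expected_innocuous_yes
    # Divide by the number of people expected to have answered that question
    return sensitive_yes / (n_respondents * (1 - p_innocuous))


rate = estimate_sensitive_rate(total_yes=34, n_respondents=100)
print(f"{rate:.1%}")  # prints 59.7%
```

The result is a shade under the 60% worked out by hand because the code uses 1/12 exactly rather than rounding it to 8%. With a real sample, the estimate is only as good as the assumptions about the coin and the birthday rate, and it gets noisier as the sample gets smaller.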

Questions About Attitudes: You Feel Me?

Follow the strategies described above for gathering concrete facts and you can do a dandy job collecting accurate information! But what if the things you are trying to measure are more abstract, not truth but opinion, not behavior but feelings? Those sorts of fuzzy questions appear on attitude surveys. And there are a bunch of good rules for writing valid attitude questions. Interestingly, as we will explore later, these “questions” are often, actually, written as statements, with which people are asked to agree or disagree. So don’t be confused if our example questions end with a period, not a question mark. Finally, all this advice assumes the questions are nonthreatening; if they are threatening, well, we covered that problem earlier in this chapter!

Here are a few particularly important guidelines for writing valid attitude questions:

• Word questions as simply as possible. You want to get to the point quickly and precisely with attitude items.

  Instead of: The Socrates Project, a high school mentoring program, has been widely adopted in Kansas and tries to keep at-risk students in school by assigning faculty to work with them. Overall, when I think about the variety of programs that could be available to our kids, I support the project.

  Try: The Socrates Project provides a faculty mentor for at-risk students. I support the project.

• Keep the question balanced. For these types of questions, you don’t want to lead people one way or another. Here’s what we mean . . .

  Instead of: Do you favor full inclusion for students with cognitive disabilities?

  Try: Do you favor or do you oppose full inclusion for students with cognitive disabilities?

• Avoid double-barreled questions. Poorly written questions sometimes ask about two different things at once.

  Instead of: The training was informative and interesting.

  Try: The training was informative. The training was interesting.

• Avoid the use of negatives in questions. A little extra “not” in a statement is easily missed and could lead to invalid responses.

  Instead of: Convicted felons should not be allowed to vote.

  Try: Convicted felons should be allowed to vote.

OUR FEELINGS ABOUT ATTITUDE: MOST LIKE LIKERT (BUT SOME LIKE THURSTONE MORE)

By far the most common format for attitude items on surveys looks something like this:

1. I like peanut butter.

Strongly Disagree    Disagree    Neither Agree nor Disagree    Agree    Strongly Agree
        1               2                    3                   4             5


Sometimes the words are different or the number of answer options is different or there is not a middle option, but this general strategy of providing a statement and asking people whether they agree or disagree on some continuum is called a Likert format. The approach is named after Rensis Likert (pronounced LICK-ert; there’s our contribution to the ongoing textbook authors’ debate on how to pronounce his name), a psychologist, who in the 1930s suggested both this format and the idea that one could create a group of such items (the idea of a scale) and combine responses to get a valid and psychometrically sound measure of feelings and attitudes. What makes an item a Likert type of item is that the answer options are symmetrical (an equal number of positively and negatively perceived answer options) and balanced (an effort is made to make the “distance” between each answer option about equal).

Scores from measures, of course, are used by researchers to conduct statistical analyses and answer research questions, and there is some debate among statisticians about the proper way for researchers to use scores from Likert items. The debate centers around the concept of level of measurement, which you will recall from Chapter 2 (which was literally all about that topic). A higher level of measurement gives more information and one of the highest levels of measurement is interval level. Interval-level measurement means that the intervals or distances between any two adjacent (side-by-side) scores are equal, and that this equality holds everywhere along the range of scores. This matters because if scores are at the interval level, then it makes sense to summarize them with means and standard deviations and compare them with t tests and analyses of variance and correlate them with each other and use all sorts of fancy statistics on them.

One might think, well, great, because Likert items are designed to be balanced in that equal-interval way so they are perfect for these types of statistical analyses. The problem is, though, that a single Likert item probably isn’t equal interval. Look at the example we used earlier. The distance (psychologically speaking) between strongly disagreeing and just disagreeing might be equal to the distance between strongly agreeing and just agreeing (or it might not, who knows?), but is that distance equal to the difference between agreeing and having a neutral attitude (or no attitude whatsoever) as indicated by “neither agree nor disagree”? Maybe, but probably not, right? So a single Likert item doesn’t seem to meet the definition of interval-level measurement and these powerful statistical methods that require that high level of measurement shouldn’t be used.

Likert, though, was all about combining a bunch of these items into a scale to create a total score that summarizes those item responses. That total score is not only likely more valid (see Chapter 4) and more reliable (see Chapter 3) than a single item’s score, but it also comes a lot closer to meeting the needs of statisticians in terms of being at that elusive equal-interval level. So, there are lots of good reasons to use scales (groups of items) to measure a construct instead of a single item, even if that single item is beautifully written!
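Combining Likert items into a total score is simple addition. The sketch below is one way to do it, with made-up answers to a hypothetical 10-item scale; the function names are ours. It also counts how many different totals are possible, which is one informal way to see why a scale score behaves more like a continuous measure than a single five-option item does.

```python
def likert_scale_score(item_responses, n_options=5):
    """Total score for one respondent across all items on a Likert-type scale."""
    for response in item_responses:
        # Each answer must be one of the numbered options, 1 through n_options
        if not 1 <= response <= n_options:
            raise ValueError(f"response {response} is outside 1..{n_options}")
    return sum(item_responses)


def possible_totals(n_items, n_options=5):
    """How many distinct total scores a scale of this length can produce."""
    lowest, highest = n_items * 1, n_items * n_options
    return highest - lowest + 1


# Made-up answers from one person to a hypothetical 10-item scale
answers = [4, 5, 4, 3, 5, 4, 4, 5, 3, 2]

print(likert_scale_score(answers))     # prints 39
print(possible_totals(1))              # prints 5
print(possible_totals(len(answers)))   # prints 41
```

Having more possible values does not by itself prove equal intervals; it is only part of the intuition behind preferring a scale total to a single item.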


What do real interval scores look like? If statisticians want us to measure at the interval level, fine, but how can we know we are at that level of measurement? We can’t see into the minds of our research participants (otherwise we could skip that measurement step). It turns out that the real key is that the scores we use for high-level statistical analyses need to be normally distributed. That is, they need to be from a distribution of scores that are shaped like the well-known bell curve, as defined pretty precisely in Chapter 5. Conveniently, if scores are normally distributed, then we can assume they are at the interval level (or so we are told by our statistics colleagues)!

The Thurstone Method

There is another approach to attitude measurement that statisticians find much more satisfying than the Likert approach in terms of producing interval-level scores. It is not used much these days, maybe because it takes twice as much work and twice as much time to develop. It’s pretty clever, though, and probably is best practice (even if it is the rare researcher who uses it), so let’s learn how to do things the Thurstone way.

Louis Thurstone, another social psychologist (like Likert), was the first to develop whole theories about what an attitude is and how to measure it. In the late 1920s, he suggested that one could create a bunch of attitude statements all referencing the same thing (an “attitudinal object”) and ask people if they agree or disagree with the statements. This is similar to what Likert suggested a few years later in terms of a scale of items that all sample the same construct. The difference is that Thurstone items are all worth a different amount of points, depending on how strongly stated the attitude is. With Likert, everything is worth the same: if I agree with just two statements that are only mildly strong, I score the same as someone else who agreed with two different items that were strongly worded. We actually have a different “amount” of the attitude, but are treated as if we have the same amount of the attitude. (See why statisticians think of Likert scores as not equal interval?)

Here’s how to create a Thurstone scale:

1. Choose an attitudinal object that you can write opinion statements about. (Your construct is “attitude toward ______.”)

   Example: Peanut butter

2. Write dozens of attitudinal statements about the object. As you compose these statements, try to create a range of attitude levels. Write some statements that are very positive, some that are moderately positive, some that are moderately negative, and some that are strongly negative.




   Examples

   Strongly positive: I love peanut butter more than almost anything!
   Moderately positive: I like peanut butter.
   Moderately negative: Peanut butter is expensive.
   Strongly negative: I really hate peanut butter!

3. Create a panel of “judges.” Judges are regular smart people who will tell you how strongly worded each of your statements is and whether they are positive or negative. Show them each statement. Classically, these are presented on cards—one statement per card. Each judge independently sorts the statements into piles based on how strong they believe the wording is. They are asked to try to form 11 piles—pile number 1 should be the most strongly negative, pile 11 should be the most strongly positive, the middle pile (6) can be thought of as neutral, and the other piles (2, 3, 4, 5 and 7, 8, 9, 10) should be for all those statements somewhere in between. In this way, they rate how many points each statement should receive. (You could also probably just give judges a list of statements and have them score them using the 1 through 11 rules, but a pure Thurstonian would have them sort cards, because this makes it easier to adjust one’s comparative thinking as one sees more and more statements.)

Piles: 1 (Strongly Negatively Worded), 2, 3, 4, 5, 6 (Neutral), 7, 8, 9, 10, 11 (Strongly Positively Worded)

Remember: It doesn’t matter whether judges agree or disagree with the statement; the judges are asked only to rate the statement in terms of the words used. How strong a statement is it?

4. Average the ratings across judges for each statement. These averages become the weights for the statements; they are the point value for each statement. Choose a range of items based on their point values. Real-world scales try to have at least 11 statements so that a full range of strengths is covered.

5. Now you are ready to give your scale to real people and measure their attitude toward peanut butter. Ask them which statements they agree with (there’s no range here from strongly disagree to strongly agree). Take the point values for all the statements they agreed with and add them up. That total is their score. This method produces scores that are more clearly at the equal-interval level, with higher scores indicating a more positive attitude.


Scoring works like this. Imagine someone fills out your survey this way, with brackets marking their answers:

Statement                                          Do you agree?    Weight
I love peanut butter more than almost anything!    Yes  [No]        10.3
I like peanut butter.                              [Yes]  No         8.4
Peanut butter is expensive.                        [Yes]  No         3.5
I really hate peanut butter!                       Yes  [No]         1.2
Total                                                               11.9

Because they endorsed (see their bracketed Yesses?) two items worth 8.4 and 3.5, this person would get a score of 11.9. By itself, 11.9 is hard to interpret, but it can be compared to other scores because this method creates variability with higher scores meaning more positive attitudes and lower scores meaning more negative attitudes. The variability and interval-level measurement of these scores create perfect variables to correlate with each other or compare groups on or whatever! (You might notice that even the most negative person when it comes to peanut butter still gets some points. If this bothers you, instead of a 1 to 6 to 11 scale, use a –5 to 0 to +5 scale.)
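Steps 4 and 5 of the Thurstone method (average the judges' ratings to get weights, then add up the weights of the endorsed statements) can be sketched as follows. The ten judges' pile numbers are invented so that the averages come out to the weights used in the example above; the variable and function names are ours.

```python
from statistics import mean

# Hypothetical pile numbers (1-11) assigned by ten judges to each statement.
# These ratings are made up so the averages match the example weights.
judge_ratings = {
    "I love peanut butter more than almost anything!": [10, 10, 10, 10, 10, 10, 10, 11, 11, 11],
    "I like peanut butter.": [8, 8, 8, 8, 8, 8, 9, 9, 9, 9],
    "Peanut butter is expensive.": [3, 3, 3, 3, 3, 4, 4, 4, 4, 4],
    "I really hate peanut butter!": [1, 1, 1, 1, 1, 1, 1, 1, 2, 2],
}

# Step 4: the average rating across judges becomes each statement's weight.
weights = {statement: mean(ratings) for statement, ratings in judge_ratings.items()}


def thurstone_score(endorsed, weights):
    """Step 5: add up the weights of the statements a respondent agreed with."""
    return sum(weights[statement] for statement in endorsed)


# The respondent from the example agreed with the two middle statements.
endorsed = ["I like peanut butter.", "Peanut butter is expensive."]
print(round(thurstone_score(endorsed, weights), 1))  # prints 11.9
```

To use the –5 to 0 to +5 version mentioned above, one option is to subtract 6 from every weight before summing, so that neutral statements are worth zero.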

DON’T IGNORE ME! STRATEGIES FOR INCREASING RESPONSE RATE

The best-written survey isn’t much help to researchers if no one fills it out. We have sat on so many student dissertation and thesis committees, and collaborated with so many colleagues on their own research, where someone experienced the nightmare (okay, nightmare is a slight exaggeration) of not getting a big enough sample that we have grown very cautious about expecting a good response rate. Consequently, we wanted to spend some quality time with you talking about a concept called Social Exchange Theory.

Social Exchange Theory says that humans will agree to a trade under certain conditions. And asking people to fill out your survey is an offer of trade: they give you data and you give them . . . what exactly? Well, Social Exchange Theory suggests that you do give them something in exchange. Read on.

An exchange is likely to occur when three criteria are met:

• Rewards are high.
• Costs are low.
• Trust has been established.


So, simply put, if it isn’t too much trouble, and I might get something out of it, and you seem friendly, I’ll fill out your darn survey.

Here are some ways to include perceived rewards in your request for participation. (By the way, these suggestions and this section on Social Exchange Theory are guided by the decades of writing by the distinguished researcher Don Dillman, a leader in modern survey methodology.)

• Provide information about the survey.
• Show positive regard.
• Thank the respondent.
• Ask for advice. “We need your input.”
• Support group values. “As a public-school teacher, we are asking you . . . ”
• Provide social validation. “Many have responded.”
• Provide incentives. This can be actual money or a gift card or some token of appreciation, like a key chain or sticker.
• Make the questionnaire interesting.
• Inform respondents that time or opportunity to respond is limited.

To keep the perceived costs low, consider these tips:

• Avoid demanding language. “You must respond.”
• Avoid using language that respondents will not understand.
• Include a direct link to the online survey in the email. This is the modern version of the classic advice to include a “stamped self-addressed envelope” for surveys sent through the U.S. mail. (And this is still done today.)
• Make the questionnaire appear short and easy to complete. The survey can either actually be short or formatted in a way that allows for quick responding.
• Do not ask for personal information that is not critical to the design of the study. Don’t automatically ask for all sorts of demographic information unless you have a research question about it or some other important reason.
• Emphasize similar requests to which participants have already agreed. “Thank you for agreeing to come here today.”


Finally, you want participants to trust you. This can be done by

• Providing a token of appreciation. This is like providing an incentive to increase the perception of rewards, except here the payment is given before the person takes the survey.
• Securing sponsorship. Will your school’s dean write a cover letter urging participation?
• Putting forth enough effort in constructing the survey to make the task appear important. Proofread and test your survey carefully before using it in real research.
• Assuring and ensuring confidentiality and security. This means promising privacy, confidentiality, and anonymity (if you can) and then actually making sure you keep that promise.

Calculating response rate. If very few people choose to participate in your study, that might create some concern. In addition to statistical requirements for a certain minimum sample size, you might also wonder: Is there something different about those who chose to take part compared to those who did not? And will that make generalizability of your results questionable? Answering this question is complex and response rate is just one factor in judging representativeness, but it is a useful start to compute the proportion of people who agreed to be in your study. The math looks like this:

$$\text{Response rate} = \frac{\text{Number of people who filled out your survey}}{\text{Number of people who you asked to fill out your survey}}$$

There’s no rule about how high a response rate you need for your research to be generalizable, but obviously a 65% response rate is better than a 5% response rate. And, by applying Social Exchange Theory, you can push that rate up as high as possible.
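If you like to let a computer do the dividing, the response rate formula can be sketched in a few lines of Python. This is only an illustration: the function name and the counts (130 completed surveys out of 200 invitations) are made up for the example and do not come from any real study.

```python
def response_rate(completed: int, invited: int) -> float:
    """Return the proportion of invited people who filled out the survey."""
    if invited <= 0:
        raise ValueError("invited must be a positive number of people")
    if not 0 <= completed <= invited:
        raise ValueError("completed must be between 0 and invited")
    return completed / invited


# Hypothetical counts, for illustration only.
rate = response_rate(completed=130, invited=200)
print(f"Response rate: {rate:.0%}")
```

The two checks at the top simply guard against impossible inputs, such as more completed surveys than invitations.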

Summary

There is a lot of science and best practice based on research when it comes to writing a survey. And real-life people like you that do their own research write their own survey instruments. We pretty much know how to design surveys, write questions that work well, and increase the response rate for our surveys. There is more than one approach for measuring the most common construct that researchers have in mind: attitude toward something. Among these approaches, the format you see the most by far is Likert. The Likert-style survey has those “Strongly Disagree” to “Strongly Agree” items.


Time to Practice

1. What are some of the benefits of conducting a pilot administration of your survey?
2. When researchers report the results of a pilot study for their instrument, they focus on validity and reliability evidence, but often do not even report the actual scores their sample got. Why not?
3. Why is it okay to ask questions in a leading way if you are afraid that respondents might judge the social desirability of their answers before responding?
4. How are Likert and Thurstone scales similar?
5. How are Likert and Thurstone scales different from each other?
6. Think about Social Exchange Theory. How might it explain why you are trying your best in this course?

Want to Know More?

Further Readings

• Story, D. A., & Tait, A. R. (2019). Survey research. Anesthesiology, 130(2), 192–202.
This is a great stand-alone manual for writing a survey. This covers basically everything you need to know!

• Speklé, R. F., & Widener, S. K. (2018). Challenging issues in survey research: Discussion and suggestions. Journal of Management Accounting Research, 30(2), 3–21.
Like the first reading listed above, this is a nice summary of survey research issues, written to introduce a particular field to the methodology. This article focuses on some of the common problems with survey research such as starting with a representative sample and the limitations of self-report.

And on Some Interesting Websites

• The Pew Research Center has a nice primer on surveys as research tools at https://www.pewresearch.org/our-methods/u-s-surveys/writing-survey-questions/.
• Tulane University’s School of Social Work is happy to explain Social Exchange Theory to us. And all we have to do is go to https://socialwork.tulane.edu/blog/social-exchange-theory. Seems like a fair trade.

And in the Real Testing World

Real World 1

Some real-world researchers use the Thurstone approach discussed in this chapter, which pleases the statistician in us. Some researchers in Brazil were curious how well children understood the concepts of various colors, like how blue is different from violet. They used Thurstone’s methods to discover, among other things, that kids understand blue a lot better than violet. Want to know more? Costa, M. F., Gaddi, C. M., Gonsalez, V. M., & de Paula, F. V. (2021). Psychophysical scaling method for measurement of colors concept in children and adults. Methods in Psychology, 5(8), 100077.

Real World 2

Does Social Exchange Theory still explain workplace relationships between employees and employers? These researchers wondered if this framework works for the modern labor experience, which includes frequent changes in jobs, duties, and bosses. They conclude that it is still a useful theory to explain the real world if one takes these variables into account. Want to know more? Chernyak-Hai, L., & Rabenu, E. (2018). The new era workplace relationships: Is social exchange theory still relevant? Industrial and Organizational Psychology, 11(3), 456–481.

Real World 3

What happens when respondents get tired or bored with a long survey and start responding in an “inattentive” way? Nathan Bowling at Wright State University and colleagues wondered if there was a point at which people taking a survey stop paying attention and conducted two studies to find out. They concluded that the further respondents got in a long survey, the more careless they became. There was some evidence that the carelessness might be less if the session was proctored or if respondents knew that carelessness would be checked. Bowling, N. A., Gibson, A. M., Houpt, J. W., & Brower, C. K. (2021). Will the questions ever end? Person-level increases in careless responding during questionnaire completion. Organizational Research Methods, 24(4), 718–738.


PART V

FAIR TESTING

The foundational principle in the world of measurement in the social sciences is that tests should be valid. There are a variety of ways to define that goal, but it essentially means that a given test works well for its intended purpose. In practical terms, and using some of our measurement jargon, the score should reflect the construct of interest.

If you ask the “citizen on the street” (or, probably safer to ask the “citizen on the sidewalk”), they’d tell you the same thing, except instead of valid, they’d say tests should be fair. But what is fairness? There are several ways to define fairness and these next two chapters break down the concept using these different definitions of fairness. We look at fairness in terms of

• Test bias. Are certain types of tests biased against identifiable groups of people, such as people of different genders or races, or with disabilities? How do test developers guard against bias? Are there whole approaches to testing that are, in and of themselves, unfair?
• Equity and universal design. Can we design tests that are equally valid for everyone? Equity, the idea that we should provide equal access to all, is a fairness concern in testing, as it is in education and, more broadly, in society. How does this goal play out in our measurement world?
• Laws and ethics. What laws exist to support fairness in testing? What ethical principles have been adopted by the various professional organizations to which measurement folks belong to advocate for fair and just testing?


15

TRUTH AND JUSTICE FOR ALL
Test Bias and Universal Design

Difficulty Index ☺ ☺ (something to think hard about)

LEARNING OBJECTIVES

After reading this chapter, you should be able to

• Explain the definition of bias as used by measurement professionals.
• Describe the ways that test developers protect against bias.
• Interpret item analyses that look for bias.
• Explain the concept of universal design of assessment.
• List several important principles for making assessments using universal design.

Here’s what we know: Tests are not uniformly fair and just, and in many cases, they can be downright biased against whole groups of people. Here’s what we don’t know: the best way to make them a lot better (or fairer). One approach, though, universal design, holds promise for producing scores that measure only what they are supposed to. And you know by now (thank you Chapters 1 through 14) that tests are only valid when they measure what they are supposed to.


Ask anyone you meet—your neighbor or a testing company CEO—what don’t you like about the use of standardized tests to make big decisions about children and adults? They’ll most likely say they worry about test bias. And the more one knows about measurement—both the reliance on standardized large-scale testing and the ways we assess students in the classroom—the more one may have reason to worry. This chapter talks about test fairness (which is a broader concept than bias). Think of fairness, the way we will use the term, as the extent to which a test works equally well for each person who takes it. But to use the terminology we like, a test is fair when it is valid for everyone.

THE $64,000 QUESTION: WHAT IS TEST BIAS?

This is the $64,000 question, because the definition of test bias gives us a clue as to how it might be remedied. Unfortunately, the definitions are complex and introduce factors that are often very difficult to deal with in a complex society such as ours. Let’s look at a combination of many different definitions to give us a flavor of what we are dealing with.

In its simplest form, test bias occurs when test scores vary across different groups because of factors that are unrelated to the purpose of the test. For example, imagine that males and females in the 11th grade have equal math ability. But what if we are administering a standardized math test to both males and females and they systematically differ in their average scores? Because we start with the premise that males and females have the same math ability, a test that finds a difference must be measuring something other than math ability. And when this imperfect test falsely creates apparent differences between broad groups, that suggests bias. Tests have been, and still are, accused of being biased against people of different genders, races, and ethnicities, and against groups of people defined by a variety of other characteristics such as having a disability or not speaking English as a first language.

Note that, by this definition, two things must be true for a test to be biased:

1. The test finds average differences in scores between relevant groups of test takers, and
2. There are not, in fact, average differences between the groups in the level of the trait.

So, if males and females in the 11th grade do actually differ in math ability, what then? For example, it used to be a stereotype that girls were better at verbal skills like reading and boys were better at number skills like math. Well, if there is a true difference based on gender, then a test that finds those differences is not biased. We


used to find such differences, by the way, and that is mainly where the myth came from that “math is hard” for girls. We don’t find those differences much anymore when we test kids, though, and this may be because we have improved our testing methods to avoid bias (no more math questions based on sports statistics). The differences may also have gone away because we do a better job of treating young girls and boys similarly when it comes to opportunities to learn science, computing, engineering, math, and related STEM areas and experiences. It may have been their attitudes toward math that affected math performance, not actual cognitive ability differences.

All’s Fair When Defining Fairness. In Chapter 16, we briefly review the professional Standards for Educational and Psychological Testing. Implicit in these standards are four views regarding fairness. The first is that to test fairly and without bias, you have to provide opportunities for testing in a secure and controlled environment. The second is that the test must have no inherent bias already built into it. The third is that, whatever construct is being tested, the test taker must have full access to taking the test in the first place, with no obstacles, regardless of such demographics as gender and ethnicity. And finally, the fourth is that test scores must be fairly interpreted.

Test Bias or Test Fairness?

So far, we’ve talked a bit about test bias and how it is present when a test differentially favors one group over another in the absence of real differences in ability. But test fairness is another story, one that we should introduce to you as well. Test fairness touches on the very sensitive issue of the use of tests and the social values that underlie that use. Whereas test bias is the result of an analysis that any of us can learn to apply, fairness is a question of values and judgment, both topics that most of us have to think long and hard about until we think we are doing the right thing. In other words, just because the test you use can distinguish between adults who are “smart” and those who are “not so smart” (and do it in an unbiased manner) does not mean that the judgments made based on these results provide a valid basis for placing people in different social categories or imposing interventions and all the implications that result, such as employment and social opportunities and so on. For that matter, when should these constructs be tested and for what purpose (and on and on and on)?

For example, let’s say you are the admissions officer for a college and you use a set of selection criteria such as SAT scores, the number of hours of extracurricular activities in which applicants have participated, or even whether applicants can afford to pay school costs. You want to be darn sure that there is some philosophy or set of value-based rules that underlies your model of how this admissions


information will be used. And this philosophy should be written down and explicit so anyone can read and understand it. For example, you may deem that SAT scores are important but will count for only 25% of the weight in determining if someone is admitted. Or you may deem that ability to pay is irrelevant or that cumulative high school GPA is important. And of course, you have to state why. Why should SAT scores count for only 25%? Why not more? Why not less?

We have studied validity in many different places throughout this book, and you can always think of a valid test (and the many ideas that are conveyed) as the gold standard, the goal we seek in our development and use of tests. Indeed, a valid test should be both unbiased and fair. And Samuel Messick takes our conventional definitions of validity and extends them one step further to what he calls consequential validity. Consequential validity is a simple and elegant idea. We need to be concerned about how tests are used and how their results are interpreted. We worry about construct validity and whether a test does what it is supposed to do, but we often forget about why we are using a test in the first place, the importance of how we interpret scores, and the consequences of this or that interpretation. And in these coming years of more reliance on test scores as a tool for assessing teaching effectiveness, now more than ever is the time to consider such a question (and answer). Notice how Messick has broadened our definition of validity: it is no longer enough to show that the score reflects the trait! The test must work to better people’s lives through its use, or it has low consequential validity.

MOVING TOWARD FAIR TESTS: THE FAIRTEST MOVEMENT

Plenty of people are dissatisfied with the “business” of test administration and want the way tests are designed and administered, and the way test scores are interpreted, to change radically, all in an effort to reduce bias due to gender, class, culture, or any other factor that results in invalid test scores and in the inappropriate use of them as well. One group, FairTest, the National Center for Fair and Open Testing (at http://www.fairtest.org/), is an organization that “works to end the misuses and flaws of standardized testing and to ensure that evaluation of students, teachers and schools is fair, open, valid and educationally beneficial.” Who could argue with that? But indeed, there have been plenty of arguments, even with the basic principles that FairTest espouses (and these are their exact words):

1. Assessments should be fair and valid. They should provide equal opportunity to measure what students know and can do, without bias against


individuals on the bases of race, ethnicity, gender, income level, learning style, disability, or limited English proficiency status.
2. Assessments should be open. The public should have greater access to tests and testing data, including evidence of validity and reliability. Where assessments have significant consequences, tests and test results should be open to parents, educators, and students. (This is similar to New York’s Truth in Testing law that you will read about in Chapter 16.)
3. Tests should be used appropriately. Safeguards must be established to ensure that standardized test scores are not the sole criterion by which major educational decisions are made and that curricula are not driven by standardized testing.
4. Evaluation of students and schools should consist of multiple types of assessment conducted over time. No one measure can or should define a person’s knowledge, worth, or academic achievement, nor can it provide for an adequate evaluation of an institution.
5. Alternative assessments should be used. Methods of evaluation that fairly and accurately diagnose the strengths and weaknesses of students and programs need to be designed and implemented with sufficient professional development for educators to use them well.

These are lofty goals, and even if, at times, they are very difficult to reach, they are well worth striving for.

MODELS OF TEST BIAS

Now that we have a working definition of test bias, let’s extend that discussion a bit and examine what models or frameworks have been presented that elaborate on this basic definition.

The Difference-Difference Bias

Well, this is surely the most obvious one, but one most fraught with some very basic problems. This model says that if two groups differ on some factor obviously unrelated to the test, such as gender, racial or ethnic group membership, or disability status (you get the idea), and their scores differ, then the test is biased. Now, what’s good about this model is that the assumption might be correct, and it is surely a simple way to look at things. For example, scores on a particular intelligence test may indeed reveal a difference between males and females, and maybe that is a result of test bias. But here’s the problem with this line of thinking: Maybe the difference it reflects is real, and there’s little, if any, bias. Wow, so much for test bias, right?


Let’s take another example. Imagine that there is a difference found in achievement test scores between students who grew up in an English-speaking country and those who did not. It is reasonable to assume this is bias, but it’s possible that the performance differences reflect real differences in knowledge, not just problems with understanding the high level of vocabulary on the test. Different cultures with different education systems may emphasize different areas of training and differ even on which chunks of knowledge are determined to be important to learn. So, these two groups of students may actually know different stuff or different amounts of the same stuff. If that’s the case, then the test is valid (except maybe in Messick’s sense of consequential validity). Of course, if it is ONLY the difficulty of understanding the language that causes the differences, then bias probably is a concern (because the test is supposed to measure achievement, not language skills). So differences in performance are useful to note, and it’s even more useful to find out the source of these differences. But claiming that a difference in performance is the result of a test’s bias without having evidence has no more basis in fact than claiming that the groups were differentially prepared to take the test. This is not easy stuff, especially when we recognize that testing and assessment are a multibillion-dollar industry that affects almost everyone in one way or another. In 1955, test sales were about $7 million for that year, and today that dollar amount is up to almost $1 billion each year—a nice payday. And who are the big boys and girls in this test-publishing market? Harcourt Brace, CTB McGraw-Hill, Riverside Publishing, and Pearson.

ITEM BY ITEM

Here’s a step in the right direction, given what we just said about differences in group scores. Instead of looking at overall group performance, perhaps an analysis should be done on an item-by-item basis, where group performance is examined for each item and any discrepancies studied further. And (of course, if you’ve figured out how we write these paragraphs so we sound like geniuses), that’s just what testing companies do! They use Item Response Theory information to make bias visible at the item level!

We talk about Item Response Theory in Chapter 6, but the basics are that the difficulty of a question (like on a multiple-choice test) is evaluated at each level of ability. And a type of analysis called differential item functioning (or DIF) can see if two groups of people differ on their item difficulty even after making sure those groups are equal on overall ability! So clever and (in retrospect) so obvious a way to see if bias exists. If students from different groups, say, males and females, perform differently on a single item, there might be bias. But if you find males and females with the same score on the total test (a reasonable measure of true ability)


and they still differ on that item, then bias is almost certain! Here’s a picture to see what DIF looks like. In Figure 15.1, we have plotted the difficulty level for two groups of students—white students and black students—for an imaginary item, Question 16. Along the bottom are the total scores each student got on the test; along the side is the item difficulty (the percentage of students who got the item correct). Notice that regardless of the ability level, the item was more difficult for black students than for white students. Because this difference exists when comparing students with the exact same level of knowledge, this chart is strong evidence of bias for this item.
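The logic behind a DIF check can be sketched in a few lines of Python. This is only a rough illustration, not the Item Response Theory analysis that testing companies actually run: it sorts test takers into bands by total score and then compares, within each band, the share of each group that got one item right. The function name, the 300-point band width, the group labels A and B, and every data point are hypothetical and made up for the example.

```python
from collections import defaultdict

BAND_WIDTH = 300  # hypothetical width of each total-score band


def proportion_correct_by_band(records, band_width=BAND_WIDTH):
    """Return {(band_start, group): share answering the item correctly}.

    records: iterable of (group, total_score, item_correct) tuples.
    """
    counts = defaultdict(lambda: [0, 0])  # (band, group) -> [correct, total]
    for group, total_score, item_correct in records:
        band = (total_score // band_width) * band_width
        cell = counts[(band, group)]
        cell[0] += int(item_correct)
        cell[1] += 1
    return {key: correct / total for key, (correct, total) in counts.items()}


# Made-up results on one item for two groups with matching total scores.
records = [
    ("A", 700, True), ("A", 750, True), ("A", 800, False), ("A", 850, True),
    ("B", 700, False), ("B", 750, True), ("B", 800, False), ("B", 850, False),
    ("A", 1300, True), ("A", 1350, True), ("A", 1400, True), ("A", 1450, True),
    ("B", 1300, True), ("B", 1350, False), ("B", 1400, True), ("B", 1450, True),
]

rates = proportion_correct_by_band(records)
for band in sorted({b for b, _ in rates}):
    gap = rates[(band, "A")] - rates[(band, "B")]
    print(f"Scores {band}-{band + BAND_WIDTH - 1}: "
          f"A={rates[(band, 'A')]:.2f}, B={rates[(band, 'B')]:.2f}, gap={gap:+.2f}")
```

If the gap keeps the same sign in every band, as it does in this toy data, the item is behaving differently for people with the same overall score, which is the pattern that flags an item for a closer look.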

FIGURE 15.1 Item Characteristic Curves for Question #16 Based on Race
[Line chart. Horizontal axis: Total Score on the Test (400 to 1600). Vertical axis: Chance of Getting Question #16 Correct (25% to 100%). Two curves: White Students and Black Students.]

On the Face of Things Model

Remember content validity, which was discussed in Chapter 4? This type of validity says that an item is a fair representation of the type of item that should be on a test. Sometimes experts are used to make this judgment. It can be the same with test items where biased items misrepresent ethnic groups, age groups, and gender characteristics. For example, what about a set of pictorial items that uses only white people as examples? That has nothing to do with the test items per se, but it sure suggests that nonwhite test takers aren’t exactly likely to identify with the people in the illustrations. Or what about the use of stereotyped language, such as reading comprehension passages that characterize men in certain occupations and women in others? Not good.

Historically, there have been questions on achievement tests that made assumptions about the income level of students and treated all students as having grown up in middle-class homes. For example, a question that expects students to know that the words cup and saucer go together takes it for granted that we all use cups and saucers when we have our morning tea and crumpets. This wasn’t even true in the 1950s when items like this were created. (Heck, even your authors, who have big-time state college salaries and sell literally hundreds of books a year, don’t use saucers with our cups.) Real-life test companies use panels of experts (often experts in the culture of racial, ethnic, and other groups) to nominate items like these that may be biased in a cultural sense. Those items can be removed, modified, or examined further (using methods like DIF) to see if the test would be less biased without them.

The Cleary Model

Here’s another statistical way to identify bias that makes a lot of sense and can identify bias in the entire test, not just in a single question. Way back in 1968, T. Anne Cleary defined test bias as a test that measures different things for different people. She devised a sophisticated correlational model to show this (see Chapter 5 for a description of correlations), but the fundamental assumption is that people with the same test scores (such as on the SAT) should do equally well when related to some external criterion (such as first-year grades in college). So if a test is not biased, then, for example, both black and white students’ SAT scores should predict equally well their first-year college grades. This is a regression model (a more advanced form of correlation), and it’s a pretty cool way to examine test bias. The key here is to look at tests that are supposed to predict the future and see if they do so equally well for all groups.

Playing Fair

It’s really not hard to inadvertently create items or tests that are biased. In fact, because we all have our own stereotypes, they can sneak pretty easily into any test that we devise or prevent us from seeing clearly how another test might be biased. So here’s our handy-dandy four-step procedure to ensure that the tests you create and the tests you use as created by others are not biased.

1. In the development phase, try to be as clear as possible in recognizing your own biases or stereotypes when it comes to different groups of people (be they different because of race, gender, ability, rival high school, whatever). Know the sample of test takers with which you are going to be working, and make every effort to eliminate your own biases from the design of items and tests. Have no biases? Congratulations! You are the only such person in the galaxy.

Test Maker, Know Thyself. Recall that quote about throwing stones and living in glass houses? As a test designer, giver, or scorer, the more you know about your own biases and prejudices (and yes, we all have them), the fairer you can be and the more aware you will be of the impact your own beliefs have on your


evaluations and assessments. And bias isn’t always or even usually driven by evil intent or prejudice; it is more often due to a naïve unawareness that one’s own way of thinking sits deep within a cultural set of expectations and assumptions.

2. If in doubt (and we should all have doubt about test bias to some extent), show the test to a buddy who belongs to the very group you feel might be slighted. Don’t think that the description of women is fair on the aptitude test you are about to administer? Ask a female colleague for her judgment.
3. If at any time you find your own test development efforts leaning toward bias, stop and start over, asking for help from someone more experienced than you. If you find that you are using a commercial or standardized test, write to the publisher. Any good publisher welcomes such feedback, because their primary aim is to please their consumers (you) and to develop a test that is as unbiased as possible.
4. Test takers are different from one another, not only in their abilities or personalities but in the ways they learn and what forms of assessment are most accurate for them. Is a written test unfair to a learner whose strength is in aural learning? We would not expect a visually impaired student to read items, right? Any test that would demand such would be biased and unfair, and you would be particularly insensitive to administer such a test. But there would probably not be any problem presenting the test orally and allowing the student to type or use some kind of assistive device.

UNIVERSAL DESIGN

Whether a test is systematically biased toward members of a certain group is certainly an obvious example of fairness or unfairness. A less obvious way to think about fairness, though, is the question of whether a test works for everyone. Might it be valid for you but not the person next to you? If you have a larger vocabulary or lower test anxiety or are more comfortable with multiple-choice formats or have better eyesight or get all the pop culture references in the “obviously” wrong distractors, then, well, yes, the same test is more valid for you. Notice these differences between you and your classmate are all characteristics that are not related to the construct of interest (whatever the test is supposed to measure, like knowledge or skill or intelligence or whatever). We expect differences in the measured trait across students to affect the score, but we want those differences to be the only thing that affects the score. Can we build a test that does a good job of measuring that construct and only that construct and works equally well for every person in the universe? Yes, and we call this approach universal design. (Would have been funny to say no here and just end the chapter, right?)


Riddle Me This: When Are a Building and a Test the Same?

The heading for this section presents a riddle. The answer is when you want them to be accessible to everyone. Starting in the 1960s there was a movement toward accessibility. It started with government buildings. It suddenly occurred to people that all citizens should be able to access a building to get to government offices regardless of a physical disability. So buildings began to be redesigned. Think wheelchair ramps and elevators and fewer doorknobs and wider hallways and signs written in Braille and so on. Like all good ideas, within a few decades the concept had expanded to include all buildings and housing, then all services and opportunities like employment and then educational access (learn about full inclusion laws in Chapter 16!) and then the way we teach, and, ultimately, as far as this chapter is concerned, even the way we design tests. Equal accessibility for a test means equal validity for all who take it.

Just as architects use universal design principles so their buildings work for everyone, professional test developers use similar design principles so their tests work for everyone. So they are universal. This design approach is called universal because one test should work for all. Wheelchair ramps are, actually, just ramps, right, because they work for everyone and (here comes a big idea) they are easier to use than stairs! They work better for everyone! A screwdriver with a more comfortable grip is easier for everyone to use, whether you have arthritis or not. A test developed using universal design guidelines is more valid for everyone, not just for the student with a learning disability.

Designing the Best Tests in the Universe

There are dozens of guidelines, some based on research, some based on philosophy, for building a test so that it works well for everyone (or is fully accessible, as architects would say). Here are a few:

In terms of the mechanics and formatting,

• Text formatting. Text should be flush to the left margin because that is easier to read for most Westerners. “Fully justified” text (which is spaced so that both the left and right margins are flush or straight, like parts of this book) is difficult for even expert readers to handle.
• Typefaces. Certain typefaces (what we all wrongly call fonts) work best for those who are visually impaired. Studies show that 14-point type (or even larger) actually increases test scores, compared to smaller sizes, for all students.
• White space. A lot of blank space around questions and images and other page elements is believed to lower test anxiety and also make information


clearer. And we mean a lot—like half the page should be white (or black or green or whatever color a blank computer screen is). Notice the pages of this book. There’s a lot of stuff on them—text, lists, figures, tables, boxes—but there’s actually a lot more emptiness. (That’s why it’s so relaxing and fun to read our book!)

• Contrast. The difference between text and background has been studied, and the recommendation (for paper tests) is to use off-white or light pastel colors with a nonglossy coating to prevent glare. Type should be black.

• Illustrations. Illustrations can be a problem if they cause competition for attention between picture and text. Black-and-white drawings will be clearest. It is okay for pictures and photos to be in color to create some energy, but avoid green/red combinations because some students may have color blindness.

In terms of the content and wording of tests, tests are accessible when

• Test takers share the same experiences and prior knowledge necessary to figure out what is being asked and what answer options are available.

• Regardless of the test taker’s development level, the complexity of the sentences and the vocabulary used should be appropriate. How is this possible? Write questions so they work for the lowest age that might conceivably take the test. Not even a super-smart test taker ever complained that the vocabulary used in questions was too simple.

• Break complex sentences into shorter sentences. Even if they are not complete grammatically. You know? Like the way people talk? The way we do here. Get the idea?

Adaptive Testing. There’s something relatively new on the scene (which to textbook writers means we have only been doing it for a few decades) called adaptive testing or computer adaptive testing (CAT). This technique uses computers (duh) and selects which items will be administered based on the estimated difficulty of items and the estimated ability of the test taker. In other words, based on how someone does across the first set of questions, the program will choose subsequent items to maximize reliability by really challenging them. This increases reliability dramatically and can even shorten the test for some. If you took an achievement test in the last few years online or on a computer, there’s a good chance you got a slightly different set of questions than your friend who took the “same” test. Interestingly, though customizing the test increases the fairness of the experience (by increasing reliability and, therefore, validity), this approach doesn’t really match the philosophy of universal design, which has the goal of using one, really good, version of a test for everyone. There is more than one way to make testing fair.
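For readers who like to see the moving parts, here is a minimal sketch of the adaptive selection loop described above. It assumes a Rasch (one-parameter) item model, a hypothetical five-item bank, and a deliberately crude ability update. None of those details come from this chapter, and operational CAT programs use more careful estimation. The function names (`p_correct`, `pick_next_item`, `run_cat`) are ours, invented for illustration.

```python
import math


def p_correct(theta, b):
    """Rasch-model chance that a test taker with ability theta
    answers an item of difficulty b correctly (assumed model)."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))


def item_information(theta, b):
    """Under the Rasch model an item is most informative when its
    difficulty is close to the test taker's ability."""
    p = p_correct(theta, b)
    return p * (1.0 - p)


def pick_next_item(theta, difficulties, used):
    """Choose the unused item that is most informative at theta."""
    candidates = [i for i in range(len(difficulties)) if i not in used]
    return max(candidates, key=lambda i: item_information(theta, difficulties[i]))


def run_cat(difficulties, answer_fn, n_items=3, start_theta=0.0):
    """Administer n_items adaptively.

    answer_fn(i) returns True if item i is answered correctly.
    The ability update is a simple shrinking step (a simplification,
    not a full maximum-likelihood estimate)."""
    theta = start_theta
    used = []
    for k in range(n_items):
        i = pick_next_item(theta, difficulties, used)
        correct = answer_fn(i)
        step = 1.0 / math.sqrt(k + 1)
        # Move the estimate up after a right answer, down after a wrong one.
        theta += step * ((1.0 if correct else 0.0) - p_correct(theta, difficulties[i]))
        used.append(i)
    return theta, used


# Hypothetical item bank: difficulties from very easy to very hard.
bank = [-2.0, -1.0, 0.0, 1.0, 2.0]
theta_est, items_given = run_cat(bank, lambda i: True)
print(items_given, round(theta_est, 2))
```

Two test takers who answer differently are routed to different items, which is why you and your friend may not have seen the same questions.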


Summary

Test bias and test fairness are surely not easy topics to deal with, but they have to be confronted if our practice of testing (on the increase, for sure) is going to mean anything and be used for the common good. As a practitioner or researcher (and even test taker!), you’re on the front lines of this discussion and get to choose your own position in this fight.

Time to Practice

1. Along with three of your classmates or colleagues, create a five-question test that is as biased as you can possibly make it. Be sure that you ask another four members of your class to do the same, or perhaps the entire class can be divided into such groups. Once you are done, trade questions with another group and answer the following questions:
   a. Most important, what personal stereotypes or personal biases may have been operating when the questions were constructed?
   b. Why are the test questions you are examining biased?
   c. Is it possible to change them to be less biased or unbiased?
2. What is the difference between test bias and test fairness?
3. What is the difference model of test bias, and what might be some of its shortcomings?
4. List five social implications related to the use of biased tests.
5. Is a culturally fair test of anything possible?
6. Many school districts allow for test accommodations, where students with disabilities take a slightly different version of a test (or the same test, but with modified rules). This makes the testing experience fairer and makes sense as a good option to consider. It actually is not recommended by those who advocate for universal design of assessments, though. Why not?

Want to Know More?

Further Readings

• Gould, S. J. (1996). The mismeasure of man. New York: Norton.

This is a revision of that classic book from 1981 that brought the issue of validity and testing to the public. The newer edition responds to concerns at the time of a new movement (or actually the revival of an older belief) in educational theory with some researchers arguing that races actually differ in natural intelligence and the differences on IQ scores were not entirely due to socioeconomic, cultural, and environmental factors. Gould provided a timely counterargument to that position by emphasizing again the many weaknesses in intelligence tests and that scores on a test are not real things and not the same as the trait they are meant to reflect.

• Elliott, S. N., Kettler, R. J., Beddow, P. A., & Kurz, A. (Eds.). (2018). Handbook of accessible instruction and testing practices: Issues, innovations, and applications. New York: Springer.

The many aspects of accessibility in education—past, present, and future—are explored in this book. For our purposes, the chapter on the universal design of assessments, Rose, D. H., Robinson, K. H., Hall, T. E., Coyne, P., Jackson, R. M., Stahl, W. M., & Wilcauskas, S. L. (2018). Accurate and informative for all: Universal design for learning (UDL) and the future of assessment (pp. 167–180), is invaluable.

And on Some Interesting Websites

• The good people at the Southern Poverty Law Center offer a very useful review of test bias among other fairness issues at http://www.tolerance.org/Hidden-bias with their Learning for Justice project. The center also goes into some depth discussing what test bias is, how it is learned, and how it is perpetuated.

• The National Center for Fair and Open Testing (or FairTest) at http://www.fairtest.org/index.htm offers oodles of information about testing bias and the many different projects in which FairTest is currently involved, including misuse of the SAT and ACT and how to improve student assessment.

• Project Implicit is where psychologists from Harvard, the University of Virginia, and the University of Washington developed Implicit Association Tests to measure unconscious bias. Go to https://implicit.harvard.edu/implicit/ and sign up for Project Implicit Social Attitudes, and find out a bit about your own stereotypes. Be warned, though, your results can be hard to take.

And in the Real Testing World

Real World 1

What to do about racial and cultural test bias, which is a major concern in the placement and treatment of children with disabilities? These are surely social, political, and emotional issues, but this author emphasizes how this concern has to be addressed from an empirical standpoint. This paper reviews research that evaluates the merits of the cultural-test-bias hypothesis and finds that there is little evidence to substantiate claims of bias for well-constructed, properly standardized tests. This was written by a respected school psychologist and researcher years ago and still rings true today.

Want to know more? Reynolds, C. R. (1983). Test bias: In God we trust; all others must have data. Journal of Special Education, 17, 241–260.

Real World 2

What does it mean to say that assessment of students is fair? An international group of scholars (whose work we have recommended already in this book) says that it is unfair to define fairness purely in terms of what measurement people say is fair, that is, a valid and reliable score. They argue that the entire culture of classrooms and the relationships among teaching, learning, and assessments must be taken into account.

Want to know more? Rasooli, A., Zandi, H., & DeLuca, C. (2018). Re-conceptualizing classroom assessment fairness: A systematic meta-ethnography of assessment literature and beyond. Studies in Educational Evaluation, 56, 164–181.

Real World 3

What does universal design look like in a college class? (Does your instructor follow universal design principles?) Researchers from Oklahoma and Colorado describe real-world use of universal design for learning (UDL) and assessment in college classrooms, which have become more and more diverse in the last decade. They argue that UDL shows promise in meeting the needs of all students and meeting goals of full accessibility.

Want to know more? Boothe, K. A., Lohmann, M. J., Donnell, K. A., & Hall, D. D. (2018). Applying the principles of universal design for learning (UDL) in the college classroom. Journal of Special Education Apprenticeship, 7(3), n3.

16

LAWS, ETHICS, AND STANDARDS

The Professional Practice of Tests and Measurement

Difficulty Index ☺ ☺ ☺ (moderately easy and provocative as well)

LEARNING OBJECTIVES

After reading this chapter, you should be able to

• Explain five major federal laws that govern the use of tests and measurement in the United States.
• List several important ethical rules of conduct for measurement folks.
• Identify the basic principles behind the standards promoted by professional organizations.

If you create, administer, or score tests or you interpret test scores or you are a butcher, a baker, or a candlestick maker, a doctor, a police officer, or a teacher—you need to know about the legal and ethical aspects of testing and assessment. This chapter introduces you to the federal legislation, professional standards, and ethics that affect the use of tests in our society. Any one of these topics deserves hundreds of pages of detailed examination and lots of discussion. We don’t have the room here for the detailed examination, but it’s critical for you to at least know the rules guiding appropriate conduct when it comes to high-stakes tests.

WHAT THE GOVERNMENT SAYS

Recently (well, starting about 50 years ago), Congress began legislating how federal funds should be used in the schools. Because essentially all states and schools are dependent on federal funds, these laws guide what schools can and cannot do and must do. We are interested in looking at the laws that regulate testing and assessment. Let’s look at a pair of acts that “require” yearly state testing for almost all students (collectively referred to as No Child Left Behind), legislation that supports full inclusion and access to education for students with disabilities, a state’s declaration that you “like big tests and you cannot lie,” and the law that protects the right to privacy for information on how well you are doing in college (including the grade you’re getting in this course!).

Essa and Nickleby: The Every Student Succeeds Act (ESSA) and No Child Left Behind (NCLB)

The purpose of the No Child Left Behind Act (NCLB), signed into law in 2002 by President George W. Bush (and based on the Elementary and Secondary Education Act of 1965), is “to close the achievement gap with accountability, flexibility, and choice, so that no child is left behind.” Sounds good, and no one would object to this end, but as always, the devil is in the details. And there were tons of details. NCLB had as a primary mission ensuring that all children (and it really is all) will meet or exceed their state’s level of academic achievement (as measured by state assessments). So to meet this requirement (and failure to meet it in theory can lead to losing federal funds), a huge amount of testing has to occur on a regular basis. Just think of every student in every public school in the United States being tested across math and reading (and science). It’s staggering. And these subjects must be taught by “highly qualified” teachers who must be “certified” by the state. Also sounds good, but what does “highly” mean? And how are teachers certified? And what should the standards be for passing? All of these criteria are left to each state to define how it wishes.

As you might expect, one major problem with these requirements is that they are very expensive to implement. Testing is expensive (our small state of Kansas spends about $10 million each year), as is the training and hiring of only “highly qualified” teachers. Also, the testing that is done is not diagnostic in nature, leading to remedial actions, but rather solely a summative assessment indicating a child’s position relative to his or her peers. The price tag for this and many more provisions was


so staggeringly high (and some of the philosophical assumptions of the legislation were so controversial) that state legislators complained to the federal government and a revision of the policy was passed in 2015, signed by President Barack Obama. The Every Student Succeeds Act (ESSA) provides more support for states for testing and other requirements and also relaxes the requirements about testing each student. Additionally, extra funding for programs and interventions is available for high schools with the most need (those having traditionally underserved populations and those with low graduation rates). The other benefit of ESSA, in the view of some, was that each state now had more choice in terms of how they help students, schools, and districts that score low on state tests. In terms of the tests themselves and how they are made and used, ESSA offers support to states in developing and using high-quality assessments and has a goal of helping teachers in using assessments to foster deeper learning among students.

All this talk about assessment is fine but not of much value unless we have some standards to assess outcomes. Even if children are tested in math and reading each year, it’s pretty important to know what a state department of education means when its representatives say that a child or a class or a school system has met the “standards.”

Most agree that schools that don’t teach well should be held accountable, but there are plenty of criticisms about the approach taken by ESSA and NCLB. The laws are proving very difficult to adhere to and maybe even impossible to manage. Here are some (and there are plenty) of the objections:

• The policy requires that almost all students must be tested, meaning that even children with low proficiency in English and with significant disabilities are expected to perform at grade level. Their scores, like everyone’s, are averaged to produce an indicator of how well the school system has done. Comparing schools and, of course, school districts becomes untenable because of the lack of comparability of the school populations. The “one-size-fits-all” notion is not reasonable. There is some flexibility here, as schools can choose a certain percentage of students not to be included, but that percentage is not big enough to exclude all those who will not “succeed” on these tests for reasons unrelated to the quality of teaching.

• Students are not only tested but are also tested using standardized tests, which, you may remember from earlier chapters, introduce other issues such as potential bias and teachers teaching to the test—a practice that certainly does not help students learn.

• Only public schools (and other schools that receive federal funding) are expected to meet these standards. Many private schools and charter schools are not, which creates some inequity and also helps advantage these for-profit schools that don’t have to deal with the hassle of meeting the standards.

• The bills have never been funded at the levels proposed by the federal government, leading to shortages in teachers and training for those who are not highly qualified, as dictated by the law. The lack of funds also prevents the use of standardized tests (which are very expensive) to chart progress.

• There are no rewards for doing well, only sanctions for not.

• What constitutes a “highly qualified” teacher is open to discussion. And even if the school districts know, such teachers are very expensive (they are usually senior faculty). Training good teachers to be better is also expensive. And so that schools can continue to operate, plenty of exceptions have been made, allowing unqualified teachers to teach temporarily or in emergency situations (like a shortage). Emergency situations like these, though, are permanent in many states.

Full Inclusion and Universal Access: The Education for All Handicapped Children Act and the Individuals With Disabilities Education Act

In 1975, President Gerald Ford signed Public Law (PL) 94-142, known as the Education for All Handicapped Children Act. The law is a statement of affirmation that children with disabilities have the right to a free and appropriate public education in the least restrictive environment (often referred to as LRE). In many ways, it was as much a move toward the civil rights of children with special needs as the passage of the Civil Rights Act was for other groups some 10 years earlier. And in 2010, the 35th anniversary of the law was celebrated by noting that more than 6 million students received special education services.

The four purposes of PL 94-142 were as follows (straight from the Education for All Handicapped Children Act of 1975):

• “Assure that all children with disabilities have available to them . . . a free appropriate public education which emphasizes special education and related services designed to meet their unique needs”
• “Assure that the rights of children with disabilities and their parents . . . are protected”
• “Assist States and localities to provide for the education of all children with disabilities”
• “Assess and assure the effectiveness of efforts to educate all children with disabilities”


The law was amended in 1997 and became the Individuals With Disabilities Education Act (IDEA). Why such a law? To begin with, more than 10 million children in the United States today have a variety of special needs—from mild to severe physical, cognitive, emotional, and intellectual disabilities that affect their ability to learn—and traditional special education programs were not meeting their needs. In many cases, these programs were not providing any educational experience. And as you might expect, for us to evaluate whether these children are receiving the services they need, a great deal of assessment needs to be undertaken—hence, the importance of knowing about these laws for those of us who study tests and measurement.

And has it worked? You bet. Almost 200,000 infants and toddlers and their families, and about 7 million children and youth, receive special education and related services to meet their individual needs. That’s about 14% of all students. Far fewer students received support before PL 94-142. Also, more children attend school in their neighborhood rather than in special schools or institutions (both of which are very expensive propositions), and high school graduation rates and postsecondary school enrollment are up as well. In other words, people who might not have had a chance at a fully integrated public life now have the opportunity.

IDEA has six principles, each of which is codified as law, and school districts are required to attend to them in the administration of educational programs. You or a family member or a friend likely has benefited from these services and you may know all about it.

First, all children are entitled to a free and appropriate public education. This means that special education and related services will be provided at public expense without any charge to the parents, and these services will meet standards set by the state and suit the individual needs of the child.

Second, evaluations and assessments will take place only to the extent that they help place the child in the correct program and measure their progress. The people doing the evaluations must be knowledgeable and trained, procedures for assessment must be consistent with the child’s levels of skill (remember our discussion about bias), and tests must be as nondiscriminatory and unbiased as possible. In other words, visually impaired children should not be required to read, and children with physical disabilities should not be required to manipulate objects if they cannot. The message? Find another way.

Third, an Individualized Education Program (IEP) will be developed and adhered to for each child. These written plans for children will be revised regularly and include input from parents, students, teachers, and other important interested parties.

Fourth, children with disabilities will be educated in what is called a least restrictive environment. This means they take classes with their nondisabled peers, and only those children who cannot be educated in regular education in a satisfactory fashion should be removed.

Fifth, both students and parents play a prominent role in the decision-making process. The more involved both parties are, usually, the better the decisions. Both should have input into the creation of the IEP, the parents should help educate the school personnel about their child, and the child can express his or her needs.

Finally, there should be a variety of mechanisms to ensure that these previous five principles are adhered to and, if not, that any disagreements between students, teachers, parents, and school can be resolved constructively. Those mechanisms for resolution have to be in place before students begin programs, not after. Some of the mechanisms include parental consent, mediation, parental notification, and parental access to records.

The downside? Well, there is always the cost, but arguments have been made that the cost of not acting and not involving such children is actually more expensive in the long run. Probably, one of the most pressing (and interesting) issues relating to the long-term effectiveness of this law is the accurate diagnosis and treatment of those children who need such services, as opposed to children who are misdiagnosed or simply should not qualify in the first place.

The Truth in Testing Law: High-Stakes Testing

Testing is very high stakes. Just ask a high school junior who is taking the SAT, a college premed major trying to get into medical school, or a mechanic trying to pass national standards for advanced engine repair. Or ask the parents of a third grader who may or may not qualify for special education support. And as the stakes get higher, the public becomes more wary of the process used to develop, administer, and score tests. And of course, people always want to know what part a test or score or item plays in the admissions process, be it admission to college or the Peace Corps.

More than 40 years ago (1979), New York state senator Ken LaValle helped ensure that a Truth in Testing law was passed that kind of turned the testing industry upside down. Up until this time, most of the huge test publishers created their tests in a vacuum, with very little regulation from the outside. Their efforts were not subject to scrutiny of any kind, and if a test question was lousy or a test lacked content or criterion validity, that was the publisher’s business and no one else’s. Truth in Testing requires that admissions tests given in New York State (such as the SAT) be available for review of their content and scoring procedures, that the test items be released to the public, and that there be due process for any student accused of cheating.


No More Required SAT, ACT! ☺

Well, after years of efforts on the part of the testing industry to foil some of the provisions of the Truth in Testing legislation, one unintended consequence was that nearly 1,000 institutions of higher learning, both smaller (such as Bryn Mawr College and Colby College) and larger (such as the University of Nebraska and Wichita State University), have decided not to require the test for undergraduate admissions, or to make it optional. The hardships of the COVID-19 pandemic, which made it difficult for many applicants to take these tests, forced many colleges to reevaluate their requirements.

Well, you can imagine the response of the test publishing companies (and some of it justified). For example, if a test item were made public, that test item would no longer be available in the pool of items from which the test is constructed. Also, the publishers’ copyright would be violated; after all, they “own” the items. And as you know by now, it takes a great deal of effort to create even one item that really works well—both in its power to test the topic and its ability to assess accurately and fairly the content being sampled.

The law incurred a gigantic expense for testing companies, but the folks who make the SAT, and other companies, adopted the policy nationwide. In the end, although it meant a more open approach to testing, it also meant a more expensive one for students, because someone had to pay for the extra development time for new items and such, because each item might never be used again. Ultimately, test developers for the most part have become more transparent in their procedures everywhere. You can still make big tests, you just can’t lie.

The Goldilocks Problem. A few years ago, there was a controversy over testing that is emblematic of testing practices all over the country: Just what counts as a pass? New York State requires that all students pass a math test (among others) before receiving a diploma, and those who earn a high-enough score receive a Regents Diploma (kind of like graduating with honors). Anyway, the math test was given; a huge proportion of students failed; and there was, of course, an uproar from teachers, parents, and students. The basic claim was that the items were just too hard, students could not use scratch paper, the instructions were confusing, and on and on. Whether those criticisms were accurate or not, the test was redesigned and readministered, and a huge number passed, far more than—you guessed it—teachers, parents, or students would have thought. The result is what many call a Goldilocks problem—too hard, then too soft, but never just right. What this exemplifies is that as testing continues to be a high-stakes activity, there needs to be more accountability than ever on the part of the test givers, because the implications of passing scores that are set too high (too few students graduate) or too low (too many unqualified students graduate) are far-reaching and very serious.
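To see how much rides on where the passing score is set, here is a small sketch using made-up scores (these are not the New York data, and the cut scores are hypothetical). It simply computes the share of test takers at or above each candidate cut score.

```python
def pass_rate(scores, cut_score):
    """Proportion of test takers scoring at or above the cut score."""
    passed = sum(1 for s in scores if s >= cut_score)
    return passed / len(scores)


# Hypothetical scores for ten students on a 100-point test.
scores = [48, 52, 55, 58, 61, 63, 66, 70, 74, 81]

# Try a "too hard," a middling, and a "too soft" passing score.
for cut in (75, 65, 50):
    print(f"cut score {cut}: {pass_rate(scores, cut):.0%} pass")
```

The same ten students go from mostly failing to mostly passing without learning a single new thing; only the cut score moved.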


FAMILY EDUCATIONAL RIGHTS AND PRIVACY ACT: WHAT IT IS AND HOW IT WORKS

You know that there is more recorded information on just about every U.S. citizen than you can shake a stick at. And although much of this is public, much of it is private and open only to the individual or their parents or caretakers. That’s the case with school information, and that’s the reason for the Family Educational Rights and Privacy Act (FERPA), also known as the Buckley Amendment, passed in 1974. The law applies to any public or private elementary, secondary, or postsecondary school and any state or local education agency that receives federal funds. Because almost all public schools and virtually all private schools receive some sort of federal funding, the law applies. It basically says that parents have certain rights to their children’s education records, and that these rights transfer to the student when he or she becomes 18 or begins educational programs beyond high school. And as you might suspect, most of this information has to do with test scores and other assessment outcomes. In sum, here’s what the law says:

1. Parents or students can inspect the students’ education records, but schools are not required to provide copies of records unless it is impossible for parents or eligible students to review the records (such as if they have to travel a long distance, which may be the case if rural records are centralized).
2. Parents or eligible students can request that their school correct records that the parents or students think are inaccurate. And if the school decides not to change a record, a formal hearing could be the next step. If the school still decides not to change the record, then a statement about the discrepancy must be placed in the student’s permanent record.
3. Schools must have written permission from the parent or eligible student to release any information from a student’s education record. But schools can share records with the following and for the following reasons:
   a. School officials with a legitimate educational interest, such as colleges
   b. Other schools to which a student is transferring
   c. Specified officials for audit or evaluation purposes
   d. Appropriate parties in connection with financial aid to a student
   e. Organizations conducting certain studies for or on behalf of the school
   f. Accrediting organizations
   g. Appropriate parties in compliance with a judicial order or lawfully issued subpoena
   h. Appropriate officials in cases of health and safety emergencies
   i. State and local authorities within a juvenile justice system, pursuant to specific state law

Good idea? Seems like it. No one wants their private information shared with commercial interests, which had been happening. But there are some not-so-savory aspects to the law as well. First, a huge amount of paperwork is generated by the implementation of the law. Second, lots of questions were left unanswered by the original law, such as secondary school students’ access to college letters of recommendation (the law has since been amended to prohibit this), and concerns were voiced back in the day when the Selective Service had access to student records (to be used for recruitment!). Even though this law is decades old, it’s still in play and still needs to be fine-tuned. On the upside, it is the Buckley Amendment that keeps your professor from telling other students what grade you got on the big test or (and believe it or not, college professors used to do this all the time) posting a big list of everyone’s grades on their office door!

THE RIGHT WAY TO DO RIGHT

It’s another one of those no-brainers: People who are the subject of testing should be treated fairly, treated with dignity, and in no way harmed. Unfortunately, the professions that use testing as a tool (everything from physicians to psychologists) have slipped along the way, which makes this section of the chapter very necessary to read through.

From Whence We Came

The best place to begin is with a little history. Take a look at these milestones as events that have helped shape the topic of ethical practices. Here are some dates and important events, all courtesy of the National Institutes of Health, and the full version can be found at https://www.niehs.nih.gov/research/resources/bioethics/timeline/index.cfm. We are including selected dates and events that especially relate to collecting data about people, which when you think about it, is another term for tests and measurement! The list was prepared by David Resnik, a bioethicist (who, coincidentally, was Bruce’s Halloween costume last year), and most of the wording is theirs. Words in parentheses are ours.

• 1620

Francis Bacon publishes The Novum Organon, in which he argues that scientific research should benefit humanity.


• 1830

Charles Babbage publishes Reflections on the Decline of Science in England, And Some of Its Causes, in which he argues that many of his colleagues were engaging in dishonest research practices, including fabricating, cooking, trimming, and fudging data.

• 1874

Robert Bartholomew inserts electrodes into a hole in the skull of Mary Rafferty caused by a tumor. He notes that small amounts of electric current caused bodily movements and that larger amounts caused pain. Rafferty, who was mentally ill, fell into a coma and died a few days after the experiment.

• 1912

Museum curator Charles Dawson discovers a skull at the Piltdown gravel bed near Surrey, U.K. It was thought to be the fossilized remains of a species in between humans and apes, “a missing link.” A controversy surrounded the skull for decades and many scientists believed it to be fake. Chemical analyses performed in 1953 confirmed these suspicions by showing that the skull is a combination of a human skull and orangutan jaw, which had been treated with chemicals to make them appear old. The identity of the forger is still unknown, though most historians suspect Dawson.

• 1939–1945

German scientists conducted morally abominable research on concentration camp prisoners, including experiments that exposed subjects to freezing temperatures, low air pressures, ionizing radiation and electricity, and infectious diseases; as well as wound-healing and surgical studies. The Allies prosecuted the German scientists for war crimes in the Nuremberg Trials. The Nuremberg Code provided the legal basis for prosecuting the scientists.

• 1944–1980s

The U.S. Department of Energy sponsors secret research on the effects of radiation on human beings. Subjects were not told that they were participating in the experiments. Experiments were conducted on cancer patients, pregnant women, and military personnel.

• 1945

Vannevar Bush writes the report Science: The Endless Frontier for President Roosevelt. The report argues for a major increase in government spending on science and defends the ideal of a self-governing scientific community free from significant public oversight. It advocates for investment in science and technology as a means of promoting national security and economic development.

Chapter 16 

■ 

Laws, Ethics, and Standards  

• 1947

The Nuremberg Code, the first international code of ethics for research on human subjects, is adopted.

• 1948

Alfred Kinsey publishes Sexual Behavior in the Human Male. Five years later, he publishes Sexual Behavior in the Human Female. These books were very controversial, because they examined topics which were regarded as taboo at the time, such as masturbation, orgasm, intercourse, promiscuity, and sexual fantasies. Kinsey could not obtain public funding for the research, so he funded it privately through the Kinsey Institute.

• 1956–1980

Saul Krugman, Joan Giles and other researchers conduct hepatitis experiments on children with intellectual disabilities at The Willowbrook State School. They intentionally infected subjects with the disease and observed its natural progression. The experiments were approved by the New York Department of Health.

• 1950s–1963

The CIA begins a mind control research program, which includes administering LSD and other drugs to unwitting subjects.

• 1961–1962

Stanley Milgram conducts his “electric shock” experiments, which proved that people are willing to do things that they consider to be morally wrong when following the orders of an authority. The experiments, which had several variations, included a learner, a teacher, and a researcher. The learner was connected to electrodes. If the learner gave an incorrect response to a question, the researcher would instruct the teacher to push a button on a machine to give the learner an electric shock. Teachers were willing to do this even when the dial on the machine was turned up to “dangerous” levels and the learners were crying out in pain and asking for the experiments to stop. In reality, no shocks were given. The purpose of the experiments was to test subjects’ willingness to obey an authority figure. Since then, other researchers who have repeated these experiments have obtained similar results. (The obvious mental distress the teachers experienced would be regarded as harmful today and similar experiments are conducted differently now.)

• 1964

The World Medical Association publishes the Declaration of Helsinki, Ethical Principles for Research Involving Human Subjects. The Helsinki Declaration has been revised numerous times, most recently in 2013.

Part V

■ 

Fair Testing

• 1973

After conducting hearings on unethical research involving human subjects, including the Tuskegee Institute study (read about that at https://www.cdc.gov/tuskegee/timeline.htm), Congress passes the National Research Act in 1973, which President Nixon signs in 1974. The act authorizes federal agencies (e.g., the NIH and FDA) to develop human research regulations. The regulations require institutions to form Institutional Review Boards (IRBs) to review and oversee research with human subjects.

• 1981

John Darsee, a postdoctoral fellow at Harvard, is accused of fabricating data for his heart research. Ultimately, dozens of his published papers were retracted.

• 1982

William Broad and Nicholas Wade publish Betrayers of Truth. The book claims that there is more misconduct in science than researchers want to admit and suggests that famous scientists, including Isaac Newton, Gregor Mendel, and Robert Millikan were not completely honest with their data. Their book helps to launch an era of “fraud busting” in science.

• 1987

A NIMH panel concludes that Steven Breuning fabricated and falsified data in 24 papers. (His work was on the use of drugs to treat hyperactivity in students with intellectual disabilities.) Breuning is convicted of defrauding the federal government in 1988.

• 1989

The NIH requires that all graduate students on training grants receive education in responsible conduct of research.

• 1994

Harvard psychologist Richard Herrnstein and Charles Murray publish The Bell Curve, a controversial book suggesting that one reason intelligence tests often find differences between races is that there are biological differences in intelligence. It reignites the centuries-old debate about biology, race, and intelligence. Most researchers believe the authors wrongly minimized the powerful role of environment (culture, experiences, educational opportunities) in explaining scores on intelligence tests.

• 1994

Roger Poisson admits to fabricating and falsifying patient data in NIH-funded breast cancer clinical trials in order to allow his patients to qualify for enrollment and have access to experimental treatments.

• 1994

Two scientists who worked at Philip Morris, Victor DeNoble and Paul Mele, testify before Congress about secret research on the addictive properties of nicotine. If the research had been made public, the FDA or Congress might have taken additional steps to regulate tobacco as a drug. Many states and individuals brought litigation against tobacco companies, which led to a $206 billion settlement between tobacco companies and 46 states. The scientific community also publishes more data on the dangers of second-hand smoke.

• 1995–2005

Dozens of studies are published in biomedical journals which provide data on the relationship between the source of research funding and the outcomes of research studies, the financial interests of researchers in the biomedical sciences, and the close relationship between academic researchers and the pharmaceutical and biotechnology industries.

• 1999

The U.S. NIH and OHRP require all people conducting or overseeing human subjects research to have training in research ethics.

• 2000

The U.S. Office of Science and Technology Policy finalizes a federal definition of misconduct as “fabrication, falsification or plagiarism” but not “honest error or differences of opinion.” Misconduct must be committed knowingly, intentionally, or recklessly.

• 2005

University of Vermont researcher Eric Poehlman admits to fabricating or falsifying data in 15 federal grants and 17 publications. Poehlman served a year and a day in federal prison and agreed to pay the U.S. government $180,000 in fines.

• 2010

The Lancet retracts, for fraud, a 1998 paper by Andrew Wakefield and colleagues linking autism to vaccines for measles, mumps, and rubella. Members of the anti-vaccine movement cited the paper as proof that childhood immunizations are dangerous. Vaccination rates in the U.K., Europe, and the U.S. declined after Wakefield’s study was published. Wakefield’s research had been supported by a law firm that was suing vaccine manufacturers. (No other published study has found a link between autism and vaccines.)

Almost every institution has a board (usually called the Institutional Review Board, or IRB) whose sole purpose is to review research plans and proposals to ensure that the possibility of any ethical violations is minimized. Usually, external funding agencies require a review and a pass from the board for the funding to occur. The boards are made up of researchers and members of the public. If you are doing research as a part of your college work, find out what your IRB obligations are.

As you can see, the primary ethical concerns when it comes to measuring humans have to do with not harming them and not lying about the results. Here’s a summary of the key ethical principles when it comes to tests and measurement:

1. Nothing should be done that harms the participants physically, emotionally, or psychologically. All testing formats and content should be carefully screened to be sure that such threats are eliminated or, if that’s not possible, minimized. For example, if you need to ask difficult questions about trauma or money problems or bad adult relationships, do so with great care and consideration.

2. When behavior is assessed, especially in a research setting, the test takers should provide their consent. And when children or adults who are incapable of providing such consent are used, then someone who cares for them (a parent, a relative of record, and so forth) should provide that consent.

3. If incentives are offered (such as paying people to complete a survey or an interview), the incentives should be reasonable and appropriate. For example, paying an adolescent $100 for 5 minutes of their time to talk about drug use is surely not appropriate. It seems coercive.

4. Unless there is a necessity otherwise (like when testing students or clients), test takers’ responses should be anonymous.

5. Not only should the research materials be anonymous, but as the researcher in your testing and assessment activity, you must ensure that the records will be kept in strictest confidentiality.

6. Although you want to share things with participants, you have to be judicious in your reporting. If you are a licensed school psychologist and you are administering a personality test, you want to be careful how much information you share with the child versus how much you share with the parents. You want to be informative and helpful but not provide more information than the child needs to know, because the information may end up being hurtful.

7. Assessment techniques should be appropriate to the purpose of the testing and appropriate to the audience being tested. What this means is that if you can’t find the right tool to use, then build your own or don’t use any.

8. Tests that are constructed under your watch need to have all the qualities of any assessment tool that we recognize to be most important. Put simply, they should be valid and reliable.

9. Finally, what you are doing needs to have an important purpose. Frivolous testing is exploitive and unfair.

HOW THE PROS ROLL

There is a big overlap between what is legal and what is ethical, of course. And both of these codes of behavior are similar to standards of professional practice. People who use tests and measurement as part of their job wish to behave legally, one would assume, and ethically, like all of us humans, but they also wish to perform competently when they do their work. In fact, one thing that makes a job a profession is that there is a set of standards for training and performance that have been established, usually by a group of people in that profession. Professional organizations often develop their own ethical codes, but they also follow a set of standards for performance that protect the reputation of the profession and promote quality work in their field.

The three major professional organizations that worry about the right way to use tests and measurement are the American Psychological Association, the American Educational Research Association, and the National Council on Measurement in Education. They have combined to write, maintain, and publish the Standards for Educational and Psychological Testing and have been doing so for about 60 years. Basically, this book provides the standards that all professionals involved in testing should follow in the development, selection, and interpretation of tests. This is true for the classroom teacher, criminal psychologist, or neuropsychologist. What is being evaluated does not matter within the framework of our discussion here, but how it is being evaluated and what is done with the results are of paramount importance.

What’s in the Standards?

You can order the standards (and the book is reasonably priced, not “textbook priced”) wherever books are sold, in real life or online (and you can do it directly at www.aera.net or www.apa.org), but we will summarize the content here and include some examples of some of the most important standards. There are three parts. Part I focuses on the key principles of validity, reliability, and fairness; Part II provides the professional standards for making tests and administering them; and Part III discusses the use of tests in various contexts, like schools, psychology, and the workplace.

Should the standards be required reading at the level of coursework you are completing? No. Should you know that there are far-ranging and important standards that everyone interested in tests and measurement knows about and should try to adhere to? Yes. And if you are training to be a professional measurement-type of person, you should probably have this sitting on your bookshelf. (Ask your instructor what a bookshelf is and what it was used for in the good old days.)

Let’s look at a few of the standards in the book. Standard 3.4 (in the chapter titled, “Fairness in Testing”) is this: “Test takers shall receive comparable treatment during the test administration and scoring process.” And Standard 3.13: “A test should be administered in the language that is most relevant and appropriate to the test purpose.” Now, you may see these two examples as being quite obvious, but spelling them out clearly and making them available to all goes a long way toward ensuring that they are adhered to and that no one can claim they didn’t know about them and their importance. Remember from Chapter 4 and our discussion on validity when immigrants in the 1800s were judged to have low intelligence because they didn’t speak English! That might seem so silly, but many today have similar thoughts, right? Those in the large and diverse testing community can certainly benefit from seeing the same information on a particular topic and, of course, being able to specify that these (as well as other) standards were followed.

Here are two other examples, from the chapter, “The Rights and Responsibilities of Test Users.” Standard 9.5 reads, “Test users should be alert to the possibility of scoring errors and should take appropriate action when errors are suspected.” And Standard 9.10: “Test users should not rely solely on computer generated interpretations of test results.” Again, these may seem obvious to you, but the fact that they are disseminated to a large audience of those involved in the testing world means that they are very likely to be attended to and that few test developers, administrators, scorers, and others can claim that they did not know what the best practices were. Did you ever automatically follow the instructions of the GPS mapping app on your phone and almost get lost or in an accident? If so, you can see how reminding test users to not automatically use a computer-generated test score interpretation might be useful.

AND MORE STUFF TO BE CONCERNED ABOUT (NO, REALLY)

There are tons of issues that relate to testing in general that you should be aware of: controversies, new theories, and ideas that have been discredited but just seem to have an ongoing life of their own.

What follows is a very brief review of these issues, just so you are aware of them. Many of them deal with the assessment of intelligence, but all have relevance for areas of tests and measurement. By the way, they make great starting points for a discussion. So here goes with four of the most interesting and most entertaining issues, but by no means all of them; the full list we could spend hours or days on.

The Flynn Effect: Getting Smarter All the Time

Here’s the big question: Do we get “smarter” as we get older? Our parents certainly don’t think so, but there sure are some interesting results that don’t suggest any clear answer. Much of this is based on the work of James Flynn, who in 1994 published a study that showed that scores on IQ tests over the past 60 years increased from one generation to the next (between 5 and 25 points). It’s no surprise that this phenomenon is called the Flynn effect.

What are some possible explanations for this? One surely has to be that test takers are actually smarter than earlier generations. That’s an entirely feasible explanation. Flynn thinks that people in general are becoming better problem solvers. Others believe that increased education and exposure to new ideas are the reason for the IQ increase. Generations ago, there was no Sesame Street and no preschool and no internet (insert your own observation here about how the Web has actually made us dumber). And others believe that better nutrition is the answer. Average height increases with each generation (as does life span), so why not intelligence scores as a result of better nutrition early in life?

Why the big deal? Well, if scores continue to increase, it means that the tests have to be renormed or restandardized every few years so that there will be consistent and accurate standards across all test takers. This is expensive, time-consuming, and controversial.
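
To see why rising scores force renorming, here is a minimal sketch in Python. It assumes a common IQ-style scale (mean of 100, standard deviation of 15), and every number in it is made up for illustration: the function name `iq_score` and the values in `OLD_NORMS` and `NEW_NORMS` are ours and do not come from any real test.

```python
def iq_score(raw_score, norm_mean, norm_sd):
    """Convert a raw score to an IQ-style standard score (mean 100, SD 15).

    The score says where the test taker stands relative to the norm group,
    so the same raw score can mean different things under different norms.
    """
    return 100 + 15 * (raw_score - norm_mean) / norm_sd


# Hypothetical norm groups (illustrative values only, not real test data).
# The newer norm group answers, on average, two more items correctly.
OLD_NORMS = {"mean": 50.0, "sd": 10.0}
NEW_NORMS = {"mean": 52.0, "sd": 10.0}

raw = 50.0  # one test taker's raw score

iq_old = iq_score(raw, OLD_NORMS["mean"], OLD_NORMS["sd"])
iq_new = iq_score(raw, NEW_NORMS["mean"], NEW_NORMS["sd"])

print(iq_old, iq_new)  # prints: 100.0 97.0
```

In this made-up case, the same raw performance that looks exactly average against the older norm group lands 3 points below average against the newer one. If the norms are never updated, scores drift upward without anyone actually standing higher relative to their peers.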

Teacher Competency: So You Think You’re Ready for the Big Time?

It’s a common lead story in the New York Times and Wall Street Journal (and it’s always all over cable news). Because of budget pressures, governors and several state officials are considering doing away with the traditional tenure system for public school teachers, under which, critics report, it is almost impossible to replace a teacher. Of course, at the core of this issue is assessing the competency of a teacher, and is that ever a difficult task. How does one go about assessing competence in such a professional when everything from interaction with students to knowledge of a particular subject area to the ability to execute this area of knowledge needs to be carefully weighed and considered? Ever have a terrific teacher? Ever try to verbalize why you thought they were terrific?

Could we create a test that evaluates teacher skill? Sure, probably, and some have tried. But the problem is that most states that want to evaluate teachers use tests that weren’t designed for that purpose, like the No Child Left Behind state-mandated tests their students take. And by now, you know that it’s not valid to use a test for other than its intended purpose.

School Admissions: Sorry, No Room This Year

Welcome to the third controversy. Just what criteria should be used to admit students to schools? For example, should college admission to a publicly supported state school be open regardless of high school performance? Should there be special criteria for some students, based on race, gender, experience, or social class, for admission to graduate and professional schools such as medical or business schools?

Lots of this discussion started in the mid-1970s when the University of California medical school reserved around 15% of its admissions for students who were disadvantaged (opening a can of worms in efforts to define that word, for sure). One student in particular, Allan Bakke, applied in 1973 and 1974 and was rejected even though his entry test scores were higher than other applicants’. Claiming what came to be known as reverse discrimination, he took the school to court, arguing that the admissions policy was discriminatory because it was based in part on race (because a disproportionate number of disadvantaged students were other than white).

Well, a couple of court decisions later (in the Superior Court of the State of California and the California Supreme Court), the Supreme Court of the United States ruled in a 5-to-4 decision that Bakke was discriminated against and had to be admitted. That was the end of Bakke’s personal controversy but surely not the end of the controversy over preferential admissions and hiring of job candidates based on factors other than qualifications and performance.

Cyril Burt: Are We Born With It?

One of the hallmarks of science is that results be considered truthful, regardless of the topic, the training of the researcher, and just about any other condition. You may have seen in recent years how the veracity of some data was brought into question. Well, this questioning of data’s provenance and validity (in every sense of the word) has been going on far longer than just a few years or decades.

One luminary in the field of educational psychology, Cyril Burt, was accused in 1976 in peer-reviewed journals of fabricating data (he died in 1971). Burt was best known for his work in the specialized statistical technique of factor analysis and his investigations of the role that genetics or inheritance plays in intelligence. He very cleverly designed a study where twins were examined for similarities in intelligence. He compared those findings to less “related” siblings, as well as to other pairs of children, and using correlational analysis, concluded that indeed genetics plays a role, since the more related children were, the more strongly their intelligence scores would be related.

Upon Burt’s death, several scientists started to investigate his use of twins in his research. These scientists had some doubts about the statements Burt had made about the inheritability of intelligence. These scientists tried to locate Margaret Howard and Jane Conway, both (claimed) research assistants of Burt’s, but neither could be located, nor could records of their work with Burt. This led to other investigations about the surprising consistency of the findings that Burt reported. They were just too consistent (remember real-life research has a lot of randomness to it) and too good to be true. Also, critics doubted whether anyone could find and assess as many twins as Burt claimed he did. But his work was not without supporters who could explain the consistencies. Today, his place in the study of intelligence remains controversial, although his work on statistical techniques continues to be useful.
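
Two ideas in this story can be sketched in a few lines of Python: pairs who share more should show more strongly related scores, and correlations from real samples bounce around from one sample to the next. This is a simulation under our own assumptions, not a reconstruction of Burt’s data or methods. The names `pearson_r` and `simulate_pairs`, the `shared_weight` setting, the seed, and the sample sizes are all hypothetical choices made for illustration.

```python
import math
import random


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def simulate_pairs(n_pairs, shared_weight, rng):
    """Simulate scores for n_pairs pairs of children.

    shared_weight (between 0 and 1) is a made-up knob for how much the
    two members of a pair have in common; it is not a genetic parameter.
    """
    first, second = [], []
    for _ in range(n_pairs):
        shared = rng.gauss(0, 1)  # what the pair has in common
        a = shared_weight * shared + (1 - shared_weight) * rng.gauss(0, 1)
        b = shared_weight * shared + (1 - shared_weight) * rng.gauss(0, 1)
        first.append(100 + 15 * a)
        second.append(100 + 15 * b)
    return first, second


rng = random.Random(2023)  # fixed seed so the run is repeatable

# Idea 1: pairs who share more show a stronger correlation.
r_more_shared = pearson_r(*simulate_pairs(2000, 0.7, rng))
r_less_shared = pearson_r(*simulate_pairs(2000, 0.3, rng))
print(round(r_more_shared, 3), round(r_less_shared, 3))

# Idea 2: the same kind of pairs, sampled three times, gives three
# somewhat different correlations because of sampling randomness.
sample_rs = []
for n_pairs in (20, 40, 60):
    xs, ys = simulate_pairs(n_pairs, 0.7, rng)
    sample_rs.append(pearson_r(xs, ys))
print([round(r, 3) for r in sample_rs])
```

Run it and the three small-sample correlations in the last line will not match one another, even though the pairs were generated the same way each time. That ordinary wobble is what critics expected to see in repeated samples, and why findings that stay nearly identical across samples raise eyebrows.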

Summary

So many rules and regulations! And so many ethical standards! Tests are a big deal and affect the lives of us all. Some “tests” don’t really matter too much, such as a quiz on whether you are a good or bad friend that you might find in a magazine at the checkout stand. But some tests are truly high stakes and how you score can affect your present and your future. Think IQ tests and college admissions exams and high school graduation tests and lie detector tests and psychological evaluations. All of us should sleep better at night knowing that, at least in this case, professionals in measurement and government officials and concerned citizens are looking out for us. That’s nice.

At absolute best, all the legal issues that surround testing are a challenge. And in the wisdom of our governance system and measurement and testing professional organizations, there are laws and rules that some of us feel are right on target in terms of being fair, whereas others believe that they are too encroaching on individual freedoms or even commercial interests. The answers? There are none. Surprised? Don’t be. In these complex times, such complex matters need to be addressed by all parties, including test makers, test takers, and those who make the laws, to help ensure that questions of fairness and equity are raised and at least addressed, if not fully answered.

What will make you a professional who uses tests, as opposed to an amateur who uses tests, is that you are aware of the issues, the different sides to the debates, and the rules that other professionals have suggested.

Time to Practice

1. Contact a local, state, or federal representative (be it a senator, member of Congress, or legislative aide) and ask that person how they think NCLB has benefited children in the United States. This should take about 4 weeks from the time of inquiry to an answer. Once that amount of time has passed, compare your answers with your classmates’ answers.

2. What are some of the advantages of IDEA? What does the law have to do with inclusion?

3. What are the relative merits (and demerits) of Truth in Testing laws such as the one in New York State?

4. If you were the director of testing for your college, what would be five principles on which you would operate to ensure that testing procedures were fair? Don’t be afraid to use what you have learned in other chapters, and think broadly.

5. Visit your school’s home page online and search, either in a search box or an A-to-Z index, for information on your school’s institutional review board. You may need to search for “Center for Research” or “Human Subjects Committee.” Find the email address or phone number for contacting the review board and share it with your class. What are some of the important standards your review board upholds?

6. If you needed to follow just one ethical guideline for how tests are used, which would you choose? Explain why.

7. The standards for testing and measurement were developed by both the American Psychological Association and the American Educational Research Association, so they must agree about how tests should be used. But are there differences in how these different professions do actually use tests?

Want to Know More?

Further Readings

• Aldersley, S. (2002). Least restrictive environment and the courts. Journal of Deaf Studies and Deaf Education, 7, 189–199.

Want more history? This paper provides a description and an analysis of how the federal courts in the United States have interpreted the least restrictive environment clause in IDEA. It focuses on deaf children and traces the legislation and its impact.

• Eignor, D. R. (2013). The standards for educational and psychological testing. In K. F. Geisinger, B. A. Bracken, J. F. Carlson, J.-I. C. Hansen, N. R. Kuncel, S. P. Reise, & M. C. Rodriguez (Eds.), APA handbook of testing and assessment in psychology, Vol. 1. Test theory and testing and assessment in industrial and organizational psychology (pp. 245–250).

How do the standards for professional competency change over time? This book chapter examines the history of the Standards for Educational and Psychological Testing. One change over time the author identifies is that the audience for the Standards has broadened considerably.

And on Some Interesting Websites

• Take a look at the National Center for Fair and Open Testing at http://www.fairtest.org/. This site is full of information about standardized testing and is very opinionated in its presentation, but you will certainly learn to identify the important issues surrounding testing.

• Right from the federal government, a summary of the Every Student Succeeds Act at https://www.ed.gov/essa. Like it or not, here’s where you need to go to get an accurate accounting of what’s involved.

• The State of North Carolina (among others) has a code of ethics for the use of tests in its schools. Does your state? Find out what rules apply in North Carolina at https://center.ncsu.edu/ncaccount/pluginfile.php/1543/course/section/564/Testing%20Code%20of%20Ethics.pdf or by searching for “North Carolina Testing Code of Ethics.”

And in the Real Testing World

Real World 1

In terms of real-world measurement and testing in the classroom, what does it mean to design classroom assessment that will be “fair”? Scholars in Canada and Iran collaborated on a paper that attempts to make concepts of fairness more concrete, especially in the context of how teachers assess their own students. The authors distinguish between justice and fairness and argue that it is not only important philosophically to be fair, but fairness actually can affect student learning.

Want to know more? Rasooli, A., Zandi, H., & DeLuca, C. (2019). Conceptualising fairness in classroom assessment: Exploring the value of organizational justice theory. Assessment in Education: Principles, Policy & Practice, 26(5), 584–611.

Real World 2

Accountability movements in education, the implementation of standards, and progressive changes in how we assess learning have been motivated by, among other factors, concerns about performance and opportunity gaps between majority populations of students and traditionally marginalized or vulnerable populations, such as children with disabilities and those from culturally and linguistically diverse backgrounds. The authors review historic attempts at equality and conclude that, so far, there is little evidence of positive change.

Want to know more? Cramer, E., Little, M. E., & McHatton, P. A. (2018). Equity, equality, and standardization: Expanding the conversations. Education and Urban Society, 50(5), 483–501.

Real World 3

Most of us either don’t want to admit they exist or are totally perplexed about what to do with them. We’re talking about the education of homeless children. This article summarizes the educational rights of homeless children and youth afforded by the Stewart B. McKinney Homeless Assistance Act of 1987. The author describes the educational problems these children have to deal with, including multiple movements between schools and barriers to parental involvement, to name just two of many. This article is several decades old, but, sadly, is still relevant.

Want to know more? Rafferty, Y. (1995). The legal rights and educational problems of homeless children and youth. Educational Evaluation and Policy Analysis, 17, 39–61.

Real World 4

Things are always changing in the social and behavioral sciences, and legislation that drives policy that drives practice is not any different. The Every Student Succeeds Act (ESSA) is meant to help ensure that all students achieve at high levels, especially children from low-income families. This article discusses the principles underlying the law and the obligation of states to clarify what they expect students to learn, the expectations for schools, the requirement that states assess regularly, and the requirement that information about schools, including assessment results, be made available to educators, students, parents, and communities.

Want to know more? Chenoweth, K. (2016). ESSA offers changes that can continue learning gains. Phi Delta Kappan, 97(8), 38–42.

APPENDICES

APPENDIX A

The Guide to Finding Out About (Almost) Every Test in the Universe

Okay, not every test but just about every one you will need.

There are several places where you can find information about the tests mentioned in this book and the thousands of other tests that are available. First, you always have the volumes and volumes of information available in the library, including professional journals and books. Scholarly articles are published all the time that look at the validity and reliability of tests and instruments and measurement approaches. Here are three that we found in just a few minutes of searching on the Web.

There IS a big old reference book that comes out every few years in hard copy: The Mental Measurements Yearbook. Here’s the citation for the latest edition:

Carlson, J. F., Geisinger, K. F., & Jonson, J. L. (Eds.). (2021). The twenty-first mental measurements yearbook. Buros Center for Testing, University of Nebraska–Lincoln.

The MMY (as it’s called by insiders like you and your authors) contains reviews of hundreds of tests (all the famous ones and the not-so-famous ones) and speaks specifically to their validity and reliability and other psychometric properties. It’s almost surely in your library, or they can get it through interlibrary loan. It’s about 1,000 pages and costs about $200, and you can see a sample review in Figure A.1.

And you can find even more information in Tests in Print IX (also from the Buros Center for Testing at the University of Nebraska). It’s a massive (more than 1,200 pages) list of every known published test. It’s just a bibliography (which also includes some scholarly publications about testing and a list of reviews of tests), but it’s as comprehensive as you could wish for. Here’s the citation, and your library probably has this too:

Anderson, N., Schlueter, J. E., Carlson, J. F., & Geisinger, K. F. (Eds.). (2016). Tests in print IX. Buros Center for Testing.

FIGURE A.1 Part of a Sample Review From the MMY

The Buros Center for Testing covers 18 categories of testing, with hundreds of tests in each category:

• Achievement
• Behavior assessment
• Developmental
• Education
• English and language
• Fine arts
• Foreign languages
• Intelligence and general aptitude
• Mathematics
• Miscellaneous
• Neuropsychological
• Personality
• Reading
• Science
• Sensory-motor
• Social studies
• Speech and hearing
• Vocations

How’s that for a nice bunch of choices?

ALL THE BIGGIES

Want a list right here and now of pretty much all the major commercially or professionally produced tests and assessments? Here’s one we put together:

Tests You Take During Kindergarten Through 12th Grade

• ACT American College Test
• COOP Cooperative Admissions Examination Program
• GED General Educational Development Test
• HiSET High School Equivalency Test
• HSPT High School Placement Test
• ISEE Independent School Entrance Examination
• PSAT Preliminary Scholastic Aptitude Test
• SAT (formerly known as the Scholastic Aptitude Test)
• SAT Subject Tests
• SHSAT Specialized High School Admissions Test
• SSAT Secondary School Admission Test
• TASC Test Assessing Secondary Completion

Tests You Take as You Start College

• CSU ELM California State University Entry-Level Mathematics
• CSU EPT California State University English Placement Test
• IELTS International English Language Testing System
• TOEFL Test of English as a Foreign Language
• TOEIC Test of English for International Communication


Tests You Take to Get Into Graduate School

• CBEST California Basic Educational Skills Test
• DAT Dental Admission Test
• GMAT Graduate Management Admission Test
• GRE Graduate Record Examination
• LSAT Law School Admission Test
• MAT Miller Analogies Test
• MCAT Medical College Admission Test
• OAT Optometry Admission Test
• PCAT Pharmacy College Admission Test
• WTMA Wiesen Test of Mechanical Aptitude

Intelligence Tests

• Cattell Culture Fair
• DAS Differential Ability Scales
• Draw-a-Person Test
• K-ABC Kaufman Assessment Battery for Children
• Leiter International Performance Scale
• Miller Analogies Test
• Multidimensional Aptitude Battery II
• Otis–Lennon School Ability Test
• Raven’s Progressive Matrices
• Stanford–Binet Intelligence Scales
• Sternberg Triarchic Abilities Test
• WAIS Wechsler Adult Intelligence Scale
• WISC Wechsler Intelligence Scale for Children
• Wonderlic Test
• Woodcock–Johnson Tests of Cognitive Abilities
• WPPSI Wechsler Preschool and Primary Scale of Intelligence


Personality and Psychological Tests

• BDI Beck Depression Inventory
• Bem Sex-Role Inventory
• California Psychological Inventory
• DSM Diagnostic and Statistical Manual of Mental Disorders
• EQSQ Test
• Eysenck Personality Questionnaire
• Forte Communication Style Profile
• Hand Test
• Hare Psychopathy Checklist
• HBDI Herrmann Brain Dominance Instrument
• HEXACO Model of Personality Structure
• Holland Codes (RIASEC)
• Inwald Personality Inventory
• IPIP International Personality Item Pool
• Keirsey Temperament Sorter
• MBTI Myers-Briggs Type Indicator
• MCMI Millon Clinical Multiaxial Inventory
• MMPI Minnesota Multiphasic Personality Inventory
• NPA Newcastle Personality Assessor
• Revised NEO Personality Inventory
• Robin Hood Morality Test
• Rorschach inkblot test
• 16PF Sixteen Personality Factor Questionnaire, or 16PF Questionnaire
• Swedish Universities Scales of Personality
• TAT Thematic Apperception Test
• Taylor-Johnson Temperament Analysis
• Temperament and Character Inventory
• Thomas–Kilmann Conflict Mode Instrument
• True Colors Test
• Woodworth Personal Data Sheet


APPENDIX B
Answers to Practice Questions

CHAPTER 1

1. How do we know what your memories of being tested are? We aren’t psychic. Though we have memories of being tested to see if we were.
2. Your answers will, of course, all be different from each other. But think about the times you’ve looked through journal articles in the past. Did you pay much attention to the Instruments sections?
3. Your answers will vary. But pay attention to how the answers differ based on who you ask.
4. Some ways we thought of: Sometimes tests are used as gatekeepers to keep people out of careers or opportunities. Sometimes tests are given when a professional has run out of ideas or information, like a doctor trying to diagnose a puzzling illness. Sometimes tests are designed to sell a product like an art course. Ever see in old comic books those “draw a pirate” ads?
5. We hope your instructor will allow time for you to share.
6. When you identified the hot topics in testing, did they match the summaries of other students?

CHAPTER 2

1. Levels of measurement are useful because they allow us to specify the precision with which a variable is being measured and to select or design instruments that assess that variable accurately.
2. For example, height can be measured in inches (interval or ratio) or in groups such as tall and short (nominal). And remember, since higher levels of measurement incorporate the qualities of lower levels, any variable that can be measured at one level can also be measured at the levels it subsumes.
3. Here you are:
a. Nominal. These are categories to which you can belong only one at a time.
b. Interval. This is the interval level of measurement, because you assume that the points along the underlying continuum are equally spaced, but some people would argue that it is not that precise.
c. Ratio. Those folks can have no Volvos, right? Nothing is nothing is nothing.
d. Interval. Equally spaced points, once again.
e. Interval. You might think this is ratio in nature, but it’s hard to see how someone can run any distance in no time (which would make it ratio).
4. You’re on your own for this one. If you can’t find an article that you understand, don’t try to complete the assignment. Keep hunting for articles that you can first understand; otherwise, you’ll have a hard time answering the main part of the question.
5. The interval level of measurement provides more information because it is more precise. You want to use it because the more precise a level of measurement is, the more information you have available and the more accurate your assessment (all other things being equal).
6. Here’s an example. Let’s say you are a psychologist and you are interested in measuring depression.
a. Nominal: separating study participants into three groups: those who have been tested and diagnosed as having depression, those who have been tested and diagnosed as not having depression, and those who have not been tested.
b. Ordinal: clients with depression are categorized as mild, moderate, or severe.
c. Interval: scores from the Beck Depression Inventory.
7. For example:
• Nominal: Heat is hard to measure meaningfully using categories, but one could classify the score as being from the Fahrenheit, Celsius, or Kelvin scale.
• Ordinal: cold, cool, warm, hot
• Interval: 10°F, 48°F, 63°F, 98°F
• Ratio: 277 Kelvin, 330 Kelvin
8. We would be using the ordinal level of measurement, as we are ranking the levels in terms of usefulness.
9. Here is an example. Research question: What is the relationship between parents’ education and children’s scores on the ACT test? The highest level of measurement for parents’ education could be interval, which could be measured in years of education completed. The highest level of measurement for children’s ACT scores would be interval. Not ratio, right, because all students have some amount of knowledge (the ACT is an achievement test).
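If the interval-versus-ratio distinction in answer 7 still feels slippery, a tiny calculation may help. The sketch below is only an illustration: the function name `fahrenheit_to_kelvin` and the two sample readings are our own assumptions, not part of the practice question. It compares the same pair of temperatures as a ratio on the Fahrenheit scale (interval, with an arbitrary zero) and on the Kelvin scale (ratio, with a true zero).

```python
def fahrenheit_to_kelvin(deg_f: float) -> float:
    """Convert a Fahrenheit reading to Kelvin (standard conversion)."""
    return (deg_f - 32.0) * 5.0 / 9.0 + 273.15


# Two hypothetical readings, chosen only for illustration.
cold_f, warm_f = 10.0, 20.0

# On an interval scale the zero point is arbitrary, so this ratio has no physical meaning.
ratio_f = warm_f / cold_f

# On a ratio scale zero really means "none", so this ratio is meaningful.
ratio_k = fahrenheit_to_kelvin(warm_f) / fahrenheit_to_kelvin(cold_f)

print(f"Fahrenheit ratio: {ratio_f:.2f}")
print(f"Kelvin ratio:     {ratio_k:.2f}")
```

The Fahrenheit numbers suggest "twice as hot," while the Kelvin numbers show a change of only about 2 percent. Differences are comparable on both scales; ratios are trustworthy only on the ratio scale.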

CHAPTER 3

1. You’re on your own on this one, but be sure to find articles in your own discipline and ones you find of interest.
2. Of course, there are a lot of good possible answers here.
For method error, we might
a. Have a poorly reproduced test (change the toner in the copy machine).
b. Be in a room that’s too cold (raise the thermostat).
c. Be trying to work on a computer that won’t boot (use a different computer).
For trait error, we might
a. Party too late for too long (don’t).
b. Fail to study (don’t).
c. Study with the wrong study group (duh!).
3. Error, in effect, is unreliability. All those sources of error that make the true score more difficult to discern need to be addressed. And as they are addressed, the observed score becomes more reflective of the true score and reliability will be increased.
4. It would be very nice to obtain a reliability coefficient of 1.00 but somewhat unrealistic. Why? Because such a coefficient reflects absolutely no random error in the test-taking experience, and that’s an outcome that is very difficult, if not impossible, to expect or observe. Tests, for the most part, are imperfect, and human behavior is unreliable. And a value of less than +1.00 reflects that unreliable or unstable behavior.
5. a. Test–retest reliability
b. The resulting coefficient is .82, pretty high and certainly indicating that there is a significant and strong correlation between the two testings and, hence, pretty good test–retest reliability.


6. A test is internally consistent when it “speaks with one voice”; that is, the items tend to measure the same construct, idea, or information. It is important when the goal of the test is to combine all the responses. If the test is made up of different parts, or subtests or subscales, it doesn’t make sense to expect the whole test to show internal reliability.
7. If the test you use is unreliable, you’ll never know if it’s valid (how can something do what it is supposed to do if it cannot do it consistently?). And your hypothesis will never be fairly tested. You will never know if the hypothesis is supported, because the instrument you used is unreliable and the results are untrustworthy.
8. a. Method
b. Trait
c. Trait
d. Method
9. Using the Spearman–Brown equation of $r_t = \frac{2r_h}{1 + r_h}$, we multiply 2 by .68 to get 1.36 and divide it by 1.68 to get .81. This is very good to high reliability.

10. You could say, “People completed my measure at two different times, and their scores changed moderately between the first time and the second time they completed the measure. This finding means the measure did not provide scores that were as stable as I would have liked, and I will need to make some adjustments to it. Also, I had half my participants complete one version of the measure and half complete a different version. The parallel forms’ reliability coefficient of .86 means the versions were pretty comparable to each other.”
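For readers who like to check arithmetic by machine, the Spearman–Brown step in answer 9 can be sketched in a few lines. This is a minimal illustration, not a full reliability routine; the function name `spearman_brown` is our own, and the only input taken from the answer is the split-half correlation of .68.

```python
def spearman_brown(r_half: float) -> float:
    """Estimate full-test reliability from a split-half correlation.

    Implements r_t = 2 * r_h / (1 + r_h), the formula used in answer 9.
    """
    if r_half <= -1.0:
        raise ValueError("r_half must be greater than -1")
    return 2.0 * r_half / (1.0 + r_half)


# Split-half correlation from answer 9.
r_t = spearman_brown(0.68)
print(round(r_t, 2))  # 0.81
```

The corrected value is always at least as large as a positive split-half correlation, because each half is only half as long as the test whose reliability is being estimated.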

CHAPTER 4

1. Do this one on your own, but keep in mind that long-established and long-used tests often do not have any reports of these indexes, because they have been established and are common knowledge in the field.
2. There are lots of ways to do this, but you could follow these steps:
a. Develop your test based on some theory that relates to the construct of shyness.
b. Using some other well-established criterion, find people who are shy and people who are not and administer your test.
c. Now, find out how the people in both of these groups scored on your test of shyness. Those who were in the shy group should be differentiated on the test from those in the nonshy group.
If so, you’ve done it.
3. To establish concurrent validity, you could compare scores on your measure of marital satisfaction with an existing measure of marital satisfaction. Everyone takes both measures. To establish predictive validity, you could follow up with your participants 5 years later and assess divorce rates among them.
4. A test cannot be valid unless it is reliable, and .49 is not a strong enough reliability coefficient. You cannot say a test is measuring what it is supposed to when it is providing unstable and inconsistent results.
5. You need to use a test that is both reliable and valid, because if your results do not support your hypothesis, you will never be sure whether the instrument is not measuring what it is supposed to or the hypothesis is faulty.
6. It’s just a matter of time (and money). If you have several more years to finish school, then this is fine, because it can take a very long time to develop almost any test, especially one that deals with complex social or psychological constructs. Instead, why not use an already established tool?
7. We may all differ on what we mean by the word clever, but really, the multitrait–multimethod way is indeed very well thought out. It operates on the assumption that similar traits measured in similar ways will be strongly related to one another, different traits measured using different methods will not be related at all, and everything in between (such as similar traits being measured with different methods) should be somewhat related. It allows for the evidence to converge when construct validity is present and diverge when not. Clever indeed.
8. Because the criterion is everything! The criterion, in theory anyway, should relate to whatever outcome you are trying to validate through the use of the test you are creating.
9. In one way or another, it is back to the drawing board. Perhaps your experts were not expert enough in judging the content validity of the items. Or the central construct on which the test was based is not theoretically sound, so how could you expect a measure of that construct to work and be valid? In any case, you have to reassess how you want to measure the variable of interest, the evidence on which the assessment tool is based. That’s the first step. Then the validity studies should be easier to do and do well.


CHAPTER 5

1. An advantage to reporting raw scores is that they are easy to understand and, in the case of achievement tests, often give the test taker some idea as to the number (easily converted to a percentage) that they got correct. As for disadvantages, there are many, but perhaps the most important is that a raw score by itself tells little about performance, either in absolute terms (because you don’t know what constitutes a good or excellent raw score) or relative to other test takers.
2. The formula is $P_r = \frac{3}{20} \times 100 = 15$. A raw score of 65 corresponds to a percentile rank of 15.
3. Grace scored above the mean on both tests.
4. Well, there may be many things you like about z scores, but by far the most important thing is that once computed they are comparable across distributions. Both z and T scores can be easily, and effectively, compared with each other. It is also nice that norm-referenced scores tell us so much about how people in the same group compare to one another.
5. Here’s what we came up with: The owner of a radio manufacturing company needs to be sure that each person on the assembly line can perform their assigned tasks with a level of 98% accuracy. The rationale for using a criterion-referenced test is that the owner does not care whether one assembler is better than another, only that each performs at this defined level, which is the criterion.
6. It’s a trick question. On this test, Annie and Sue demonstrated equal strength. To convert Sue’s z score of 1.5 to a T score, we multiply it by 10, to get 15, and add it to 50, to get 65, the same T score as Annie’s.
7. Using the formula T = 50 + 10z, we get 40, 50, and 60, respectively.
8. You could say, “Your total score on the attention-deficit test places you one standard deviation above where the average respondent falls. This means that your score indicates a higher level of attention deficit than about 84 percent of the people who take this test.” Percentile ranks are more familiar to the average person than are references to T scores, and most people don’t know what a standard deviation is.
9. If very small, it shows that you are working with an assessment tool where the true and observed score tend to be similar, and that suggests good reliability. If the SEM is large, it is definitely time to reconsider the usefulness of the measure you are working with. A high SEM simply means that there is too much variability between the true and observed scores, an indication of unreliability. The scores people get on this test are essentially random.
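The three calculations behind answers 2, 6, 7, and 9 are short enough to sketch together. Treat this as an illustration under stated assumptions. The function names are ours. The z scores of -1, 0, and 1 are inferred from the T scores 40, 50, and 60 in answer 7. The SEM formula (standard deviation times the square root of one minus reliability) is the usual textbook one, and the standard deviation and reliability fed into it are made up for the example.

```python
import math


def percentile_rank(scores_below: int, total_scores: int) -> float:
    """Percentage of scores falling below a given raw score."""
    return 100.0 * scores_below / total_scores


def t_score(z: float) -> float:
    """Convert a z score to a T score using T = 50 + 10z."""
    return 50.0 + 10.0 * z


def standard_error_of_measurement(sd: float, reliability: float) -> float:
    """SEM = SD * sqrt(1 - reliability); reliability must lie between 0 and 1."""
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must be between 0 and 1")
    return sd * math.sqrt(1.0 - reliability)


print(percentile_rank(3, 20))                   # answer 2: 3 of 20 scores fall below
print(t_score(1.5))                             # answer 6: Sue's z score of 1.5
print([t_score(z) for z in (-1.0, 0.0, 1.0)])   # answer 7 (z values inferred)

# Hypothetical values: the same spread of scores, two different reliabilities.
print(round(standard_error_of_measurement(15.0, 0.91), 2))
print(round(standard_error_of_measurement(15.0, 0.51), 2))
```

Holding the standard deviation constant, the lower reliability produces the larger SEM, which is the pattern answer 9 describes.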

CHAPTER 6

1. IRT stands for Item Response Theory and is a method that is used to examine patterns of responses across all items of a test and to estimate as accurately as possible a test taker’s true underlying ability. Its primary advantage over Classical Test Theory is that it defines item difficulty as a function of test takers’ ability and allows for decisions to be made about which items to use together.
2. IRT works by correlating individual item scores with total item score, and it repeats this analysis for every item and every test taker. The more items and test takers, the more accurately the test can reflect the true ability of each test taker. To draw an item characteristic curve accurately one needs lots of scores across a lot of different ability levels.
3. Your answers will depend on what article you found.
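To make the idea of an item characteristic curve a little more concrete, here is a small sketch. It assumes a two-parameter logistic model, which is one common choice and not necessarily the one the chapter had in mind. The function name `icc_2pl`, the discrimination value `a`, the difficulty value `b`, and the ability levels are all hypothetical.

```python
import math


def icc_2pl(theta: float, a: float, b: float) -> float:
    """Probability of a correct answer under a two-parameter logistic model.

    theta: test taker's ability; a: item discrimination; b: item difficulty.
    """
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


# Hypothetical ability levels and two hypothetical items.
abilities = [-2.0, -1.0, 0.0, 1.0, 2.0]
easy_item = [icc_2pl(theta, a=1.0, b=-1.0) for theta in abilities]
hard_item = [icc_2pl(theta, a=1.0, b=1.0) for theta in abilities]

for theta, p_easy, p_hard in zip(abilities, easy_item, hard_item):
    print(f"ability {theta:+.1f}: easy item {p_easy:.2f}, hard item {p_hard:.2f}")
```

Each list traces one curve: the chance of a correct answer climbs with ability, and the harder item's curve sits to the right of the easier one. That is what it means for difficulty to be defined as a function of ability.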

CHAPTER 7

1. Not only are there many different people involved (all of whom presumably get paid), but the item writing, scoring, rewriting, and norming process can also take years and cost many, many dollars.
2. Of course, your answer will be unique.
3. Do this one on your own.


4. Your table might look something like this:

Difficulty / Topic | Fluids | Gases | Solids | Formulas | Total
Easy | 7 items | 4 items | 2 items | 1 item | 14
Medium | 3 items | 7 items | 5 items | 7 items | 22
Hard | 2 items | 9 items | 2 items | 1 item | 14
Total | 12 | 20 | 9 | 9 | 50

5. A biology teacher might decide that students need to answer 60 out of 100 questions (i.e., 60%) correctly on a final exam to pass it. This method of scoring is an example of criterion-referenced scoring. If the teacher were to decide to make the test norm-referenced, they might say that only those students who, compared with their classmates, score above the mean will pass the exam. (That’s a tough teacher!)
6. Each of these is an appropriate example of an achievement test, except for choice C, which is an example of a personality test.
7. a. Knowledge
b. Application
c. Evaluation
d. Comprehension (or maybe even Analysis)
e. Knowledge
f. Knowledge
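The two pass rules in answer 5 can be compared side by side on the same set of scores. The sketch below is only an illustration. The student names and scores are invented, and the only number taken from the answer is the 60 percent cutoff.

```python
from statistics import mean

# Hypothetical percent-correct scores on the final exam.
scores = {"Ana": 72, "Ben": 58, "Cruz": 64, "Dee": 91, "Eli": 60}

CUTOFF = 60  # the fixed criterion from answer 5


def passes_criterion(score: float, cutoff: float = CUTOFF) -> bool:
    """Criterion-referenced rule: compare the score with a fixed standard."""
    return score >= cutoff


def passes_norm(score: float, class_mean: float) -> bool:
    """Norm-referenced rule: compare the score with the rest of the class."""
    return score > class_mean


class_mean = mean(list(scores.values()))
criterion_passers = [name for name, s in scores.items() if passes_criterion(s)]
norm_passers = [name for name, s in scores.items() if passes_norm(s, class_mean)]

print("class mean:", class_mean)
print("criterion-referenced passers:", criterion_passers)
print("norm-referenced passers:", norm_passers)
```

With these made-up scores, a student such as Cruz clears the fixed cutoff but still falls below the class mean. The same performance passes under one rule and fails under the other.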

CHAPTER 8

1. Because how much a person knows today can predict how they will perform in the future.
2. This one is on your own.
3. One of the first steps is to identify those skills that are important to the selected profession and then create an item. For example, it’s very important for airline pilots to be able to read maps. Map-reading skills include being able to decipher symbols and understand directions and map coordinates (among many others).
4. The basic difference is the use to which the results are put. Many aptitude tests include items that could appear on an achievement test, but that’s not the point or the purpose. Aptitude tests look to future performance whereas achievement tests evaluate current ability.


5. Because one way of defining aptitude is in how ready an individual is for a certain set of next steps, such as moving into a reading curriculum during elementary school or starting work as an accountant.
6. This one is on your own. Among many advantages of aptitude tests are their ability to assess current abilities, distinguish strengths from weaknesses, and predict future performance. One possible drawback is that some of these tests may be used for screening out candidates for a job, a school, or a program even though they only moderately predict future performance.
7. Employers can use aptitude tests to find out which candidates are most likely to perform well in a certain position or to assess an employee’s strengths and match the employee with commensurate responsibilities. Marathon runners might evaluate their potential early on for completing a marathon in a certain amount of time to find an appropriate pacing group for the big event.
8. For our example, we chose the Graduate Record Exam (GRE), which is designed to predict a potential student’s performance in graduate school. To test predictive validity, you could gather completed graduates’ past GRE scores and correlate them with their GPAs at graduation time.
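The predictive validity check described in answer 8 boils down to one correlation. Here is a minimal sketch, assuming a plain Pearson correlation is what you want. The function name `pearson_r` is ours, and the GRE scores and GPAs are entirely made up for illustration.

```python
import math


def pearson_r(x: list[float], y: list[float]) -> float:
    """Pearson correlation between two equally long lists of scores."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two lists of the same length with at least 2 values")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sum_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sum_xx = sum((a - mean_x) ** 2 for a in x)
    sum_yy = sum((b - mean_y) ** 2 for b in y)
    return sum_xy / math.sqrt(sum_xx * sum_yy)


# Hypothetical data: GRE scores at admission and GPAs at graduation
# for the same five (imaginary) graduates.
gre_scores = [150.0, 155.0, 160.0, 165.0, 170.0]
final_gpas = [3.0, 3.4, 3.3, 3.7, 3.9]

validity_coefficient = pearson_r(gre_scores, final_gpas)
print(round(validity_coefficient, 2))
```

The resulting number is the validity coefficient. The closer it is to 1, the better the earlier test scores predicted the later outcome in this sample. Five invented cases are, of course, far too few for a real validity study.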

CHAPTER 9

1–5. Sorry, for the first five practice questions, you’re on your own. There are no right or wrong answers, only the fun of thinking about some very interesting topics.
6. Among several possibilities are the following: because stopping at that point keeps the test taker from getting discouraged or fatigued (thereby affecting performance on future subtests), because it keeps the test from taking an inordinate amount of time, and because advancing likely will not provide any additional benefits.
7. Just you and the question again: no right answer.

CHAPTER 10

1. There are many different ways to answer this question, but personality is such an interesting topic because everybody has one, and there’s nothing any better or worse (on the whole) about having one type versus another. So it’s like a constant in all of our lives, and through an understanding of individual differences and such in personality, we can better understand human nature.


2. One way we discussed in this chapter is through the use of a criterion group. For example, if you were interested in studying shyness, you would locate people who are shy and people who are not and design items that discriminate between the two groups. Another approach is to start with a theory that defines the construct, such as depression, and then write items that are consistent with the theory.
3. An objective personality test uses structured stimuli and often closed-ended questions (for example, “I enjoy eating alone.”). A projective personality test uses very open-ended items where the test taker constructs a response based on the indefinite and ambiguous nature of the item. In both cases, there is no right or wrong, just a different way of accessing information about one’s distinct personality traits and characteristics. If one believes that respondents are aware of their feelings and will be honest, an objective approach is best; otherwise, a projective strategy is useful.
4. This one is on your own.
5. Have fun with this one, and be sure to keep your results confidential.
6. This one is on your own.
7. The Rorschach inkblot test is one of few personality tests normed for children younger than age 7 (see Table 10.1). Draw-a-person tests are useful for younger children, too.
8. Personality tests usually involve some sort of subjectivity and may represent a snapshot of an individual rather than an inflexible profile. Like every other test, they do not promise perfect reliability. Among several options for rounding out the picture painted by your personality test conclusions are clinical interviews, feedback sessions with the test taker, additional personality tests, behavioral observations in another environment, and verbal reports from family members or close friends.
9. As you probably know by now, there are several, and among them are using such tests to differentiate between different types of neurological disorders and also diagnosing neuropsychological disorders. These tests are all about seeing how your brain works.
10. The best answer may be that neuropsychological tests are administered in groups called batteries, since that is the only way to get a very good picture of the multidimensional relationship between brain and behavior.
11. There are several, but perhaps the most important is that forensic assessment has as its primary client the court or legal system while other types of assessment usually have as their client the individual.


12. Do this one on your own, but be sure to share it with your colleagues and classmate buddies, since it should be fun to explore.

CHAPTER 11

1. Hope you had fun. Now be sure to review the five caveats by Richard Bolles we mentioned at the end of this chapter when you read the SDS report and think about the results.
2. Here’s the idea. There are people who have been in occupations who have certain traits and characteristics and also like a particular environment in which to work (at home, on the road, etc.). If one could find out what those characteristics and preferences are within each occupation, then one may be able to match those characteristics in other people who are not yet employed but show an interest in a particular area or activity.
3. Here are the Holland codes as applied to several professions.

Check Off the Occupational Theme

Occupation | Social | Enterprising | Conventional
Educational programming director | 2 | 1 | 3
Golf club manager | 3 | 1 | 2
Paralegal | 1 | 2 | 3

3

2

Lathe operator Airplane flight attendant

Realistic

Investigative

Artistic

1 3

1

2

4. Be sure to consider the type of person who might take such a job and why. For example, it should not be any surprise that the first occupational theme for a golf club manager is enterprising; that’s their job: to generate business and manage others.
5. On your own, and this question is easiest to answer if you select from a wide group of people you know who participate in a diverse group of professional activities.
6–8. These questions are exercises for you to complete on your own. Have fun, and feel free to keep searching around!


CHAPTER 12

1. A multiple-choice question has three parts: the stem (which is the question part of the question), a keyed answer (which is the correct answer), and the wrong answers (which are called distractors).
2. There are at least two errors in this question. Stems should be complete sentences and a negative is used in the stem (“not”). By the way, otters, skunks, and wolverines are all in the weasel family, so A is the right answer.
3. Matching items have several strengths. For example, you can measure a lot of knowledge in a small space, and, when matching items are well written, guessing is much harder than with multiple-choice items because there are many more plausible distractors.
4. For assessing memorized information, it’s hard to beat a whole bunch of questions that can fit in a small space. And true–false items do that well. Sure, they are guessable, but if there are many of them, any lucky guesses will probably be canceled out by unlucky guesses.
5. Even though short answer and fill-in-the-blank formats are supply items (they can’t be easily guessed), the scoring is very objective, just like selection items.
6. There are a couple of mistakes with this fill-in-the-blank item. The blank should be at the end and the use of “an” gives away that the answer starts with a vowel sound. And it does: “equilateral.”
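Answer 4 claims that lucky guesses tend to cancel out when there are many true–false items. One way to check that is to ask how likely a pure guesser is to reach a given score. The sketch below is only an illustration. The function name `prob_at_least`, the 70 percent target, and the test lengths are our own assumptions, and it treats every guess as an independent coin flip.

```python
from math import comb


def prob_at_least(n_items: int, n_correct: int, p_guess: float = 0.5) -> float:
    """Chance of getting at least n_correct items right by guessing alone."""
    return sum(
        comb(n_items, k) * p_guess**k * (1.0 - p_guess) ** (n_items - k)
        for k in range(n_correct, n_items + 1)
    )


# How often would blind guessing reach 70% correct on tests of different lengths?
for n_items in (10, 50, 100):
    needed = (7 * n_items + 9) // 10  # smallest whole number of items that is at least 70%
    chance = prob_at_least(n_items, needed)
    print(f"{n_items:3d} items: need {needed:2d} correct, chance by guessing = {chance:.5f}")
```

On a 10-item quiz a guesser reaches 70 percent fairly often, but the chance shrinks quickly as items are added, which is the "canceling out" the answer describes.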

CHAPTER 13

1. Essay questions and writing assignments can be used to measure both writing ability and skills like critical thinking. They also can show whether students have a deep, high-level understanding of a topic.
2. How’d you do?
3. The visual arts, music, and other creative endeavors. The reason is that these subject areas often have to be judged using subjective criteria, and if the portfolio is well designed, you can get at those skill areas validly and fairly.
4. Some of the advantages are as follows:
a. They are flexible.
b. They are highly personalized for both the student and the teacher.
c. They are an alternative to traditional methods of assessment.
d. They are possibly a creative method of assessment when other tools are either too limiting or inappropriate.
5. Because students literally have to build the answer, by planning and writing an answer, or putting together a birdhouse, or creating a sculpture, for example.
6. When students choose an answer for a selection item, they only have to recognize the correct response. There is no mastery, no creativity, no skill required (except maybe test-taking skill). With construction items, though, teachers can “see” the level of skill or ability students have.
7. Good rubrics assess many different parts or qualities of the performance or product. And the range of possible ratings or scores for each part is well defined.
8. Reliability can be improved by including multiple indicators (many scores that are summed together for a total score that is pretty precise) and, more importantly, concrete, directly observable criteria for quality. The whole idea is to lessen the amount of subjectivity in the scoring as much as possible.

CHAPTER 14

1. Benefits include generating reliability estimates, identifying items that need to be reworded or removed, getting information about how long it takes to complete, checking on the interpretation of questions, “beta testing” the technology used, and so on.
2. When a test is piloted, the researcher is mainly interested in the characteristics of the scores and responses in terms of validity and reliability. The actual level of the construct in the sample isn’t really important to them, until they use their instrument in research after it has been validated.
3. If people are afraid to admit some behavior or attitude, they may need encouragement to believe that it is “okay” morally. On the other hand, if a person is “innocent” of having a wrong thought or having done something considered bad, it is unlikely that they could be persuaded to confess to something they didn’t do (at least in the context of a research survey).
4. Both provide attitude statements and people are asked to agree or disagree with them. Both use several items and combine those responses into a total score from that “scale.”


5. Responses to Likert-type items are on a continuum from disagree to agree and all items are worth the same amount of points. Responses to Thurstone items are either/or (one either endorses an option or does not) and items are worth different amounts of points based on how strongly they are worded. The Thurstone scales require an additional step during development, where “judges” weight the items based on their strength.
6. Social Exchange Theory says that if there is trust and low costs with high rewards, then a trade will happen. The trade between you and your instructor might be that they work hard to prepare and teach as well as they can and give you a good grade, if, in exchange, you agree to prepare and work as hard as you can. (You also might have to agree to laugh at their jokes.) Rewards for you could include a good grade, learning, skill development, and so on. The costs might be low for you if your instructor teaches well (so learning is easier), and the time and place and nature (online or in-person) of the course is convenient.
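The scoring difference described in answer 5 can be sketched with two small functions. This is an illustration under assumptions. The responses and judge-assigned weights are invented, and averaging the weights of the endorsed statements is one common way to score a Thurstone-type scale, not necessarily the only one.

```python
def likert_total(ratings: list[int]) -> int:
    """Likert-type scoring: every item counts equally, so just sum the ratings."""
    return sum(ratings)


def thurstone_score(endorsed: list[bool], weights: list[float]) -> float:
    """Thurstone-type scoring: average the judges' weights of the endorsed items."""
    if len(endorsed) != len(weights):
        raise ValueError("need one weight per statement")
    chosen = [w for is_endorsed, w in zip(endorsed, weights) if is_endorsed]
    if not chosen:
        raise ValueError("respondent endorsed no statements")
    return sum(chosen) / len(chosen)


# Hypothetical Likert responses on a 1 (strongly disagree) to 5 (strongly agree) scale.
likert_ratings = [4, 5, 3, 4]

# Hypothetical Thurstone statements: weights set by judges during development,
# and one respondent's either/or endorsements.
judge_weights = [2.1, 5.4, 8.3, 10.2]
endorsements = [False, True, True, False]

print(likert_total(likert_ratings))
print(round(thurstone_score(endorsements, judge_weights), 2))
```

In the Likert case the respondent supplies the numbers. In the Thurstone case the numbers come from the judges, and the respondent only decides which statements to endorse.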

CHAPTER 15

1. You’re on your own for these, but try to examine the questions that are created very carefully, and in doing so, learn how to avoid as much bias in item construction and administration as possible.
2. Test bias has to do with the differential performance of groups on a test as a function of characteristics unrelated to the concepts or ideas being tested, such as race, gender, and social class. Test fairness includes bias but is more general and refers to whether the use of the test and the effect of testing on the test taker leaves them better off than they would have been if they never took the test in the first place.
3. The difference model of test bias says that if there is a difference in test scores between groups and group differences in ability do not actually exist, then the test is biased. The problem with this model is that there may be legitimate differences in performance between groups that are due to variables related to the construct. For example, teachers may treat girls differently from boys and that might affect their math achievement. It is not the biology of being assigned male or female at birth that affects math ability; it is the developmental environment. Don’t blame the test.
4. Here are five starters:
a. Unfair hiring practices
b. Inaccurate test results
c. Inaccurate placement of individuals in groups
d. Inadequate screening
e. Inappropriate diagnoses

5. We’ll leave this tough one to you and your teacher and classmates. Our answer would start, “Yes, but . . .”
6. Universal design promotes the idea that one tool should be usable by all—one entrance to a building, one website, and one test. So, needing different forms of the same test isn’t quite consistent with that particular philosophy. But it still might be a good idea!

CHAPTER 16

1. This one is on your own, and you can email, call, or write a snail-mail letter, but be persistent until you get some kind of answer.
2. The IDEA, foremost, ensures a fair, free, and appropriate education for all students, regardless of their level of disability. There are many other advantages (and some important costs, such as $$$!). Inclusion is an important philosophical principle that no children should be left out of educational opportunities, and the IDEA helps ensure that.
3. Among the merits are full disclosure so the public can see how tests are developed and validated, and what the correct answers are to test items, among others. Some of the demerits are that Truth in Testing increases the costs of the test to test takers and, if items are not used again, can threaten the validity of the test.
4. Here are just a few:
   a. Tests should be unbiased.
   b. Test scores should reflect performance accurately.
   c. Tests should be used for placement in such a fashion that they ensure an individual’s well-being.
   d. Tests should never be used to punish the test taker.
   e. Tests should be used only by qualified personnel.
5. You’re on your own for this one.
6. Of course, you have your own opinion on this one. But we are curious about which basic principle you chose—was it to protect the test taker, ensure honesty in reporting, ensure fairness, protect the profession, or something else?
7. Your actual answer doesn’t matter as much as taking the time to think about how tests are used and their purpose, both stated and otherwise. For instance, sometimes psychologists use tests just to start a conversation, and educators tend to test because they want to assess their effectiveness.

GLOSSARY

Ability or intelligence tests  Tests that measure intelligence (a very broadly defined construct)

Consequential validity  How tests are used and how their results are interpreted

Achievement tests  Tests that assess knowledge in a particular content area

Construct-based validity  A type of validity that examines how well a test score reflects an underlying construct

Aptitude tests  Tests that evaluate an individual’s potential

Asymptotic  The quality of the normal curve such that the tails never touch the horizontal axis

Constructed-response items  Classroom assessment tasks that ask students to create a complex response, product, or written answer

Average  The most representative score in a set of scores

Content-based validity A type of validity where the test items sample the universe of items for which the test was designed

Basal age  The lowest point on an intelligence test where the test taker can pass two consecutive items that are of equal difficulty

Contextual intelligence A type of intelligence that focuses on behavior within the context in which it occurs and involves adaptation to the environment

Bell-shaped curve A distribution of scores that is symmetrical about the mean, median, and mode and has asymptotic tails

Convergent validity In the multitrait–multimethod model of establishing construct validity, when two or more measures of theoretically related constructs are actually related to each other

Ceiling age  The point on an intelligence test where at least three out of four items are missed in succession

Classical Test Theory  A simple description of the score on a test as being the result of a true score, which is the average score a person would get on a test if they took it many times, and an error score, which represents some randomness around that true score

Closed-ended essay question  An essay question format where the respondent has very little freedom in terms of what content must be in the answer

Coefficient of alienation  The proportion of variance unaccounted for in the relationship between two variables

Coefficient of determination  The proportion of variance accounted for in the relationship between two variables

Componential intelligence  A type of intelligence that focuses on the structures that underlie intelligent behavior, including the acquisition of knowledge

Concurrent validity  A type of validity that examines how well a test outcome is consistent with a criterion that occurs in the present

Correlation coefficient  A numerical index that reflects the relationship between two variables

Criterion-based validity  The property of a test that reflects a set of abilities in a current or future setting

Criterion contamination  When a criterion variable is related to what is being measured

Criterion-referenced scores  As contrasted with norm-referenced scores, scores where the interpretation of whether the score is “good” or not is based on whether it meets a predetermined standard or not

Criterion-referenced test  One where there is a predefined level of performance used for evaluation

Criterion-related validity  Another name for criterion-based validity

Cronbach’s alpha or coefficient alpha  A measure of internal consistency

Crystallized intelligence  The type of intelligence that involves the acquisition of information

Cut score  A score on a test that has been predetermined as a standard or criterion. People above or below a cut score are placed in some category of performance, like pass or fail

Dichotomous  Refers to a variable that has only two possible scores

Difficulty level  As used in Item Response Theory, it is the probability that a test taker will get an item correct

Direct correlation  A positive correlation where the values of both variables change in the same direction

Discriminant validity  In the multitrait–multimethod model of establishing construct-based validity, when two or more measures of theoretically unrelated constructs are actually unrelated to each other

Discrimination level  As used in Item Response Theory, it is how well an item distinguishes between test takers at different ability levels

Distractors  Alternatives in a multiple-choice question that are not correct

Education for All Handicapped Children Act  A law signed in 1975 by President Gerald Ford that guarantees all children, regardless of disability, the right to a free and appropriate public education; now known as the Individuals With Disabilities Education Act, or IDEA

Error score  The difference between a true score and the observed score

Essay questions  Items where the test taker writes a multisentence response to a question

Experiential intelligence  A type of intelligence that focuses on behavior based on experiences

Factor  A collection of variables that are related to one another

Factor analysis  A statistical technique that examines the relationship between a group of variables

Family Educational Rights and Privacy Act (FERPA)  A federal law that protects the privacy of students and their test results

Fill-in-the-blank items  An open-ended test format where you are asked to put an answer in a blank to complete a sentence

Fluid intelligence  The type of intelligence that reflects problem-solving ability

Forensic assessment  The use of psychological assessment tools to assist in the legal process regarding fact-finding and decisions

General-Factor Theory  A theory about intelligence that proposes that a single factor is responsible for individual differences in intelligence

High-stakes test  A testing situation where the results of the test are used to make important decisions

Indirect correlation  A negative correlation where the values of variables move in opposite directions

Individualized Education Program  A component of Public Law 94-142; a written plan of educational goals and strategies for children who are referred for special programs

Individuals With Disabilities Education Act (IDEA)  A federal law that guarantees a free and appropriate education for all children regardless of level of disability

Internal consistency reliability  A type of reliability that examines the one-dimensional nature of an assessment tool

Interrater reliability  A type of reliability that examines the consistency of raters

Interval level of measurement  A measurement system that assigns a value to an outcome that is based on some underlying continuum and has equal intervals

Item characteristic curve (ICC)  A visual representation of an item that shows the probability of a correct response as a function of the ability of the test taker

Item Response Theory (IRT)  An advance over Classical Test Theory that extends the definition of reliability as a function of the interaction between an item and the characteristics of the individual responding to the item

Keyed answer  The correct answer to a test question

Latent trait  A trait or characteristic that is not directly observable

Least restrictive environment  An environment that places the fewest restrictions on a child with disabilities

Level of measurement  The amount of information provided by the outcome measure

Likert format  A popular format for items on attitude surveys where answer options are symmetrical (an equal number of positively and negatively perceived answer options) and balanced (an effort is made to make the “distance” between each answer option about equal)

Liquid intelligence  A “type” of intelligence described by flexible problem-solving skill

Matching items  Items where premises are matched with the correct option

Mean  A type of average where scores are summed and divided by the number of scores

Measurement  The assignment of labels to outcomes

Measures of central tendency  The mean, median, and mode

Median  The point at which 50% of the cases in a distribution fall below and 50% fall above

Method errors  Errors due to differences in test administration

Mode  The most frequently occurring score in a distribution

Multiple-choice items  Items where there are several answers from which to choose

Multiple intelligences  A viewpoint that intelligence consists of independent types of intelligence such as kinesthetic and musical

Multitrait–multimethod matrix  A method of determining construct-based validity that uses multiple traits and multiple methods

Negative correlation  A value that ranges from 0 to –1 and reflects the indirect relationship between two variables

Neuropsychological tests  Assessments of the relationship between the brain and behavior, usually focusing on such abilities as intelligence, memory, language, and spatial skills

Norm-referenced tests  Tests where an individual’s test performance is compared with the test performance of other individuals

Norms  A set of scores that represents a collection of individual performances

Objective personality tests  Tests that have very clear and unambiguous questions, stimuli, or techniques for measuring personality traits

Observed score  The score that is recorded or observed

Open-ended essay question  An item where there are no restrictions on the response, including the amount of time allowed to finish

Options  The items in a matching item that are matched with a premise

Ordinal level of measurement  A measurement system that describes how variables can be ordered along some type of continuum

Parallel forms reliability  A type of reliability that examines the consistency across different forms of the same test

Percentile  The point in a distribution below which a percentage of scores fall; also called a percentile score

Performance-based assessment  A type of classroom assessment where students are asked to demonstrate ability or skill by performing in some way or creating a product

Personality tests  Tests that measure enduring traits and characteristics of an individual

Neuropsychology  The study of the relationship between the brain and behavior

Personality trait  An enduring quality such as being shy or outgoing

No Child Left Behind Act (NCLB)  A federal law that focuses on academic achievement in the elementary grades

Personality type A constellation of traits and characteristics

Nominal level of measurement  A measurement system where there are differences in quality rather than quantity

Normal curve  See bell-shaped curve

Normalized standard score  A score that belongs to a normal distribution

Norm-referenced scores  Scores that have meaning when compared with each other

Portfolio  A collection of work that shows efforts, progress, and accomplishment in one or more areas

Positive correlation  A value that ranges from 0 to 1 and reflects the direct relationship between two variables

Practice effects  The result of a change in observed score due to the familiarity of items on a previous test on the same or similar content; sometimes called carryover effects

Predictive validity  A type of validity that examines how well a test outcome is consistent with a criterion that occurs in the future

Premises  The terms in a matching item that are matched with options

Primary mental abilities  Primary abilities that reflect a general notion of intelligence

Projective personality tests  Tests that have ambiguous or unclear stimuli

Random responding  A technique where respondents are randomly assigned one question from two possibilities, one of which is a threatening question. Because only the respondent knows which question they are answering, it allows for an additional layer of privacy

Standard deviation  The average deviation from the mean

Standard error of measurement  A simple measure of how much observed scores vary from a true score

Standardized test  A test that has undergone extensive test development, including the writing and rewriting of items, hundreds of administrations, the development of reliability and validity data, and the development of norms

Standard score  A type of score that uses a common metric

Stem  The part of a multiple-choice question that sets the premise for the question

Supply items  Short-answer items in the form of a question

Ratio level of measurement  A measurement system that includes an absolute zero corresponding to an absence of the trait or characteristic being measured

Surveys  An organized set of questions used in research to gather a lot of information from a sample of people

Raw score  The observed or initial score that results from an assessment

Table of specifications  A grid (with either one or two dimensions) that serves as a guide to the construction of an achievement test

Reliability  The quality of a test such that it produces consistent scores

Rubric  A written set of scoring rules, often in the form of a table, that identifies the criteria and required parts and pieces for a quality answer or a quality product

Scales  Sets of questions all meant to measure the same construct or concept. By combining responses across all the questions, they allow for a single score to be created that represents a variable

Teacher-made tests  Tests constructed by a teacher (Duh!)

Test  A tool that assesses an outcome

Test bias  When test scores vary across different groups because of factors that are unrelated to the purpose of the test

Scales of measurement  Different ways of categorizing measurement outcomes

Test fairness The degree to which a test fairly assesses an outcome independent of traits and characteristics of the test taker unrelated to the focus of the test

Selection items A question format where respondents select their answer from a set of answer options

Test–retest reliability  Reliability that examines consistency over time

Short-answer and completion items  Items that are short in structure and require a short answer as well

Theta  A measure of the underlying ability related to a specific trait

Social Exchange Theory  A theory that suggests people will agree on a trade if rewards are high, the costs are low, and they trust each other

Trait errors  Errors due to differences in an individual

Spearman–Brown formula  A measure of reliability used to correct for the computation of the split-half reliability coefficient

Split-half reliability coefficient  A measure of internal consistency

Triarchic Theory of Intelligence  A theory of intelligence that consists of componential intelligence, experiential intelligence, and contextual intelligence

True–false items  Items with two possible answers, one of which is correct

True score  A theoretical score that represents an individual’s typical level of performance

Truth in Testing  A law first passed in New York State that guarantees access to tests and their results

Variable  Anything that can take on more than one value

T score  A standard score that has a mean of 50 and a standard deviation of 10

Variance  The square of the standard deviation, and another measure of a distribution’s spread or dispersion

Validity  The quality of a test such that it measures what it is supposed to measure

Vocational or career tests  Tests that assess interest in particular vocations or careers

Variability  The amount of spread or dispersion in a set of scores

Z score  The number of standard deviations between a raw score and the mean
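
Several of the entries above (standard deviation, variance, z score, and T score) are tied together by simple arithmetic, so a short Python sketch may help make the definitions concrete. The function names and the `sample` flag are our own illustrative choices, not anything defined in the glossary; the flag only controls whether the standard deviation divides by n − 1 (a sample) or by n (a whole population), which is an assumption you should match to your own course or software.

```python
import statistics


def z_scores(raw_scores, sample=True):
    """Return each raw score's distance from the mean in standard deviation units."""
    mean = statistics.mean(raw_scores)
    # Assumption: sample=True divides by n - 1, sample=False divides by n.
    sd = statistics.stdev(raw_scores) if sample else statistics.pstdev(raw_scores)
    return [(score - mean) / sd for score in raw_scores]


def t_scores(raw_scores, sample=True):
    """Convert raw scores to T scores (mean of 50, standard deviation of 10)."""
    return [50 + 10 * z for z in z_scores(raw_scores, sample)]


def variance_from_sd(sd):
    """Variance is the square of the standard deviation."""
    return sd ** 2


if __name__ == "__main__":
    scores = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up raw scores, mean 5, population SD 2
    print(z_scores(scores, sample=False))
    print(t_scores(scores, sample=False))
    print(variance_from_sd(2.0))
```

With these made-up scores, a raw score of 9 sits two standard deviations above the mean, so its z score is 2 and its T score is 70. Notice that the T score is just the z score restated on a friendlier scale.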

INDEX Abasement, 186 Ability test, 9, 10 (table). See also Intelligence tests Absolute zero, 28 Achievement tests, 9, 10 (table), 62, 131, 144–145 (table) criterion-referenced, 134–136 measures, 132 norm-referenced, 134 purposes, 132–133 sampling, 142–143 table of specifications, 138–142 teacher-made vs. standardized, 133 validity and reliability, 143 ACT. See American College Test (ACT) Adaptive testing, 271 Advanced Placement Examination in Studio Art, 8 Aggression, 38–39, 77 hostile, 69 instrumental, 69 measures of, 78–80 theory of, 77 American College Test (ACT), 4, 8, 51, 131, 150, 152, 156 (table) American Educational Research Association, 289 American Psychological Association, 68, 184, 289 Analysis-level questions, 140 Anderson, N., 299 Anota, A., 114 Anthropometric measurements, 6 Anti-vaccine movement, 287 Application-level questions, 140 Aptitude tests, 9, 10 (table), 173 achievement tests and, 152 cognitive skills and knowledge, SAT, 150 creation of, 151–153 GLAT, 149, 150 intelligence test and, 173 logical reasoning, 153 potential/future performance, 151 predictive validity, 151 psychomotor performance, DAT, 150 types, 154–155 validity and reliability, 158 Armed Services-Civilian Vocational Interest Survey (ASVIS), 204, 206 (table)

Armed Services Vocational Aptitude Battery (ASVAB), 157 (table) Artistic aptitude tests, 154 Asymptotic, normal curve, 100 Attitude surveys format for, 249–250 guidelines, 249 Likert format, 250–251 statements, 248 Thurstone method, 251–253 Attitude Toward Health Care Test (ATHCT), 52 Averages, 91 mean, 92–93 measures of central tendency, 92 median, 93–95 mode, 95 standard deviation, 96–98 Babbage, C., 284 Bacon, F., 283 Bakke, A., 292 Barbieri, A., 114 Bartholomew, R., 284 Basal age, 169 Bascoul-Mollevi, C., 114 Behavior Change (Burkhardt), 26 The Bell Curve (Herrnstein and Murray), 286 Bell-shaped curve, 99, 99 (figure). See also Normal curve Betrayers of Truth (Broad and Wade), 286 Binet, A., 6, 167 Birnbaum, A., 115 Bloom, B., 138, 139 Bodily-kinesthetic intelligence, 165 Bolles, R. N., 203, 204 Bonnetain, F., 114 Book smart, 164 Boston Naming Test, 191 Breuning, S., 286 Broad, W., 286 Buckley Amendment, 282–283 Bureau of Labor Statistics, 203 Burkhardt, K., 26 Buros Center for Testing, 299–301 Burt, C., 292–293

Bush, G. W., 276 Bush, V., 284 California Verbal Learning Test, 191 Campbell Interest and Skill Survey (CISS), 198, 206 (table) Campbell, J., 78 Career development tests, 10, 10 (table), 197–198, 205–206 (table) counseling, 203–204 Holland and SDS, 201–203 SII, 199–201 validity and reliability, 207 Carlson, J. F., 299 Carryover effects, 48 Cattell, J., 6 Cattell, R., 184 Cattell-Horn-Carroll Theory, 164 Ceiling age, 169–170 Charcot, J.-M., 167 Chronological age, 168 CISS. See Campbell Interest and Skill Survey (CISS) Civil Rights Act, 278 Civil service position, 5 Classical Test Theory (CTT), 13, 15, 41, 114–115, 137 Cleary, T. A., 268 Cleary model, 268 Clerical aptitude tests, 155 Clock-Drawing Test, 192 Closed-ended essay question, 232 Coefficient alpha, 56–57 Coefficient of determination, 74 Componential intelligence, 165 Comprehension-level questions, 140 Computer adaptive testing (CAT), 271 Computerized adaptive testing, 122–123 Congress passes the National Research Act in 1973, 286 Conroy, T., 114 Consequential validity, 264 Construct-based validity, 70 (table), 76–81, 79 (figure) Constructed-response items guidelines, 230–231 products, 228 skills and abilities, 228–229 student performance and test scores, 229, 230 (figure) validity and reliability, 237–238 written assignments, 228 Content-based validity, 70 (table), 71–72 Content validity ratio (CVR), 71–72

Contextual intelligence, 165 Convergent validity, 80 Conway, J., 293 Cooking Skills scale test, 73–75 Correction for attenuation, 82 Correlation coefficients computing steps in, 43–44 formula, 42 screening test, 43–44 types of, 45–46, 45 (table) COVID-19 pandemic, 7 Criterion-based validity, 70 (table), 73–76, 158, 207 Criterion contamination, 76 Criterion groups, 182–183 Criterion-referenced scores, 88 Criterion-referenced tests, 108, 134–136 Criterion-related validity, 73 Cronbach, L., 56, 220 Cronbach’s alpha (α), 56–57, 59 Crystallized intelligence, 164 CTT. See Classical Test Theory (CTT) Cut score, 88 CVR. See Content validity ratio (CVR) Darsee, J., 286 Darwin, C., 5 DAT. See Differential Aptitude Test (DAT) The Dating Game, 186 Dawson, C., 284 Decision-making process, 280 DeNobel, V., 287 Denver II, 156 (table) Descent with modification, 5 DeVellis, R., 244 Diagnostic and Statistical Manual of Mental Disorders (DSM), 182 Dichotomous, 57–59 Dictionary of Occupational Codes, 202–203 DIF. See Differential item functioning (DIF) Difference-difference bias, 265–266 Differential Aptitude Test (DAT), 150, 151, 152 (table), 157 (table) Differential item functioning (DIF), 266–268 Difficulty index, 19 Difficulty level, IRT, 117–118, 118 (figure) Direct correlation, 45 Discriminant validity, 80 Discrimination level, IRT, 118–120, 119 (figure) Dispersion, 96 Distractors, 215 Double-barreled questions, 249 Draw a Person (DAP), 188 (table)

Educational Testing Service (ETS), 7, 86, 142, 150 Education for All Handicapped Children Act of 1975, 278 Edwards, A., 186 Edwards Personal Preference Schedule, 186 Edwards scales, 186 eHarmony®, 186 Electric shock experiments, 285 Elementary and Secondary Education Act of 1965, 276 Emotional Intelligence (Goleman), 166 Employment evaluation, 198 Error score, 39–40 ESSA. See Every Student Succeeds Act (ESSA) Essay questions closed-ended, 232 guidelines, 232–233 open-ended, 231 Ethical principles, tests and measurement, 288–289 ETS. See Educational Testing Service (ETS) Evaluation-level questions, 141 Every Student Succeeds Act (ESSA), 277 Executive function, 191–192 Experiential intelligence, 165 Face validity, 72 Factor, 184 Factor analysis intelligence tests, 163–164 personality tests, 184–186 FairTest, 264–265 Family Educational Rights and Privacy Act (FERPA), 282–283 Federal agencies, 286 FIGHT scale, 77–78, 80 Figural Fluency Test, 192 Fill-in-the-blank items, 220–222 Finger tapping, 29 Fisk, D., 78 Five-factor model, 185 Flynn, J., 291 Flynn effect, 291 Ford, G., 278 Forensic assessment, 192–193 Franklin, B., 75 Freud, S., 77, 167, 204 Frey, B. B., 6, 9, 23, 24, 133, 134, 153 Galton, F., 6 Gardner, H., 165–166 Geisinger, K. F., 299 General Education Development (GED), 145 (table)

General-Factor Theory, 162–163 Gesell, A., 154, 155 Gesell Child Development Age Scale, 155 Giles, J., 285 GLAT. See Google Labs Aptitude Test (GLAT) Goleman, D., 166 Google, 149–151 Google Labs Aptitude Test (GLAT), 149, 150 Gould, S., 170 Gourgou-Bourgade, S., 114 Govern, J., 68 Graduate Record Exam (GRE), 4, 73, 144 (table), 155 Handbook of Psychological and Educational Assessment of Children: Intelligence and Achievement (Robertson), 136 Harmon, L., 203 Health Problems Checklist, 8 Helsinki Declaration, 285 Herrnstein, R., 286 High-stakes test, 135, 280–281 Holland code, 201, 202 (figure) Holland, J., 201–203 Hooper Visual Organization Test (VOT), 192 Horn-Cattell Model, 164 Hostile aggression, 69 Howard, M., 293 ICC. See Item characteristic curve (ICC) Indirect correlation, 45 Individualized Education Program (IEP), 279, 280 Individuals With Disabilities Education Act (IDEA), 279–280 Information and Communication Technology Literacy Assessment, 142 Inkblot, 179–180, 180 (figure) Institutional Review Boards (IRBs), 286, 288 Instrumental aggression, 69 Intelligence tests, 6–7, 9, 10 (table), 161–162 book smart and street smart, 164 emotional, 166–167 factor analysis, 163–164 General-Factor Theory, 162–163 multiple-factor approach, 163 multiple intelligences, 165–166 neuropsychological tests, 190–191 Stanford–Binet Intelligence Scale, 167–170 types, 164–165, 171–172 (table) validity and reliability, 170, 173 Interception, 186 Intermediate Measures of Music Audition (Grades 1–6), 154

Internal consistency reliability, 47 (table), 51–55, 57, 60 Interpersonal intelligence, 165 Interrater reliability, 38–39, 47 (table), 49–50, 60 Interval level of measurement, 26–28, 27 (figure), 32, 250, 253 Intrapersonal intelligence, 165 Iowa Test of Basic Skills (ITBS), 133, 145 (table) IRT. See Item Response Theory (IRT) iStartStrong Report, 200 Item-by-item basis Cleary model, 268 DIF, 266, 267 face of things model, 267–268 handy-dandy four-step procedure, 268–269 IRT, 266 Item characteristic curve (ICC), 116–117, 116 (figure), 125 Item Response Theory (IRT), 15, 42, 113–114, 138, 266 ability, 115 computerized adaptive testing, 122–123 CTT, 114–115 curves, 120–122 development of, 115–116 difficulty level, 117–118, 118 (figure) discrimination level, 118–120, 119 (figure) “good” and “poor” items, 125, 126 (figure) ICC, 116–117, 116 (figure) information function, 122 IRTPRO analysis, 123–125, 124 (figure), 125 (figure) latent trait, 115 Unidimensional Analysis dialog box, 123, 124 (figure) Item sampling, 50 Jeopardy strategy, 169 Jonson, J. L., 299 Juzyna, B., 114 Kaplan, S., 7 Kaufman Assessment Battery for Children (K-ABC), 172 (table) Kelvin scale, 29 Keyed answer, 215 Kinsey, A., 285 Knowledge-level questions, 140 Kreishok, T., 199 Krugman, S., 285 Kuder Occupational Interest Survey (KOIS), 205 (table) Kuder-Richardson formula, 57–59

Latent trait, 115 LaValle, K., 280 Lavergne, C., 114 Law School Admission Council, 153 Law School Admission Test (LSAT), 153 Lawshe, C. H., 71 Least restrictive environment (LRE), 278–280 Level of measurement, 21–22, 31–32, 250 characteristics of, 29–30, 30 (table) interval, 26–28, 27 (figure), 32 nominal, 23–25 ordinal, 25–26 ratio, 28–29 variables, 22–23 Levels of Bloom’s taxonomy, 139, 140 (table), 141 (table) analysis-level questions, 140 application-level questions, 140 comprehension-level questions, 140 evaluation-level questions, 141 knowledge-level questions, 140 synthesis-level questions, 141 Likert, R., 250, 251 Likert format, 56, 250–251 Linguistic intelligence, 165 Liquid intelligence, 164 Loaded questions, 247 Logical-mathematical intelligence, 165 Lord, R., 115 Marsh, L., 68 Mastering Vocational Education Test (MVET), 48 Matching items, 217–218 Maturation process, 154 McCarthy Scales of Children’s Abilities (MSCA), 172 (table) MCT. See Minnesota Clerical Test (MCT) Mean, 92–93 Measurement, 22 levels of, 21–32 professional practice, 275–293 randomness, 40 SEM, 108–110 See also Test(s) Measures of central tendency, 92 Mechanical aptitude tests, 154 Median, 93–95 Mele, P., 287 Memory Assessment Scales, 191 Mendel, G., 286 Mental age, 168

Mental Measurements Yearbook (MMY), 299, 300 (figure) Mental tests, 6–7 Messick, S., 264, 266 Method errors, 41, 114 Milgram, S., 285 Millikan, R., 286 Mind control research program, 285 Minnesota Clerical Test (MCT), 155, 157 (table) Minnesota Multiphasic Personality Inventory (MMPI), 4, 182–183, 187 (table) Misconduct, 287 The Mismeasure of Man (Gould), 170 MMY. See Mental Measurements Yearbook (MMY) Mode, 95 Modern Test Theory, 113. See also Item Response Theory (IRT) Multilingual Aphasia Test, 191 Multiple-choice items context dependent, 216 critical thinking skills and creativity, 214–215 distractors, 215 keyed answer, 215 rules, 215–216 stem, 215 understanding level, 214 Multiple intelligences, 165–166 Multitrait–multimethod matrix, 78, 79 (figure) Murray, C., 286 Murray, H., 186 Musical intelligence, 165 Myers-Briggs Type Inventory (MBTI), 4, 178 National Adult Reading Test (NART), 190 National Council on Measurement in Education, 289 National Institutes of Health (NIH), 283, 286, 287 Naturalist intelligence, 165 NCLB. See No Child Left Behind Act (NCLB) Negative correlation, 45 Neuropsychological tests, 9, 10 (table), 189 assessment, 189 conditions, 189–190 executive function, 191–192 forensic assessment, 192–193 intelligence, 190–191 language, 191 memory, 191 visuospatial ability, 192 Neuropsychology, 189 Newton, I., 286 New York Department of Health, 285 NIH. See National Institutes of Health (NIH)

No Child Left Behind Act (NCLB), 292 ESSA and, 277 objections, 277–278 primary mission, 276 purpose, 276 Nominal level of measurement, 23–25 Nonthreatening questions, 246 Normal curve, 99, 99 (figure) asymptotic, 100 distribution of cases, 101–103, 101 (figure) mean, median and mode, 99–100 raw score and standard deviation, 100–101, 100 (figure) symmetrical, 100 Norm-referenced scores, 88 Norm-referenced tests, 134 Norms, 88 Novak, M., 115 The Novum Organon (Bacon), 283 Nuremberg Code, 284, 285 Nuremberg Trials, 284 Obama, B., 277 Objective personality tests, 179 Observed score, 15, 39–41, 87 On the Origin of Species (Darwin), 5 Open-ended essay question, 231 Options, 217 Ordinal level of measurement, 25–26 Parallel forms reliability, 47 (table), 50–51, 60 Pearson product-moment correlation coefficient, 42, 51 Percentile/percentile ranks, 88–91 Performance-based assessment essay questions, 231–233 portfolios, 235–237, 236 (table) rubrics, 233–235, 234 (table) skill and ability, 228 validity and reliability, 237–238 Personality tests, 7, 9, 10 (table), 178, 187–188 (table) content and theory, 181 criterion groups, 182–183 factor analysis, 184–186 objective, 179 projective, 179–180 validity and reliability, 193–194 Personality trait, 178 Personality type, 178 Piaget, J., 167 Poehlman, E., 287 Poisson, R., 286

Portfolios achievement tests, 236 advantages and disadvantages, 236, 236 (table) characteristics, 236–237 description, 235–236 Positive correlation, 45 Practice effects, 48 Predictive validity, 73–76, 151, 158, 173 Premises, 217 Primary Measures of Music Audition (Grades K–3), 154 Primary mental abilities, 163 Projective personality tests, 179–180 Psychological tests, 9 Public Law (PL) 94-142, 278, 279 Rafferty, M., 284 Random responding, 248 Ratio level of measurement, 28–29 Raw scores, 86, 86–87 (table), 87 percentiles, 88–91 z scores, 103–106, 104–105 (table) Readiness aptitude tests, 154–155 Reading ability, 24, 26–27, 184 Reflections on the Decline of Science in England, And Some of Its Causes (Babbage), 284 Reliability, 12, 37–39 achievement tests, 143 aptitude tests, 158 career development tests, 207 coefficient, 59–60 conceptual, 40–42 constructed-response and performance-based items, 237–238 correlation coefficient, 42–46 Cronbach’s alpha (α), 56–57 equation, 41 establishing, test, 61–62 intelligence tests, 170, 173 internal consistency, 47 (table), 51–55, 57–59 interrater, 38, 47 (table), 49–50 Kuder-Richardson formula, 57–59 number on, 42–46 objectively scored items, 222–223 parallel forms, 47 (table), 50–51 personality tests, 193–194 scores, 38–41 split-half, 52–55 test–retest, 38, 47–49, 47 (table) types of, 46–59, 47 (table) validity and, 69–70, 81–82 Remember Everything Test (RET), 50–51

Researcher-made tests, 133 Response rate, 253–255 Restricted response essay question, 232 Reverse discrimination, 292 Review sample, MMY, 299, 300 (figure) Revised NEO Personality Inventory, 187 (table) Rey, A., 192 Rey–Osterrieth Complex Figure Test (ROCFT), 192 Robertson, G. J., 136 Rorschach inkblot test, 179–180, 180 (figure), 187 (table) Rubrics, 233–235, 234 (table) Rudner, L., 120 Salkind, N. J., 23, 61, 134 SB-5, 168, 171 (table) Scale development steps, 244–246 Scales, 244 Scales of measurement, 30, 31 (figure) Schlueter, J. E., 299 Scholastic Aptitude Test (SAT) achievement-like items, 152 Cleary model, 268 cognitive skills and knowledge, 150 purpose, 156 (table) test fairness, 263–264 Truth in Testing law, 280–281 School admissions, 292 Schultz, C., 131 Science: The Endless Frontier (Bush), 284 SDS. See Self-Directed Search (SDS) Selection items, 213 matching, 217–218 multiple-choice, 214–217 short-answer and completion, 220–222 true–false, 218–220 validity and reliability, 222–223 Self-Directed Search (SDS), 201–203, 205 (table) SEM. See Standard error of measurement (SEM) Sexual Behavior in the Human Female (Kinsey), 285 Sexual Behavior in the Human Male (Kinsey), 285 Short-answer and completion items, 220–222 Shyness, 184 SII. See Strong Interest Inventory (SII) Simon, T., 6, 167 Situational Self-Awareness Scale (SSAS), 68 16PF Adolescent Personality Questionnaire (APQ), 184–185, 187–188 (table) Slosson Full-Range Intelligence Test (S-FRIT), 172 (table) Social Exchange Theory, 253–255 Spatial intelligence, 165


Spearman, C., 162, 163 Spearman–Brown formula, 52–53 Split-half reliability coefficient, 52–55 Spread, 96 Stability, 47, 50 Standard deviation computing, 96–99 normal curve, 100–103 SEM, 109 z score, 103–106, 108 Standard error of measurement (SEM), 61, 109–110 Standardized tests, 13, 14 (table) steps, 136–167 teacher-made vs., 133 Standard score, 103 T scores, 106–108 z scores, 103–106 Standards, professional practice, 289–290 Stanford Achievement Tests (SATs), 4, 7, 51 Stanford–Binet Intelligence Scale, 6, 167–170 Stems, 215, 216 Sternberg, R., 164, 165 Stevens, S. S., 23 Street smart, 164 Stress and Health: Journal of the International Society for the Investigation of Stress (Verhaeghe), 25 Strong, E., 199 Strong Interest Inventory (SII), 199–201, 205 (table) Strong Vocational Interest Blank (SVIB), 4, 199 Stroop, J., 192 Stroop Test, 192 Succorance, 186 Supply items, 220–222 Surveys attitudes, 248–253 nonthreatening questions, 246 questionnaires and, 244 random responding, 248 response rate, 253–255 scale development steps, 244–246 threatening questions, 247 Synthesis-level questions, 141

Table of specifications, 71, 138–142 Teacher competency, 291–292 Teacher-made tests, 133 TerraNova, 144 (table) Test(s), 4 ability/intelligence, 9, 10 (table), 161–173 achievement, 9, 10 (table), 131–145 aptitude, 9, 10 (table), 149–158 characteristics, 12 classification, 11 courses, 15–16 creation, 13–15, 14 (table) CTT. See Classical Test Theory (CTT) Darwin’s reasons, 5–6 diagnosis, 11 forms of, 11 measurement and, 4, 8–12, 15–19 neuropsychological, 9, 10 (table), 189–193 overview, 5–8, 10 (table) personality, 9, 10 (table), 178–188, 193–194 placement, 11 research, 11 selection, 11 vocational or career, 10, 10 (table), 197–207 Test bias, 262 definition, 262 FairTest, 264–265 item-by-item, 266–269 models of, 265–266 test fairness and, 263–264 Test of Memory and Learning, 191 Test–retest reliability, 38, 47–49, 47 (table), 59 Tests in Print IX (Anderson, Schlueter, Carlson & Geisinger), 299 Thematic Apperception Test (TAT), 180 Theory of aggression, 77 Theta, 116–117 Threatening questions, 247 Thurstone, L., 163, 251 Thurstone method, 251–253 Time sampling, 47, 50 Tobacco, 287 Trait errors, 41, 114 Triarchic Theory of Intelligence, 164 True–false items, 218–220 True score, 13, 15, 39–40 Truth in Testing law, 280–281 T scores, 106–108 The twenty-first mental measurements yearbook (Carlson, Geisinger & Jonson), 299 Typically developing/intellectual disability, 168

Universal design, 261, 269 buildings, 270 content and wording of tests, 271 contrast, 271 illustrations, 271 text formatting, 270 typefaces, 270 white space, 270–271 Unrestricted response essay question, 231

Untransformed score, 87 U.S. Department of Energy, 284 U.S. Office of Science and Technology Policy, 287 Vaccination rates, 287 Validity, 12, 67–68 achievement tests, 143 aptitude tests, 158 arguments, 68–69, 81 career development tests, 207 concurrent, 73–76 construct-based, 70 (table), 76–81, 79 (figure) constructed-response and performance-based items, 237–238 content-based, 70 (table), 71–72 convergent, 80 criterion-based, 70 (table), 73–76 criterion-related, 73 definition, 68 discriminant, 80 face, 72 intelligence tests, 170, 173 objectively scored items, 222–223 personality tests, 193–194

predictive, 73–76 reliability and, 69–70, 81–82 strategies, 69 types, 70–81, 70 (table) unitary, 80 Variability, 96 Variable, 22 Variance, 96 Verhaeghe, R., 25 Visuospatial ability tests, 192 Vocational or career tests, 10, 10 (table). See also Career development tests Wade, N., 286 Wakefield, A., 287 Wechsler Adult Intelligence Scale (WAIS), 171 (table), 190 Wechsler Test of Adult Reading (WTAR), 190 Wisconsin Card Sorting Test (WCST), 192 Woods, C., 38–39 Woodworth Personal Data Sheet, 181 Work personalities, 201 Z scores, 103–106, 104–105 (table)