Historical and Conceptual Foundations of Measurement in the Human Sciences: Credos and Controversies [1 ed.] 0367225247, 9780367225247


English · 358 pages [381] · 2021


Table of contents :
Cover
Half Title
Title Page
Copyright Page
Contents
List of Tables
List of Figures
Preface
Acknowledgments
1 What Is Measurement?
1.1 Keating’s War and Thorndike’s Credo
1.2 What Is (and What Is Not) Measurement?
1.2.1 Four Definitions of Measurement
1.2.2 Measurement Terminology
Attributes and Objects
Classical Conception of Quantity, Magnitude, and Units
Extensive and Intensive Attributes
The Metrological Definition of Quantity and the Role of the Reference
Homogeneity
Experimentation and Invariance
1.2.3 So Can We Measure the Greatness of a Poem?
1.3 Educational and Psychological Measurement
1.4 Overview of This Book
Appendix: Some Important Statistical Concepts
The Binomial Probability Distribution
The Law of Errors: The Normal Distribution
The Central Limit Theorem
The Probable Error
2 Psychophysical Measurement: Gustav Fechner and the Just Noticeable Difference
2.1 Overview
2.2 The Origins of Psychophysics
2.2.1 Fechner’s Background
2.2.2 Fechner’s Conceptualization of Measurement
2.2.3 Weber’s Law
2.2.4 A Measurement Formula
2.3 The Method of Right and Wrong Cases (the Constant Method)
2.3.1 An Illustration of the Experiment
2.3.2 Applying the Law of Errors
2.4 Criticisms
2.4.1 The Quantity Objection
2.4.2 Derivation of the Measurement Formula and Estimation of jnds
2.5 Fechner’s Legacy
2.6 Sources and Further Reading
Appendix: Technical Details of Applying the Cumulative Normal Distribution Function as Part of the Constant Method
3 Whenever You Can, Count: Francis Galton and the Measurement of Individual Differences
3.1 Overview
3.2 Galton’s Background
3.2.1 The Polymath
3.2.2 Nature and Nurture
3.2.3 A Brush With Failure
3.2.4 Heredity and Individual Differences
3.3 Three Influences on Galton’s Thinking
3.3.1 Quetelet’s Social Physics
3.3.2 The Quincunx
3.3.3 The Cambridge Mathematics Tripos
3.4 The Concept of Relative Measurement
3.4.1 Use of the Normal Distribution in Hereditary Genius
3.4.2 A Statistical Scale for Intercomparisons
3.5 Galton’s Conceptualization of Measurement
Appendix: An Illustration of Galton’s Method of Intercomparison (Relative Measurement)
4 Anthropometric Laboratories, Regression, and the Cautionary Tale of Eugenics
4.1 Galton’s Instrumental Innovations
4.1.1 Anthropometric Laboratories
Test of Color Sense
Test of Length Judgment
Test of Angle Judgment
4.1.2 Exploratory Instrumental Approaches to the Measurement of Human Intelligence
Mental Imagery and Visualization
Associative Processes
4.2 The Discovery of Regression and Correlation
4.2.1 Galton’s Initial Insights About a Statistical Model for Heredity
4.2.2 Regression to Mediocrity
4.2.3 Correlation
4.3 The Horror of Eugenics
4.4 Galton’s Legacy
4.5 Sources and Further Reading
5 Mental Tests and Measuring Scales: The Innovations of Alfred Binet
5.1 Overview
5.2 Binet’s Background
5.2.1 By the Force of His Fists
5.2.2 From Hypnosis to the Intellectual Development of Children
5.3 The Binet–Simon Measuring Scale
5.3.1 Necessity as the Mother of Invention
5.3.2 The 1905 Scale
5.3.3 The 1908 and 1911 Revisions
5.3.4 The Role of Education
5.4 Binet’s Conceptualization of Measurement
5.5 Criticisms
5.6 Binet’s Legacy
5.7 Sources and Further Reading
6 Measurement Error and the Concept of Reliability
6.1 Overview
6.2 Spearman’s Background
6.3 Disattenuating Correlation Coefficients
6.4 Replications, Occasions, and Measurement Error
6.4.1 Yule’s Proof
6.4.2 Thought Experiments and Shots Fired
6.4.3 The Problem of Defining a Unique Measurement Occasion
6.5 Varying Test Items and the Spearman–Brown Prophecy Formula
6.6 The Development of Classical Test Theory
7 Measurement Through Correlation: Spearman’s Theory of Two Factors
7.1 Formalization of the Theory of Two Factors
7.2 Method of Corroborating the Theory
7.3 Building a Model of Human Cognition
7.4 The Interpretation of g
7.5 The Utility of the Two-Factor Theory
7.6 Spearman’s Conceptualization of Measurement
8 Theory vs. Method in the Measurement of Intelligence
8.1 Challenges to the Theory of Two Factors
8.2 Godfrey Thomson’s Sampling Theory of Ability
8.3 Edwin Wilson and the Indeterminacy of g
8.4 Louis Thurstone’s Multiple-Factor Method
8.5 Spearman on Defense
8.5.1 Responses to Thomson
8.5.2 Responses to Wilson
8.5.3 Responses to Thurstone
8.6 Spearman’s Legacy
8.7 Sources and Further Reading
Appendix: Simulating Thomson’s Sampling Theory Model
9 The Seeds of Psychometrics: Thurstone’s Subjective Units
9.1 Overview
9.2 Thurstone’s Background
9.3 Toward Psychological Measurement
9.3.1 Discriminal Processes
9.3.2 The Law of Comparative Judgment
9.4 Constructing a Psychological Continuum
9.4.1 Applying the Law of Comparative Judgment in the Street of Chance Experiment
9.4.2 Applying the Method of Equal Appearing Intervals in The Hide-Out Experiment
9.5 Thurstone’s Conception of Measurement
9.5.1 Subjective Measurement Units
9.5.2 The Role of Invariance
9.6 Likert Scales
9.7 Thurstone’s Legacy
9.8 Sources and Further Reading
10 Representation, Operations, and the Scale Taxonomy of S. S. Stevens
10.1 Overview
10.2 Stevens’s Background
10.3 Norman Campbell and the Representational Approach to Measurement
10.3.1 Fundamental and Derived Measurement
10.3.2 The Ferguson Committee
10.4 Stevens’s Conceptualization of Measurement
10.4.1 On the Theory of Scales of Measurement
Broadening the Definition of Measurement
The Stevens Scale Taxonomy
10.4.2 Operationalism
10.5 The Process of Operational Measurement
10.5.1 The Method of Magnitude Estimation
10.5.2 The Power Law
10.5.3 Cross-Modality Matching
10.5.4 The Role of Argument and Pragmatism
10.6 Criticisms
10.6.1 A Logical Inconsistency and an Operational Problem
10.6.2 An Axiomatic Critique
10.6.3 Michell’s Realist Critique
10.7 Stevens’s Legacy to Measurement
10.8 Sources and Further Reading
References
Index


HISTORICAL AND CONCEPTUAL FOUNDATIONS OF MEASUREMENT IN THE HUMAN SCIENCES

Historical and Conceptual Foundations of Measurement in the Human Sciences explores the assessment and measurement of nonphysical attributes that define human beings: abilities, personalities, attitudes, dispositions, and values. The proposition that human attributes are measurable remains controversial, as do the ideas and innovations of the six historical figures—Gustav Fechner, Francis Galton, Alfred Binet, Charles Spearman, Louis Thurstone, and S. S. Stevens—at the heart of this book. Across 10 rich, elaborative chapters, readers are introduced to the origins of educational and psychological scaling, mental testing, classical test theory, factor analysis, and diagnostic classification and to controversies spanning the quantity objection, the role of measurement in promoting eugenics, theories of intelligence, the measurement of attitudes, and beyond. Graduate students, researchers, and professionals in educational measurement and psychometrics will emerge with a deeper appreciation for both the challenges and the affordances of measurement in quantitative research.

Derek C. Briggs is Professor in the Research and Evaluation Methodology Program in the School of Education and Director of the Center for Assessment Design Research and Evaluation at the University of Colorado Boulder, USA. A former editor of the journal Educational Measurement: Issues & Practice, he is the 2021–2022 President of the National Council on Measurement in Education.

HISTORICAL AND CONCEPTUAL FOUNDATIONS OF MEASUREMENT IN THE HUMAN SCIENCES Credos and Controversies

Derek C. Briggs

First published 2022 by Routledge, 605 Third Avenue, New York, NY 10158, and by Routledge, 2 Park Square, Milton Park, Abingdon, Oxon, OX14 4RN

Routledge is an imprint of the Taylor & Francis Group, an informa business

© 2022 Taylor & Francis

The right of Derek C. Briggs to be identified as author of this work has been asserted by him in accordance with sections 77 and 78 of the Copyright, Designs and Patents Act 1988.

All rights reserved. No part of this book may be reprinted or reproduced or utilised in any form or by any electronic, mechanical, or other means, now known or hereafter invented, including photocopying and recording, or in any information storage or retrieval system, without permission in writing from the publishers.

Trademark notice: Product or corporate names may be trademarks or registered trademarks, and are used only for identification and explanation without intent to infringe.

Library of Congress Cataloging-in-Publication Data
Names: Briggs, Derek C., author.
Title: Historical and conceptual foundations of measurement in the human sciences : credos and controversies / Derek C. Briggs.
Description: New York, NY : Routledge, 2022. | Includes bibliographical references and index.
Identifiers: LCCN 2021020032 | ISBN 9780367225247 (hardback) | ISBN 9780367225230 (paperback) | ISBN 9780429275326 (ebook)
Subjects: LCSH: Psychometrics—History. | Psychological tests—History. | Educational tests and measurements—History. | Scaling (Social sciences)—History.
Classification: LCC BF39 .B758 2022 | DDC 150.1/5195—dc23
LC record available at https://lccn.loc.gov/2021020032

ISBN: 978-0-367-22524-7 (hbk)
ISBN: 978-0-367-22523-0 (pbk)
ISBN: 978-0-429-27532-6 (ebk)

DOI: 10.1201/9780429275326

Typeset in Bembo by Apex CoVantage, LLC

TABLES

2.1 The SI Base Units and the Defining Constants Used to Realize Them
2.2 Hypothetical Results From an Experiment Judging Weight Differences Using the Method of Constant Stimulus
4.1 Statistical Scale With Descriptive Reference Points for Illumination of Visualized Mental Image
4.2 Cross-Tabulation of Height (stature) by Forearm Length (length of left cubit) as Reproduced by Bulmer (2003)
5.1 The 1905 Binet–Simon Measuring Scale of Intelligence
5.2 Excerpts From Binet's Diagnosis of Three Children Using the 1905 Binet–Simon Scale
5.3 The 1911 Binet–Simon Measuring Scale of Intelligence
6.1 The Primary Analytic Samples Used in Spearman's 1904 Publications
6.2 Observed Correlations From the Village Sample (N = 24)
7.1 An Example of a Hierarchical Correlation Matrix
8.1 Random Assignment of Number of Group and Specific Factors to Tests
8.2 Group Factors (columns) Assigned at Random to Each Hypothetical Test (rows)
8.3 Patterns of Overlap in Group Factors by Test Pairing
8.4 Thomson's Simulated Correlation Hierarchy
8.5 Wilson's Hypothetical Example
8.6 Correlation Matrix With Tests Before (lower triangle) and After Transformation (upper triangle)
9.1 Proportion of the Schoolchildren in Mendota, Illinois, Who Said That the Offense at the Top of the Table Is More Serious Than the Offense at the Side of the Table
9.2 Unit Normal Deviates Associated With the Proportions of the Schoolchildren in Mendota, Illinois, Who Said That the Offense at the Top of the Table Is More Serious Than the Offense at the Side of the Table
10.1 The Stevens Taxonomy for Measurement Scales
10.2 Examples Provided by Stevens of Measures With Different Scale Properties
10.3 Direct Response Methods for Operational Measurement

FIGURES

1.1 Timeline of Historical Contributions to Measurement Theory and Practice Covered in This Book
1.2 Example of a Binomial Probability Density Histogram With Normal Curve Overlay
2.1 Gustav Theodor Fechner (1801–1887)
2.2 Observed Results From Nine Comparisons of Weights Repeated 20 Times Each
2.3 Theoretical Results From a Comparison of Weight With Magnitudes xc and xt Over Infinite Replications
2.4 Theoretical Distribution of Differences in Sensation Intensity Over Replications
2.5 Results From Regressing Column 5 on Column 3 of Hypothetical Data in Table 2.1
2.6 The Anticipated Result If Fechner's Law Holds
3.1 Francis Galton (1822–1911)
3.2 The First of Three Quincunx Illustrations Galton Included in Natural Inheritance
3.3 The Distribution of the Students Earning Honors Over 2 Years of the Cambridge Tripos
3.4 Observed and Expected Distribution of Examination Marks for Candidates Applying to the Royal Military College at Sandhurst, December 1868
3.5 The Normal Distribution Generalization in Hereditary Genius
3.6 Simulated Height Data
3.7 Comparison of Estimated and Actual z-Scores, r = .964
3.8 Comparison of Theoretical and Empirical Ogives Based on Simulated Data
4.1 The Sheet Used to Record Measurements at Galton's Anthropometric Laboratory in 1884
4.2 The Illustrations of Galton's Tests of Visual Acuity ("1" and "1a") and Color Sense ("2")
4.3 The Illustrations of Galton's Tests of Judgment of Length ("3"), Judgment of Angle ("4"), and Speed of a Punch ("5")
4.4 Galton's Two-Stage Quincunx
4.5 Galton's Use of the Two-Stage Quincunx to Illustrate the Role of Reversion in Maintaining Population Equilibrium
4.6 An Annotated Version of Galton's Table Comparing Stature of Parents and Children That Highlights the Quincunx Parallel
4.7 Galton's 1886 Depiction of Regression to the Mean
4.8 Galton's Geometric Solution for the Slope of a Regression Line
4.9 Text of the Virginia Sterilization Act of 1924
5.1 Alfred Binet (1857–1911)
5.2 Changes to the Composition of the Binet–Simon Measuring Scales From 1905 to 1911
5.3 Test on the 1908 Binet–Simon Scale Used to Distinguish Children at Ages 3, 7, and 12
6.1 Charles Spearman (1863–1945)
6.2 Illustration of the Concept of an Attenuated Correlation
6.3 Spearman's Band-Shooting Example
6.4 Using the Spearman–Brown Formula to Show the Impact of Doubling Test Length on Reliability
7.1 A Visual Representation of Spearman's Two-Factor Theory
7.2 Spearman's 1923 Model of "Noegenetic" Human Cognition
7.3 Spearman's Conception of Mental Energy
8.1 Spearman's Program of Research Related to the Two-Factor Theory
8.2 Competing Theories of Mental Ability as Represented by Guilford (1936)
8.3 Parallel Analysis of Thomson's Simulated Data
9.1 Louis Leon Thurstone (1887–1955) and Thelma Gwinn Thurstone (1897–1993)
9.2 Thurstone's First and Second Graphics Depicting Discriminal Processes
9.3 Thurstone's Third and Fourth Graphics Depicting Discriminal Processes
9.4 Seriousness of Crimes as Judged by 240 High School Students in Mendota, Illinois, Before and After Seeing the Film Street of Chance
9.5 Thurstone's Item Map Showing the Changes in Scale Locations Before and After Students Viewed the Film Street of Chance
9.6 Thurstone's Comparison of the Distributions of Students' Attitudes Toward Prohibition Before and After Seeing the Movie Hide-Out
10.1 Stanley Smith Stevens (1906–1973)
10.2 Results From Direct Magnitude Estimation of Loudness Experiment (Stevens, 1956)
10.3 Jnd Scale, Category Scale, and Magnitude Estimation Scale for Apparent Duration

PREFACE

The inspiration for this book can be traced back to my experiences at two conferences I attended in 2008. As an assistant professor at the University of Colorado Boulder, I had been conducting research on the use of regression-based approaches to estimate the "effects" of teachers and schools on the academic achievement of students (known as "value-added models"). A requirement for using these models was the availability of longitudinal data on student test performance in the subject domains of mathematics and reading. I wondered whether the estimates from value-added models could be sensitive to choices in the psychometric methods being used to place student test scores onto a common scale, and I set out to explore this question in the context of efforts to develop scales that spanned multiple grades of schooling, a method known as vertical scaling.

The approach I took seemed sensible enough. Through a great deal of effort, I was able to secure longitudinal test data that included item-level response information for cohorts of students in the state of Colorado, and with these data in hand, I could create different vertical scales according to the different methods commonly used in psychometric practice. Next, I would apply the same value-added model to data where the only difference was the vertical scaling approach and then examine whether this led to significant differences in the estimates of school effects. In all, it was a fine study, and the findings were eventually published in two different peer-reviewed journal articles (Briggs & Weeks, 2009a, 2009b). What we found, in a nutshell, was that the method used to create the vertical scale could matter, but it mattered a lot less than you might expect.

One of the things that bothered me about this study was that I was not able to make any kind of principled recommendation as to the preferred psychometric method for creating a vertical scale. All I could say—and this seemed to be the conventional wisdom of the time—was that there is no one right way to create a scale for the purpose of making inferences about student growth (e.g., Yen, 1986). There are approaches A, B, C, and D, and each will lead to a different conclusion about the amount of growth observed from grade to grade and the variability of that growth. No one approach is better or worse than the other. But surely this is problematic, I thought, because it suggests that there are no falsifiable criteria being used to evaluate the success of any effort to construct a vertical scale, and for other uses that hinge upon judgments of magnitude, the choice of approach could be consequential.

I was invited to give a presentation on my work, prior to its publication, at a National Conference on Value-Added Modeling convened in Madison, Wisconsin, in the spring of 2008. It was there that I met Dale Ballou, an economist who was also asking questions about the intersection of psychometric and value-added modeling methodology. Ballou pointed out that the regression models being used to estimate value-added all make the assumption that student test performance can be expressed on an interval score scale. His question seemed pretty straightforward: How do we know if test scores are on an interval scale as opposed to an ordinal scale? He expected that this was a question that would have a clear answer in the literature on psychometrics and educational measurement. To a great extent, he found that the conventional answer was somewhat similar to what I had found with regard to the choice of method for creating a vertical scale: We don't really know. The paper he put together based on this inquiry (Ballou, 2009) was frankly a lot better than mine and suggested some gaps in my understanding of educational measurement that I would need to fill.
An even more fortuitous event came later that fall, when I attended a special conference on the concept of validity in the context of educational and psychological testing. The conference, organized by Bob Lissitz and held at the University of Maryland, included an invited presentation from a name I had just encountered in the reference section when reading an early draft of Ballou's paper: Joel Michell. Michell gave a talk that he had written in advance titled "Invalidity in Validity" (Michell, 2009), and the thesis he advanced was that psychometricians take for granted what should be regarded as a scientific hypothesis: that psychological attributes are measurable quantities. From Michell's perspective, this was a question that was logically prior to the question of whether test scores could be placed on an interval scale.

I was defensive as hell about Michell's thesis. But I have always appreciated scholars who are willing to fly in the face of conventional wisdom, and it was notable to me that no one attending or participating in the conference had what I considered to be an adequate rebuttal. So I set out to learn more. I read Michell's 1999 book Measurement in Psychology, Denny Borsboom's 2005 book Measuring the Mind, and many articles and book chapters that had never made their way into my graduate training. I began to grapple with the fundamental question: What does it really mean to measure in the human sciences?


Now, one thing that is appealing about Michell's book and related articles is that he has a clear and well-argued answer to this question. You can get a sense for what I took away from this and applied to the context of measuring growth with vertical scales in Briggs (2013), Briggs and Domingue (2013), and Briggs and Peck (2015). At this point, now a full professor, I thought that with a better grasp on differences in theories of measurement, I was ready to write a book that brought together issues of measurement and modeling in the context of growth in student learning. I planned to have an opening chapter that would lay out the foundational issues related to theories of measurement and stake out my own position. What began as an opening section of the chapter kept getting longer and longer and longer. And I realized that the book you have in front of you now was the book that I needed to write.

What Michell does quite masterfully in his book is to juxtapose a classical understanding of measurement with the broadened conceptualization of measurement now predominant in the human sciences. But there is something of a gap in Michell's narrative from the time when Gustav Fechner first proposed an approach for the measurement of psychological attributes in 1860 to the publication of Stevens's "On the Theory of Scales of Measurement" in 1946. Michell characterizes this period as a time when psychologists embraced Fechner's "modus operandi" and when the measurement of human attributes began to diverge from the classical roots of measurement in the physical sciences. But it constitutes only 18 pages of his book and left me with many questions. One of the things I do in this book is to fill in this gap by exploring in more detail the contributions of four men who receive relatively little attention in Michell's book: Francis Galton, Alfred Binet, Charles Spearman, and Louis Thurstone.
I also focus on the two "bookends" to the broadening of the conceptualization of measurement in the human sciences: Gustav Fechner and S. S. Stevens. In doing research on these historical figures, I came to appreciate how much more there is to understand about them. With the passage of time, it is easy to lose track of just what it was that these men actually did and the context in which they did it. In this book I try to take you inside the problems they were trying to solve and the ways they were trying to leverage or alter the concept of measurement in the process. I also show that many of the measurement concepts and assumptions we take for granted in contemporary practice have their roots in the period spanning from 1860 to 1950.

There is an important criticism that can be levied against this book that I want to acknowledge up front. Most of the research and writing for this book has taken place during a very tumultuous and fraught time in American history. During 2020, as I was completing the last chapters of this book, the entire world was facing a common crisis in the form of the COVID-19 pandemic. But in the United States, we have been dealing with another, much longer lasting pandemic in the form of systemic racism that has contributed to social and economic inequality and injustice. In late May 2020, this came to a full boil with protests that followed the murders of George Floyd and Breonna Taylor at the hands of police officers, and then again on January 6, 2021, when insurrectionists, carrying Trump paraphernalia and other symbols of racism, stormed the U.S. Capitol building to impede a lawfully elected president from assuming office.

With this as a backdrop, I have asked myself whether this is the right time to bring into the world a book that provides a history of the conceptualizations of measurement by white men as conveyed by another white man. The answer I have to this criticism comes from one of the messages to be found in the poem delivered by Amanda Gorman during the inauguration of President Joseph Biden just 15 days after the failed insurrection at the U.S. Capitol. As Gorman reminded us, to move forward as a society, we need to actively revisit and engage with the lessons of our past: "[I]t's the past we step into and how we repair it."

It is an unfortunate truth about the history of how measurement has been conceptualized and practiced in the human sciences that the people who have had the greatest influence have been almost uniformly white men of a particular elite cultural background. However, while we might lament this history and take steps to ensure that measurement as a field of study and practice becomes more diverse in the future, this does not mean that the ideas of the men who are the focus of this book, and the methods of measurement that they championed, should be rejected out of hand. Rather, we need to try to understand these ideas and methods and to do our best to discern how they were influenced by the personal and social contexts of the time. Most importantly, we need to reflect on the extent to which we have built on lessons from the past or forgotten them altogether. I have struggled the most with how to write about Francis Galton.
Galton invented and promoted eugenics as a science, and his innovations in measurement were intended to support this as a program of research. Although he did not write extensively on topics of race and racial differences, when he did (e.g., in his book Hereditary Genius), what he had to say was fairly stunning in both its callousness and its lack of evidentiary basis. By contemporary standards, Galton’s perspectives on what he took to be inherited group differences in ability were deeply racist. And in the way such perspectives were taken up to justify eugenic policies and practices, they contributed to some horrific consequences. At the same time, Galton’s perspectives as an elite member of the 19th-century British colonial empire were not unusual. This does not excuse them, but it does help to explain them. The approach I take in writing about Galton is to neither fully demonize him (as Martin Brookes [2004] does in his semifictional biography Extreme Measures: The Dark Visions and Bright Ideas of Francis Galton) nor put him on a pedestal (as Karl Pearson does in his three-volume biography of Galton). Galton, like all the historical figures I have researched for this book, and like most people in general, was a complicated human being. He made some incredible discoveries and inventions related to measurement that have made the world a better place to live. He also held some convictions and promoted generalizations that put him on the wrong side of history. I do my best to present and come to terms with both sides of Galton.

I can’t promise that after finishing this book you will emerge with clear answers to the kinds of questions about measurement that started me down this path. Indeed, I am still trying to figure those out for myself. But what I can promise is that this book will contribute to the development of your own ideas about the nature of measurement in the human sciences and will help you to participate in a scholarly conversation that we really need to be having in the 21st century. As more and more procedures are invented to attach numbers to objects and events, and as increasing numbers of statisticians, data scientists, and computer programmers are enlisted to develop these procedures, it will become easier to lose sight of what measurement entails, and of the credos and controversies that have long been a part of the dialogue about measurement in the human sciences. My hope is that this book will provide a common foundation for understanding these credos and controversies and that taking this trip through the past will help us shine a clearer light on the practices of the future.

ACKNOWLEDGMENTS

Although I first started work on a book that I was planning to call Measuring Growth in Educational Contexts in the fall of 2016, it was not until my academic sabbatical during the calendar year of 2018 that the structure of a rather different book began to take shape. The writing of this book has been an intense experience, one that fully occupied my 2018 sabbatical, the summer of 2019, and every spare moment I could find in 2020. For each of the six main historical figures who are the focus of my 10 chapters, I committed myself to a period of research that involved reading their original work in their own words (or, in the case of Fechner and Binet, in their translated words). I have also read any biographical materials or related scholarship in journal and book chapter publications that I was able to find through the University of Colorado’s network of digital resources and interlibrary loan program. Without these resources, my research would have been almost impossible to carry out.

I need to acknowledge several people for their help in providing comments on chapter drafts. These include Brian Clauser, Ben Domingue, Cristian Larroulet, Andy Maul, Josh McGrane, and Michael Russell, who each reviewed multiple chapters. Even beyond the specific comments Andy and Josh provided, I have benefited so much from my interactions with them. I hope some of their wisdom has found its way into this book, and to the extent that it has not, the fault is all mine. I want to give special thanks to Ben and Cristian for giving me timely feedback when I really needed it near the end of this project. I also need to thank David Torres Irribarra for comments on early versions of my chapters on Fechner and Thurstone; Ben Shear for comments on my introduction; and Dan Mangan for his detailed comments on my Binet chapter.

I want to express my gratitude to a handful of people whose advice and scholarship have influenced my career trajectory. Mark Wilson, who first introduced me to the idea of constructing measures in education. Paul Holland and David Freedman, who mentored and challenged me in equal measure as a graduate student. Lorrie Shepard, a role model and sparring partner with a smile throughout my time at the University of Colorado.

Completing this book required a great deal of sacrifice during a very difficult time between March 2020 and March 2021 when, like everyone else, my family was “socially distancing” primarily within the confines of our house and neighborhood walking paths. I am grateful for the support of my wife, Whitney Pinion. Without her, this book would not have been possible.

1 WHAT IS MEASUREMENT?

1.1 Keating’s War and Thorndike’s Credo

I have always loved the movie Dead Poets Society, but upon watching it again for the first time in many years, a particular scene stood out to me as providing a proper motivation for this book. The movie, which is set in 1959, tells the story of Mr. Keating (played by Robin Williams), a newly hired English teacher at a private preparatory school for boys, and the events that are set in motion as he inspires his students to challenge conformity. An early and memorable scene comes when Mr. Keating asks one of his students to read an introduction to an anthology of 19th-century poetry. The passage is titled “Understanding Poetry” and has been written by (a fictitious) Dr. J. Edwards Prichard, PhD. The following passage is read aloud in its entirety:

To fully understand poetry, we must first be fluent with its meter, rhyme and figures of speech, then ask two questions: (1) How artfully has the objective of the poem been rendered? and (2) How important is that objective? Question 1 rates the poem’s perfection; question 2 rates its importance. And once these questions have been answered, determining the poem’s greatness becomes a relatively simple matter. If the poem’s score for perfection is plotted on the horizontal of a graph and its importance is plotted on the vertical, then calculating the total area of the poem yields the measure of its greatness. A sonnet by Byron might score high on the vertical but only average on the horizontal. A Shakespearean sonnet, on the other hand, would score high both horizontally and vertically, yielding a massive total area, thereby revealing the poem to be truly great. As you proceed through the poetry in this book, practice this rating method. As your ability to evaluate poems in this matter grows, so will your enjoyment and understanding of poetry.

While this is being read, Mr. Keating has plotted out two “greatness” areas for the Byron and Shakespeare sonnets on an illustrative graph, along with the equation G = A * I, where A represents a rating of the “artfulness” with which a poem’s objective has been rendered, I represents a rating of the “importance” of the poem’s objective, and the product of the two, G, represents the measure of a poem’s greatness. He turns back to the class with the equation and graph in place and, the passage reading complete, offers a concise evaluation: Excrement! He demands that the students rip out the entire Prichard introduction so that only the poetry itself is left, and as this is happening, he explains:

We’re not laying pipe, we’re talking about poetry. How can you describe poetry like [rating a song on the show] American Bandstand? . . . This is a battle, a war. And the casualties can be your heart and soul. Armies of academics going forward measuring poetry. No, we will not have that here. In my class you will learn to think for yourselves again, you will learn to savor words and language. No matter what anybody tells you, words and ideas can change the world.

It’s an inspiring scene. It concludes with Mr. Keating reading the Walt Whitman poem “O Me! O Life!” to the students, the final lines of which propose an answer to the meaning of life: “that the powerful play goes on and you may contribute a verse.” With the students gathered around him and listening with rapt attention, he asks, “What will your verse be?”

While we can hardly take issue with Mr. Keating’s entreaty to seize the day, let’s turn our attention back to the implied villain of this scene, Dr. Prichard, and the “army of academics” measuring poetry that he is intended to personify.
Although the movie is a work of fiction, for the time period in which it was set there was, in fact, an emerging army of academics, mostly from the young field of psychology, who were leading the charge to measure all sorts of educational products and psychological attributes that seem just as resistant to quantification as poetry. One famous justification for this charge can be traced to the “Credo” articulated by the American educational psychologist Edward Lee Thorndike (1918):

Whatever exists at all exists in some amount. To know it thoroughly involves knowing its quantity as well as its quality. Education is concerned with changes in human beings; a change is a difference between two conditions; each of these conditions is known to us only by the products produced by it—things made, words spoken, acts performed, and the like. To measure any of these products means to define its amount in some
way so that competent persons will know how large it is, better than they would without measurement. To measure a product well means so to define its amount that competent persons will know how large it is, with some precision, and that this knowledge may be conveniently recorded and used. This is the general Credo of those who, in the last decade, have been busy trying to extend and improve measurements of educational products. (16)

Thorndike’s Credo does contain some pearls of wisdom. One purpose of education is, in fact, to bring about some change in what a person knows or can do. If Mr. Keating is teaching his students the craft of writing poetry over some period, there must be some collection of knowledge and skills related to the reading of poetry he is hoping that they will develop. Might there not be some value in identifying these as learning objectives so that he can track the extent to which his students are mastering them? And might not measurement play a role in all this? Indeed, perhaps it is not something about the poems we should be trying to measure but something about the students themselves!

But maybe we need to be posing a different question altogether. That is, in addition to arguing over whether we should measure the “greatness” of a poem (or the “ability” of a student to analyze poetry), we should also be asking why we are willing to believe that we can measure such things. Thorndike’s Credo emerged from a surrounding context of early 20th-century positivism that privileged quantity over quality, with the implication that if something exists, it can be measured, but if it cannot be measured, then it does not really exist. This belief, which Joel Michell has called the “quantitative imperative,” would indeed seem to place Dr. Prichard in direct opposition to Mr. Keating, who would surely point out that things like beauty, romance, and love all exist but that they are multifaceted.
When we experience them, we experience them qualitatively, as matters of degree. These are things that might not be measurable. Even if we did decide to follow the procedure described in Prichard’s passage—to give a poem’s artfulness and importance numeric ratings and then multiply the two together—why and in what sense does it follow that this act of numeric assignment constitutes a legitimate instance of measurement? What is measurement, after all?

1.2 What Is (and What Is Not) Measurement?

1.2.1 Four Definitions of Measurement

If you are reading this book, you are probably interested in the answer to this question. You might also have some preexisting ideas about the topic. As it turns out, a consensus definition that goes beyond the circular understanding
of measurement tantamount to “the act or process of measuring” can be elusive. Consider the following four single-sentence definitions, each of which differs from the others in subtle (and sometimes not-so-subtle) ways:

1. Measurement is the numerical quantification of the attributes of an object or event, which can be used to compare with other objects or events. [Conventional]
2. Measurement is the discovery or estimation of the ratio of a magnitude of a quantity to a unit of the same quantity. [Classical]
3. Measurement is the assignment of numerals to objects or events according to rules. [Psychological]
4. Measurement is the process of experimentally obtaining one or more quantity values that can reasonably be attributed to a quantity. [Metrological]

The first of these definitions, the “conventional” definition, is what can be found by conducting the laziest of internet searches (I typed the word measurement into my internet web browser). The top website that this returned was the internet encyclopedia Wikipedia, and this was the opening sentence of the definition to be found there in December 2020 (“Measurement,” n.d.). The citation associated with this sentence comes from a 1991 textbook, Measurement, Design, and Analysis: An Integrated Approach by Pedhazur, Schmelkin, and Pedhazur. Interestingly, the second sentence for the entry qualifies the opening one by noting that “[t]he scope and application of measurement are dependent on the context and discipline.” If the first definition captures some sense of the conventional wisdom of the crowds as jointly curated by Google and Wikipedia, then the next two represent definitions from specific and influential scholarship. The second “classical” definition is taken from the book Measurement in Psychology: A Critical History of a Methodological Concept by Joel Michell. The third “psychological” definition comes from two very influential publications by the psychologist Stanley Smith Stevens in the mid-20th century (Stevens, 1946, 1951). The fourth “metrological” definition has been taken from the third edition of the International Vocabulary of Measurement (VIM) and represents a consensus definition from the field of metrology1 (JCGM, 2012).

A commonality across these definitions is that measurement seems to have something to do with quantity (conventional, classical, and metrological definitions) and numbers or numerals (conventional and psychological definitions). But beyond this, if we spend some time scrutinizing each definition, they tend to raise as many questions about measurement as they answer. All but the psychological definition place a focus on quantity or quantification. Magnitudes and units figure prominently in the classical definition; experimentation and attribution, in the metrological definition. The use of measurement for the purpose of comparison is a distinguishing feature of the conventional definition. The psychological definition stands out as the broadest conceptualization of measurement, as an activity that need only involve the assignment of numerals according to rules (although perhaps it strikes some readers as unusual to refer to numerals instead of numbers). It also differs from the conventional definition in that measurement is specified as something that applies to objects and events, as opposed to an attribute of an object or event.

The most important thing to appreciate at this point is the difficulty in finding any adequate single-sentence definition of measurement. In particular, although each of these definitions provides some hints about the sorts of activities that are ruled in as measurement, they do little to settle what gets ruled out.

1.2.2 Measurement Terminology

Attributes and Objects

At least some of this confusion can be addressed through clarity about terminology. In this book, I generally distinguish between objects2—things that exist and can be directly observed at some specific location and moment in space and time—and the attributes of objects. An attribute is a characteristic of an object. The distinction between object and attribute is important because the same object can have many different attributes. Minerals can be characterized by their size, color, hardness, and aesthetic appeal; poems by their use of metaphor, sentence complexity, and emotions elicited from readers; people by their intelligence, kindness, and curiosity. There are two other terms I could just as well have used in place of attribute that could be given an equivalent interpretation: property or quality. That is, we could just as easily refer to size, color, hardness, and aesthetic appeal as four distinct “properties” or “qualities” of minerals. I refer to an “attribute” instead of a “property” for the relatively trivial reason that property has a connotation that is often specific to inanimate objects in the physical sciences. I use attribute instead of quality because, as we will see, some attributes of objects either are, or can at least be conceptualized as if they were, quantitative, and it seems more sensible to refer to a “quantitative attribute” than it does to refer to a “quantitative quality.”3

Classical Conception of Quantity, Magnitude, and Units

What defines a quantity? Aristotle (384–322 bce) distinguished between two kinds of quantities, those that were discrete and those that were continuous. An example of a discrete quantity, which Aristotle referred to as a multitude, comes from the act of counting or aggregating individual objects that are of the same kind. Because numbers to the ancient Greeks were always whole numbers, a distinguishing feature of a multitude was that it was numerable and composed of discrete units. In contrast, a continuous quantity was known by its magnitude. Euclid (fl. 300 bce) defined a magnitude as “a part of a magnitude, the lesser
of the greater, when it measures the greater” and “the greater is a multiple of the less when it is measured by the less” (Michell, 1999, 26–27). In other words, any given magnitude is just a specific level of a continuous quantity. If a single magnitude is selected as a unit, any other magnitude of a quantity can be measured as a ratio of that unit. Hölder (1901) would be the first to demonstrate that when certain axioms are met, a ratio of two magnitudes of the same quantity—measurement—provides an empirical basis for the concept of a real number in mathematics. Hence, under the classical definition of measurement, quantity and quantitative have a very restrictive meaning consistent with the Greek conceptions of the magnitude of a continuous quantity.
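This ratio conception can be put in concrete numerical terms. The following is only an illustrative sketch, not from the text; the desk, the “forearm” unit, and all values are invented:

```python
# Classical measurement: the measure of a magnitude is its ratio to a unit
# magnitude of the same quantity. (Toy values; any consistent unit works.)

def classical_measure(magnitude: float, unit: float) -> float:
    """Express a magnitude as a ratio to a chosen unit of the same quantity."""
    return magnitude / unit

desk_length = 1520.0   # hypothetical desk length, in millimeters
forearm_unit = 380.0   # hypothetical "forearm" unit, in millimeters

print(classical_measure(desk_length, forearm_unit))  # 4.0 forearms

# Changing the unit rescales the measures but preserves ratios of magnitudes,
# which is what makes the numbers behave like classical quantities:
ratio_in_forearms = classical_measure(1520.0, forearm_unit) / classical_measure(760.0, forearm_unit)
ratio_in_cm = classical_measure(1520.0, 10.0) / classical_measure(760.0, 10.0)
assert ratio_in_forearms == ratio_in_cm  # both equal 2.0
```

The choice of unit is conventional; what the classical view treats as real is the ratio between magnitudes, which survives any change of unit.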

Extensive and Intensive Attributes

Fundamental to the classical understanding of measurement is the idea that once a unit (i.e., a standard) has been selected, a continuous quantity can be decomposed and/or recomposed into some amount of these units, a magnitude that can be expressed as a real number multiplied by the unit. This requirement, known as additivity, is readily demonstrable for three of the most common measurement activities: the measurement of length, time, and weight. Here it helps that the attributes of length and weight are directly observable to the human senses in a manner that makes it easy to not only intuit but also demonstrate that the attributes are extensive—that is, they differ in magnitude as spatial events that can be perceived by sight or touch. Perhaps instead of saying that weight and length are “directly” observable, it is better to say that when we measure these attributes, it is possible to do so by comparing an unknown amount of the attribute for one object (e.g., the length of my desk) with some reference standard of that same attribute for another object (e.g., the length of my forearm).

Time is also considered an extensive attribute, but it is more abstract than length and weight. St. Augustine’s sentiment, expressed at the turn of the 5th century ad, probably holds even more strongly in the modern era:

For what is time? Who can readily and briefly explain this? Who can even in thought comprehend it so as to utter a word about it? But what in discourse do we mention more familiarly and knowingly, than time? And, we understand, when we speak of it; we understand also, when we hear of it spoken by another. What then is time? If no one asks me, I know: if I wish to explain it to one that asketh, I know not . . . (From The Confessions of St. Augustine as cited by Barnett, J.E., in Time’s Pendulum, p. 5)

In a physical sense, time is an attribute of the earth (or perhaps of our solar system).
The earth rotates on its axis and orbits the sun, and the earliest evidence of this came from cyclical changes in the light from the sun over the course of
a day and the orientation of objects in the sky over the course of a year. Over the course of our history, we have become increasingly more sophisticated in our ability to harness this natural law of astronomy so that time can be measured in terms of the duration of its passage, to the point that the meaning of a second can be completely divorced from astronomical observation. That additivity applies to time only becomes evident after it can be defined as, for example, the flow of sand in an hourglass, the cycles of a pendulum or, since 1967, the duration of 9,192,631,770 periods of the radiation corresponding to the transition between the two hyperfine levels of the ground state of the caesium-133 atom. Our understanding that time is additive—that 5 minutes is the sum of 300 seconds and always means the same thing whether it passes in the morning, afternoon, or evening—is inextricable from our ability to measure it. The point here is that even for the most canonical examples of extensive attributes, measurement cannot be separated from human culture and conventions. Nonetheless, when measurement involves extensive attributes, it can be recognized as a matter of direct comparison between two instances of the same quantity (e.g., through the use of a ruler, a stopwatch, or a balance beam).

In contrast, other well-known physical attributes—and here temperature is the canonical example—are sometimes also referred to as intensive attributes. An intensive attribute is one that, in the absence of instrumentation, is something we observe indirectly. That is, we notice qualitative differences in some object or event of interest, and we infer that this has been caused by a change in the underlying attribute. As a case in point, the coffee I drink every morning has just three approximate temperature states with the following order: too hot (I will burn my tongue if I try to drink it), just right (I can drink it and enjoy its flavor), and not hot (it is about the same temperature as my surroundings). Of course, I could measure the temperature of my coffee with a thermometer, but in the absence of this instrument and without an obvious way to demonstrate that it meets the requirement of additivity, why am I so sure that temperature is measurable as a continuous quantity? In fact, if I take my cup of coffee and add it to my wife’s cup of coffee that has been sitting on the counter for the same amount of time, I will not get a cup that is twice as hot. For an intensive attribute, it is not just that we are unable to demonstrate additivity but that the way the effects of an attribute are observed may not be the same as it increases. The underlying attribute may well be a continuous quantity, but the way we experience it is not. A slight increase in heat when my coffee is “just right” could cause me to burn my tongue, and this would be received quite differently from the effect I notice when I turn up the thermostat in my house from 65 degrees F to 68 degrees F. To reiterate, an extensive attribute is one that can be conceptualized in terms of spatial extent and directly observed as a quantity; an intensive attribute is one conceptualized as an intensity and can only be indirectly observed as ordered states.
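The coffee example can be made arithmetic. The following sketch is my illustration of idealized mixing (equal specific heats, no heat loss; the volumes and temperatures are invented), contrasting how an extensive and an intensive attribute behave when two objects are combined:

```python
# Combining two cups of coffee: volumes (extensive) add together, while
# temperatures (intensive) only average. Idealized physics, invented values.

def combine(vol1: float, temp1: float, vol2: float, temp2: float):
    vol = vol1 + vol2                            # extensive: volumes add
    temp = (vol1 * temp1 + vol2 * temp2) / vol   # intensive: weighted mean
    return vol, temp

# Two identical cups of 250 mL at 60 degrees C:
vol, temp = combine(250.0, 60.0, 250.0, 60.0)
print(vol, temp)  # 500.0 60.0 -- twice the coffee, but not "twice as hot"
```

Doubling the object doubles its volume but leaves its temperature unchanged, which is exactly why the additivity demonstration available for length or weight fails here.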


With these definitions related to classical measurement in place, what can we say about our prospects for measuring the “greatness” of a poem? Recall that our Dead Poets Society villain Dr. Prichard provided us with a procedural approach. Let us include as an additional specific detail that we would first rate the “artfulness” and “importance” of a poem on an integer scale going from 1 to 5. The greatness of a poem is then ostensibly measured by the product of the two ratings, and this product is expressed on a scale with possible whole-number values ranging from 1 to 25.

From a classical perspective, it seems impossible to regard this as a legitimate instance of measurement. What we have done here is to identify two attributes of poems that are thought to vary in a manner qualitatively discernible to a reader. The attributes for a given poem can be rated and then multiplied, but the resulting unit clearly has an ambiguous interpretation. In particular, although it might superficially appear that a poem with a greatness measure of 16 is twice as great as a poem with a measure of 8, it seems a tall order to delineate the words or sentences from the poem with a 16 that would need to be removed or added to change its measure to that of an 8. Similarly, on what basis can we conclude that the difference in greatness of 8 between these two poems is the same magnitude as the difference between two poems with measures of 20 and 12? And yet, according to the earlier psychological definition, since numerals are being assigned to poems according to rule, Dr. Prichard’s procedure does indeed appear to qualify as measurement!
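The tension can be made concrete in a few lines of code. This is only a sketch of the rule as just described, with the 1-to-5 rating scales taken from the hypothetical detail above; the example ratings are invented:

```python
# Dr. Prichard's procedure rendered as a rule for assigning numerals:
# rate artfulness and importance on 1-5 integer scales, then multiply.

def greatness(artfulness: int, importance: int) -> int:
    assert 1 <= artfulness <= 5 and 1 <= importance <= 5
    return artfulness * importance   # G = A * I, on a 1-25 scale

poem_x = greatness(4, 4)   # 16
poem_y = greatness(4, 2)   # 8

# The rule assigns numerals consistently, satisfying the psychological
# definition. Yet nothing in the rule licenses the quantitative claims the
# numbers invite: that poem_x is "twice as great" as poem_y, or that the
# gap 16 - 8 equals the greatness gap between poems scoring 20 and 12.
print(poem_x, poem_y)
```

The code runs and the numerals obey the rule; whether anything has thereby been measured is exactly what the classical and metrological definitions put in question.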

The Metrological Definition of Quantity and the Role of the Reference

Does measurement pertain only to attributes that are (or are at least hypothesized to be) continuous quantities? Or should it also apply to attributes that are only ordered, or even to attributes that are nominal categories? According to the third edition of the VIM, the source for the metrological definition of measurement, an orderable attribute is also a quantity, so it is sensible to speak of measuring it. The same cannot be said for categorical (i.e., nominal) attributes. The term quantity is defined in the VIM as “a property of a phenomenon, body or substance, where the property has a magnitude that can be expressed as a number and a reference.” Importantly, the “reference” in this definition is not restricted to a measurement unit (as in the classical conception). A reference could include documentation that describes how each of the levels of a quantity is to be interpreted. A consequence is that under the metrological definition, a magnitude can be given a broader interpretation than the ratio of an unknown magnitude to a standard magnitude. A magnitude of a quantity is also interpretable relative to ordered levels defined by a measurement procedure or a reference manual. The most commonly used example of this in the physical sciences is the Mohs Scale of Mineral Hardness, which produces a measure of hardness on a scale with integer levels ranging from 1 to 10. Each ordered level of the scale is
represented by a reference mineral. For example, levels 1 through 3 are talc, gypsum, and calcite; levels 8 through 10 are topaz, corundum, and diamond. The reference minerals are ordered according to their ability to visibly scratch or be scratched by another mineral. So gypsum is above talc because it can be used to scratch talc, but talc cannot be used to scratch gypsum. A mineral other than these reference minerals can be located between any two of these numbers (so far, another mineral that can be scratched by talc or can scratch a diamond has not, to my knowledge at least, been identified). When the hardness of the “Gorilla Glass” manufactured to cover the surface of a smartphone is thus measured using the Mohs Scale, we find that it falls between a 5 and a 6, and beyond this, the magnitude of the measure is given a qualitative interpretation relative to the two reference minerals at these levels, apatite and orthoclase feldspar.

It would seem, then, that Dr. Prichard could at least argue that a poem’s artfulness and importance can be measured in the same sense as hardness is measured using the Mohs Scale. We need only identify reference poems that can be ordered into levels of 1 through 5 in terms of both artfulness and importance. With these in mind, any new poem might be located relative to these levels. But this seems easier said than done. How do we choose our reference poems? Is there an analog to the scratching of minerals in this context? Furthermore, the ratings are only intended to be the means to the end of measuring greatness as G = A * I. What numeric measure should we assign for poems that fall between levels?

Returning to the Mohs Scale, it is well known that the differences in hardness across levels are not commensurate. Over time, more precise procedures for the measurement of hardness as a continuous quantity have been devised using an instrument known as a sclerometer. A sclerometer can be used to locate a mineral on an absolute scale as a function of the width of a scratch made by a diamond under a fixed load and drawn across the face of the specimen under fixed conditions. When the Mohs levels from 8 to 10 are compared against this more precise scale, we find that a diamond (level 10) is 3.75 times harder than corundum (level 9), while corundum is only twice as hard as topaz (level 8). The same problem is likely to appear when rating poems—the difference between a rating of a 5 and a 4 may well indicate more (or less) of the targeted attribute than a difference between a 4 and a 3.
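The ordinal logic just described can be sketched in code. This is my illustration only: the absolute hardness numbers below are invented, scaled solely so that the ratios cited above (diamond 3.75 times corundum, corundum 2 times topaz) hold:

```python
# The Mohs scale as an ordinal reference scale: each level is anchored by a
# reference mineral, and order is established by the scratch test.

MOHS = {1: "talc", 2: "gypsum", 3: "calcite",
        8: "topaz", 9: "corundum", 10: "diamond"}

def scratches(level_a: int, level_b: int) -> bool:
    """The mineral at level_a can visibly scratch the one at level_b."""
    return level_a > level_b

assert scratches(2, 1)       # gypsum scratches talc
assert not scratches(1, 2)   # talc cannot scratch gypsum

# Equal ordinal steps conceal unequal magnitudes. Illustrative absolute
# hardness values (invented units) matching the ratios in the text:
absolute = {"topaz": 2.0, "corundum": 4.0, "diamond": 15.0}
print(absolute["diamond"] / absolute["corundum"])  # 3.75
print(absolute["corundum"] / absolute["topaz"])    # 2.0
```

The scratch test supports only the comparison `level_a > level_b`; the step from 9 to 10 and the step from 8 to 9 look identical on the ordinal scale while differing wildly in underlying magnitude.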

Homogeneity

So far, I have surfaced at least two related sources of debate over what does and does not constitute measurement. The first source of debate has to do with whether measurement only pertains to attributes that are continuous quantities. Even if we broaden measurement to include attributes that are only ordered (calling these discrete quantities in contrast to continuous quantities), how can we discern an intensive attribute that is continuous from one that is ordered? Or one that is ordered from one that is not? The second source of debate comes
in the choices that are made in how we interpret and compare differences in the amount of an attribute with respect to a reference standard through the use of a measuring instrument. For a continuous quantity, we need to define a scale with a unit of measurement that is homogeneous. What this means is that a change of one unit on the scale always conveys the same kind of information about the attribute, wherever on the scale it occurs. In contrast, the unit of a measuring scale for an ordered attribute may well convey heterogeneous information as we move from one level of the scale to the next. In such cases, we would need additional information beyond the unit to interpret differences in an attribute for any two objects (e.g., qualitative descriptions of levels of the scale).

Experimentation and Invariance If you will forgive me the irresistible Mohs-related pun, we have only scratched the surface. After all, another way to defne and understand measurement is not by its central activity but by its fundamental purposes, which, in my view, are (1) to reduce our uncertainty about the quantity value of a targeted attribute and (2) to report a quantity value that can be generalized beyond the specifc and local implementation of the measurement procedure. The role of uncertainty and its reduction is alluded to in the metrological defnition with the qualifcation that measurement involves “experimentally obtaining one or more quantity values that can reasonably be attributed to a quantity.” Experimentation involves the control of irrelevant sources of variability so that an intended source can be isolated. This is what a good measuring instrument and a standardized procedure for its use are designed to accomplish. Yet what we measure is still only an estimate (a “reasonable attribution”), because uncertainty can never be fully eliminated from a measurement process, and uncertainty itself comes from more than one source: the uncertainty caused by “errors” in the measurement procedure itself and uncertainty caused by our often limited theoretical understanding of the attribute being measured. To be clear, by “theoretical” understanding I mean having a plausible hypothesis that explains when and under what circumstances values of the attribute are—and are not—expected to vary. The idea that the results of a measurement procedure should be generalizable means that the measure we produce should not depend on the characteristics of the person doing the measuring, the object being measured, or the instrument that intervenes between the measurer and the measured. A term that captures this requirement is that the measure we produce should be invariant to superfcial features of the measurement context. 
For example, imagine that I walk into a classroom and hand one of my students a ruler and two books, and ask the student to compare the two books by measuring the thickness of each one from front cover to back cover. The conclusion we reach from this comparison should not depend on the student's gender identity or on the kind of ruler I provided or the content of the books being compared. The conclusion we reach must only be sensitive to the difference in thickness between the two books.

What Is Measurement?

1.2.3 So Can We Measure the Greatness of a Poem?

In summary, then, our attempt to compare the conventional, classical, psychological, and metrological definitions of measurement has presented us with a mixed bag when it comes to an appraisal of Dr. Prichard's approach. The conventional definition seems quite consistent with Prichard's intent to express the greatness of poems on a quantitative scale so that they can be readily compared. The procedure is an instance of measurement according to the psychological definition because it involves the assignment of numerals according to rules. However, a consideration of the classical and metrological definitions suggests many reasons for skepticism. We could begin by questioning the theoretical basis for defining a poem's greatness as a continuous quantity, G = A * I, in a manner that seems to parallel the measurement of force under Newton's second law of motion, F = M * A. In Newton's second law, force, an intensive physical attribute, is measurable as a derived continuous quantity because mass and acceleration are instances of the extensive attributes mass, length, and time. But all that we seem to know about artfulness and importance is possible evidence that these attributes can be ordered. Can such evidence, presented in combination, culminate in the value of a third continuous quantity?4 Next, we might wonder about the basis for establishing reference standards that give meaning to the rating scale levels for artfulness and importance. What reason do we have to believe that this procedure would generalize irrespective of the student doing the rating, or if other poems were chosen as the references for the rating scales, or if the poem in question happened to be a sonnet, a limerick, or a haiku? Finally, we should question the extent to which his procedure will reduce our uncertainty about a poem's greatness relative to the more qualitative impression we would have had in its absence.
Beyond defining some terms that will appear throughout this book (and giving you a whole new appreciation for a short scene in Dead Poets Society), I hope that I have been successful in convincing you (or perhaps reaffirming for you) that measurement is a much more complicated concept than it might first appear. There is more to measurement than attaching numbers to objects or events according to rule. Even when restricted to the domain of the physical sciences, measurement only appears straightforward because it has been practiced and improved on for hundreds of years. Behind the scenes, an incredible amount of work has been invested to develop a theoretical understanding of the attributes being measured, to design instrumentation that is sensitive to them, and to ensure that units for measurement do not depend on the objects being measured or the instrument being used to measure. For example, the meter and the second are two of the seven base units in the International System of Units (SI). The ability to realize and reproduce all the base units of the SI in controlled experimental settings is the reason that all the different instruments used to measure length and duration can be calibrated to a common standard. Similarly well-understood units of measurement are nowhere to be found when it comes to the quantification of nonphysical human attributes. Is the pursuit of such units a worthwhile aspiration, a fool's errand, or something in between?

1.3 Educational and Psychological Measurement

As I write this in 2021, it has become commonplace for people in the human sciences (e.g., those who study education, economics, medicine, political science, psychology, and sociology) to invoke the terminology of measurement when seeking to quantify educational or psychological attributes of a person. In educational contexts, large-scale standardized tests are written under the pretext that they provide the instrumentation necessary to measure what test designers often refer to as a construct. By the same token, in social psychology, it is often claimed that surveys with rating scale items provide the means by which it is possible to measure a person's attitudes, personality, and so-called noncognitive constructs. When applying statistical models to test or survey item responses, psychometricians and statisticians may opt to drop the construct designation and instead simply designate a latent trait or ability, and attach to it the symbol θ. Our friend theta plays a very specific technical role in the statistical model—it is there under the assumption that its presence makes item responses conditionally independent. But it is rather easy to lose track of what theta is supposed to be. Is it an attribute, a construct, or a latent trait? Are all these things synonymous?5 We may not think twice when we hear claims to the effect that a test measures the construct of reading or that a survey measures the latent trait of grit. But if you will allow me one last quote from another fictional character, Inigo Montoya from the movie The Princess Bride: "You keep using that word. I do not think it means what you think it means."

My own field of specialization is typically described as educational measurement, yet I have found from experience that most of my colleagues who regard themselves as either practitioners or purveyors of educational measurement find themselves a bit tongue-tied when asked to explain what educational measurement is supposed to mean with precise language. The closest thing that the field has to the VIM is the Standards for Educational and Psychological Testing. Nowhere in this document can one find a consensus definition provided for educational measurement. Even more surprisingly, one would also emerge empty-handed after consulting the third and fourth editions of the edited volume Educational Measurement published in 1989 and 2006; one would have to turn all the way back to the second edition to find a chapter that discusses the nature of measurement (Jones, 1971). Perhaps the most authoritative definition can be traced back to Lord and Novick's (1968) seminal book Statistical Theories of Mental Test Scores: We shall define measurement to mean a procedure for the assignment of numbers (scores, measurements) to the specified properties of experimental
units in such a way as to characterize and preserve specified relationships in the behavioral domain. (7)

This definition is something of an amalgamation of the conventional, psychological, and metrological definitions presented earlier. It omits the aspects that were most salient in the classical definition (quantity, magnitude, and units) but inserts a new idea: that the rule for numeric assignment should allow for the preservation of "specified relationships in the behavioral domain." In my view, the fact that few would know enough about what Lord and Novick seem to have had in mind here to say whether they agree or disagree with this definition weakens the foundation on which future progress in educational measurement can be built. When we invoke the terminology of measurement—whether in the context of a physical attribute or a psychological one—we are making a commitment to a scientific enterprise that involves the interaction between four distinct bodies of knowledge and practice:

1. substantive theory (or theories) about the attribute of interest;
2. instrumentation designed to elicit variability in the attribute of interest;
3. standardization to ensure that the results of measurement, the numeric values, will have an interpretation that is trustworthy and invariant (within specified limits) to the objects being measured, the instruments used to produce the numeric values, and the person who is interpreting the values; and
4. mathematical analysis, the application of formal operations and models to these numeric values to ensure there is a correspondence between these values and the attribute of measurement, and for the purpose of quantifying the uncertainty of the values.

A full discussion of these different aspects of measurement and how they interact, developed in collaboration with my colleagues Andy Maul and Josh McGrane, is outside the scope of this book, but see Briggs, Maul, and McGrane (forthcoming) for details. Also see Maul, Torres Irribarra, Mari, and Wilson (2018) and Mari, Wilson, and Maul (2021). One takeaway of these sorts of general frameworks is that just as there is more to measurement in general than numeric assignment by rule, there is more to educational measurement in particular than requiring a student to spend an hour completing a test composed of multiple-choice items. Testing and measurement are two distinct activities. This assertion is so important that it bears repeating: Testing and measurement are two distinct activities. When certain assumptions are made and conditions are met, they overlap, and in such instances, it may well be a reasonable shorthand to refer to testing as "educational measurement." But it is important to appreciate the way that this move implies an elevation of testing onto the same level of implied authority
that would be found in the field of metrology for the measurement of temperature and time. The desire to equate the objectivity and trustworthiness of educational measurement to the objectivity and trustworthiness of physical measurement, or to at least formalize this as a desirable analogy, was very much at the heart of Thorndike's Credo. There is nothing inherently objectionable about this aspiration. Some psychological attributes that we first appreciate as qualities may well be amenable to measurement as quantities. What must be appreciated is that the aspiration underlying Thorndike's Credo represents a measurability hypothesis (Michell, 1999). To support this hypothesis, we must overcome the objection that differences in the attribute in question can only be appropriately understood qualitatively. This is the quantity objection. To ignore the quantity objection altogether—to jump from test to measurement in one fell swoop—is to court disaster. When tests are automatically granted the status of measurement, they are that much more easily appropriated as vehicles for social injustice, even when this may well have been the opposite of the intent of the test designer. I illustrate this with the work of Alfred Binet in Chapter 5 (for another book-length treatment of this theme, see Stein, 2016). One way to sidestep the need to corroborate the measurability hypothesis in a testing context—the need to explain how and in what sense a test serves as an instrument of measurement—is by shifting the focus from the test itself to the way that the scores from a test are being interpreted and used. The latter approach is at the heart of the consensus view of validity theory that has evolved in the literature on educational and psychological testing (Messick, 1989; Kane, 2006; AERA/APA/NCME, 2014).
The Standards for Educational and Psychological Testing offers a fairly concise definition of (test) validity (AERA/APA/NCME, 2014, 11): "validity refers to the degree to which evidence and theory support the interpretations of test scores for proposed uses of tests." In this sense, it would seem we can avoid complicated and controversial debates over the boundaries of what does and does not constitute an instance of measurement. We can simply adopt the broadest possible understanding of measurement vis-à-vis the psychological definition popularized by Stevens: grant all tests the status of measures and make qualitative judgments about the degree to which designated interpretations and uses of these measures are valid. But as Denny Borsboom and colleagues have argued, this apparent escape from different conceptions about the meaning of measurement is only an illusion (Borsboom, Mellenbergh, & Van Heerden, 2004; Borsboom, 2005; Markus & Borsboom, 2013). As just one example, while the Standards provides no definition of measurement, one of its three "foundation" chapters is devoted to the topic of "errors in measurement." But how can the concept of measurement error be understood before we have settled on whether (and in what sense) the attribute to be measured exists independently of the test used to measure it?6 Questions about the meaning and boundaries of measurement are just as important (if not more so) to discussions about the concept of validity in education and psychology as they are in metrology.

1.4 Overview of This Book

Although the concept of measurement and its relationship and applicability to latent human attributes has likely been a matter of debate and philosophical discourse that precedes written records, the development and justification of formal procedures to quantify these attributes are relatively new in a historical sense. But not that new! It is surely not much of an exaggeration to say that most graduate students, after they have completed coursework related to psychometrics and educational measurement, would date the birth of the field to the second half of the 20th century. And this would be understandable. Much of what students are taught about the workings of classical test theory and item response theory can be dated to 1968 with the work of Lord, Novick, and Birnbaum in Statistical Theories of Mental Test Scores. This was one of five books that I would identify as landmark events in educational and psychological measurement, and each appeared within a handful of years of one another. Rasch's book Probabilistic Models for Some Intelligence and Attainment Tests came out in 1960; The Measurement and Prediction of Judgment and Choice by Bock and Jones was published in 1968; the first of the three-volume series Foundations of Measurement by Krantz, Luce, Suppes, and Tversky was published in 1971; and The Dependability of Behavioral Measurements by Cronbach and colleagues appeared in 1972. These seminal contributions to the theory and practice of measurement in the human sciences came at a dizzying pace, and the insights to be found in them are no less relevant now than they were then. But the student approaching each of these books with the question, "What does it mean to measure a latent human attribute?" would almost certainly emerge confused, because these books take rather distinct theoretical perspectives on this topic.
If we want to understand these different perspectives, we need to go further back in time, to locate the giants on whose shoulders these new giants were standing. To do this, I have chosen to focus attention on the historical period between roughly 1860 and 1960 and on the innovations and contributions spearheaded by six men from four different countries: Gustav Fechner (Germany), Francis Galton (England), Alfred Binet (France), Charles Spearman (England), Louis Thurstone (United States), and Stanley Smith Stevens (United States). In this book, I use the careers of each of these men as a vehicle to introduce concepts that are at the heart of the different rationales provided for the claim that latent human attributes are measurable. Each of these men faced some version of the quantity objection I have described in this introduction, and each proposed methods and rationales that could be used to overcome it. None of these methods and rationales was beyond reproach; indeed, many of them were as questionable then as they are now, and it was seldom the case that the quantity objection was explicitly acknowledged. But all of them were unquestionably influential in shaping subsequent discourse and practice of measurement in the human sciences. There is already great preexisting scholarship on these men
and their contributions, and I conclude my chapters on them with a short section in which I provide recommendations for further reading. The added value in this book comes in my effort to contextualize their contributions with respect to the concept of measurement in the human sciences. We will get inside the details of not just what they did but also how they did it and what motivated them. We will see how their methods of measurement were received by contemporaries and how this led to controversies that remain with us to the present day. There are rabbit holes aplenty, and we will have some fun exploring them. Figure 1.1 provides a timeline of some of the landmark events that will play a major part in the story I tell in this book.

Year | Chapter | Historical Figure | Landmark Development in Measurement Theory and Practice
1835 | 3 | Quetelet | A Treatise on Man and the Development of his Faculties
1846 | 2 | Weber | The Sense of Touch and the Common Sense
1860 | 2 | Fechner | The Elements of Psychophysics
1869 | 3 | Galton | Hereditary Genius
1875 | 3 | Galton | Statistics by intercomparison with remarks on the law of frequency of error
1883 | 4 | Galton | Inquiries into Human Faculties
1886 | 4 | Galton | Regression toward mediocrity in hereditary stature
1887 | 2 | Fechner | On the principles of mental measurement and Weber's law
1888 | 4 | Galton | Co-relations and their measurement
1889 | 3 | Galton | Natural Inheritance
1890 | 5 | Cattell | Mental tests and measurement
1896 | 5 | Binet & Henri | Individual psychology
1901 | 5, 6 | Wissler | The correlation of mental and physical tests
1901 | 2 | Hölder | The axioms of quantity and the theory of measurement
1904 | 6, 7, 8 | Spearman | Proof and measurement of association between two things
1904 | 6, 7, 8 | Spearman | "General intelligence" objectively determined and measured
1904 | 3 | Thorndike | An Introduction to the Theory of Mental and Social Measurements
1904 | 6 | Pearson | On the laws of inheritance in man
1905 | 5 | Binet & Simon | New methods for the diagnosis of the intellectual level of subnormals
1908 | 5 | Binet & Simon | The development of intelligence in the child
1909 | 5 | Binet | Modern Ideas about Children
1910 | 6 | Spearman | Correlation calculated from faulty data
1911 | 5 | Binet | New investigations upon the measure of the intellectual level among schoolchildren
1911 | 6 | Brown | The Essentials of Mental Measurement (1st edition)
1912 | 7 | Hart & Spearman | General ability, its existence and nature
1916 | 5 | Terman | The Measurement of Intelligence
1916 | 8 | Thomson | A hierarchy without a general factor
1919 | 5 | Yerkes | Report of the psychology committee of the national research council
1920 | 10 | Campbell | Physics, the Elements
1921 | 2, 6, 7, 8 | Thomson & Brown | The Essentials of Mental Measurement (2nd edition)
1925 | 5, 9 | Thurstone | A method of scaling psychological and educational tests
1927 | 9 | Thurstone | Psychophysical analysis
1927 | 9 | Thurstone | A law of comparative judgment
1927 | 9 | Thurstone | A mental unit of measurement
1927 | 7, 8 | Spearman | The abilities of man: their nature and their measurement
1927 | 5, 8 | Thorndike et al. | The Measurement of Intelligence
1928 | 9 | Thurstone | Attitudes can be measured
1928 | 10 | Campbell | An account of the principles of measurement and calculation
1934 | 8 | Thurstone | The Vectors of Mind
1936 | 8 | Guilford | Psychometric Methods (1st edition)
1939 | 8 | Thomson | The Factorial Analysis of Human Ability (1st edition)
1946 | 10 | Stevens | On the theory of scales of measurement
1947 | 8 | Thurstone | Multiple Factor Analysis
1950 | 6 | Gulliksen | Theory of Mental Tests
1951 | 10 | Stevens | Mathematics, measurement and psychophysics
1958 | 10 | Torgerson | Theory and Methods of Scaling
1959 | 10 | Stevens | On the validity of the loudness scale
1959 | 10 | Luce | Individual Choice Behavior

FIGURE 1.1 Timeline of Historical Contributions to Measurement Theory and Practice Covered in This Book.

This story begins in the mid-19th
century with Gustav Fechner (1801–1887), an eccentric German academic with a joint passion for physics and philosophy. Fechner, building on earlier investigations from his colleague Ernst Weber, was the first to introduce both a theory that psychological attributes were measurable and a method for how such measures could be realized in an experimental setting. Fechner coined the term psychophysics to make explicit his conviction that physical stimulus and psychological sensation were intertwined. If changes in physical stimuli have a mathematical relationship to psychological sensation, then it follows that if the measurement units of the stimuli are known in advance, analogous psychological measurement units can be derived and established. I emphasize a theme from this introduction that will also reveal itself in the work of Galton, Binet, Thurstone, and Stevens: the recognition that measurement, whether in physics or psychology, requires a well-defined unit. In Chapters 3 and 4, we turn our attention to Francis Galton (1822–1911), surely the most influential of all the major historical figures considered in this book as well as the most controversial. Galton developed a love affair with the normal distribution as a tool for both measurement and the study of individual differences. Galton's theory was that all human attributes, whether physical or psychological, were predominantly inherited and therefore could be modeled as the results of a largely random process. When paired with the assumption that psychological attributes are continuous quantities, this opened the door to what Galton would describe as a relative approach to measurement, in which individual differences observed in the form of rankings could be transformed into a distribution characterized by standard deviation units.
Galton understood that relative measurement fell short of absolute measurement, but he would justify it for practical reasons, not the least of which was that it provided the inputs needed for the bivariate correlational analyses that he would also invent. The harbinger of Thorndike's Credo was Galton's favorite motto, "Whenever you can, count." Galton believed that societal progress would only be possible if human attributes could be studied and monitored quantitatively, and he embraced a fairly encompassing view of measurement to that end. Galton established some of the first public demonstrations of how this could be accomplished at scale. The ominous backdrop to Galton's quantitative imperative was his advocacy of eugenics, and the unqualified use of measurement as justification for horrific eugenic policies and practices in the early 20th century gave mental testing a black eye from which it has never entirely recovered. It remains a critically important cautionary tale. In Chapter 5, we meet Alfred Binet (1857–1911), commonly remembered as the inventor of the intelligence test. It was Binet who accomplished what had eluded Galton and those Galton had influenced—he developed a practically efficient normative method for classifying children according to their intelligence that did not rely on sensory discrimination and reaction speed. It is also true that Binet, like Galton, adopted a fairly liberal perspective on measurement.
Although Binet knew his method only produced a ranking of children by their levels of intelligence—a heterogeneous order—he justified it as a measurement procedure on the practical grounds that it was more objective and informative than prevailing methods. But Binet departed from Galton and the other figures in this book in his desire to use the results of his measurement procedure for fundamentally diagnostic purposes. Differences in degree were of much greater interest to Binet than differences in amount, and Binet had an educational use in mind for his measuring instrument. The idea was to use measurement to intervene in order to improve the livelihoods of children with learning disabilities who were struggling academically, a far more progressive perspective than that which motivated Galton and his followers. Yet by introducing a scale for reporting the results of his intelligence test expressed in the temporal units of age, Binet, through his instrument, nonetheless contributed to the perspective that psychological attributes were just as quantifiable as physical ones. In Chapters 6 through 8, we shift focus to Charles Spearman (1863–1945), who developed a quantitative theory of intelligence during the first three decades of the 20th century that, in his view, explained why Binet's measuring scale was embraced with such enthusiasm. Spearman proposed that the intelligence measured on any test was a function of two factors: one that was general to any test and one that was specific. Just as important, Spearman introduced a confirmatory mathematical approach that could be used to evaluate a battery of tests for the presence of a general factor, an approach premised on the successive computation of partial correlation coefficients that evolved into the method of factor analysis.
Spearman referred to his approach as measurement, but it was measurement of a very unusual kind, with a dimensionless quantity value that emerged from the secondary analysis of a matrix of correlation coefficients. It was Spearman who first had the insight that the correlation between any two measures could be influenced by errors that had nothing to do with sample size. The insight led him to introduce the concept of reliability and a method for using estimates of reliability to adjust a correlation coefficient for attenuation caused by measurement error. At the heart of Spearman's method for disattenuating a correlation coefficient was a simple linear error model that made up the crux of his approach to factor analysis and would also be taken up by others to derive the core results of classical test theory. In Chapter 9, we take a close look at Louis Thurstone (1887–1955) and his approaches to what would become known as educational and psychological scaling, with a particular focus on his law of comparative judgment. Thurstone introduced a test scaling approach as a superior alternative to Binet's mental age scale and his law of comparative judgment as a more complete formulation of the implicit model underlying Fechner's work. In both cases, Thurstone followed in Galton's footsteps in using the normal distribution as a foundational assumption but relaxed an implicit constraint of constant variability. In his reconceptualization of psychophysics, Thurstone maintained the Fechnerian premise that
psychological attributes were measurable but broadened the premise to apply to any psychological attribute, even those that could not be linked to a physical stimulus. The Thurstonian approach began with the presumption that any attribute, whether cognitive or affective, could be plausibly constructed to take the form of a continuous quantity, with a unit effectively defined by the randomness in human response tendencies. But Thurstone did not take these matters on faith; he proposed methods to be used to demonstrate that the results of his scaling approaches were invariant, the evidence he felt was necessary to conclude that a psychological attribute could be successfully measured. In the last chapter, we focus on Stanley Smith Stevens (1906–1973). We have already encountered Stevens's influential definition of measurement as "the assignment of numerals to objects or events according to rules" earlier in this chapter. Stevens's definition included virtually all activities that involved numeric assignment as instances of measurement but made distinctions regarding the "level" of measurement that could be attained through his taxonomy of nominal, ordinal, interval, and ratio scales. Stevens also introduced the concept of permissible statistics, arguing that the types of statistical analyses that could be justified depended on the strengths of the scales of the relevant measures. We will carefully unpack his definition and accompanying taxonomy. In the process, we will see how the wording in Stevens's definition comes from the attempt to combine two different theories of measurement—representationalism and operationalism. Stevens proposed that the activity of measurement was one of numeric representation but rejected additivity as the crucial assumption and dropped altogether the formal distinction between an object and an attribute of an object.
Stevens would argue that additivity was just one of many "rules" that could be used in any experimental procedure that required humans to relate numerals to objects or events. In a very literal sense, Stevens regarded humans themselves as the instruments by which sensory sensations could be measured. Stevens and other psychophysicists invoked a variety of different operational procedures in their experiments to elicit judgments about physical stimuli with respect to order, difference, and ratio. Under Stevens's theory, each of these procedures resulted in an alternative operationalized measure, and each operation differed with respect to the strength of scale it was intentionally designed to produce. There is, at present, no single accepted answer to the question of what is (and what is not) measurement in the human sciences. Perhaps a single consensus answer is unattainable, and indeed, there are likely many practicing educational measurers and psychometricians who have never given the question much thought. One reason for this, in my view, is that psychometrics is a young field and lacks a coherent disciplinary focus in the way that the concept and practice of measurement is taught to graduate students. Students tend to think of "measurement" as synonymous with the different psychometric models to which they have been introduced. The focus is often on understanding these models at what Borsboom (2005) would call the "syntactical" level. That is, students learn how to write out
an expression of a model in statistical notation, distinguish known variables from unknown parameters, and then use real or simulated data to get estimates for these parameters. They learn the syntax. But the semantics—what the parameters really mean, and the rationale for why they are "measures"—is missing. If there is a distinction to be made between measurement and statistical modeling, it is typically lost. Years later, students may enter academia themselves or become practitioners in the survey, testing, and assessment industry. They typically lack the historical and conceptual foundation needed to come to their own understanding of what it means to do measurement in the human sciences. This book is an attempt to help provide this foundation.7

Appendix: Some Important Statistical Concepts

There were several developments in statistics through the mid-19th century that greatly influenced Fechner and Galton, whose work we encounter in the next three chapters, so I briefly review them here. Most of these ideas now form part of the foundation of introductory courses in statistics (readers who already have this foundation in place can skip this appendix). In the mid-19th century, however, their applicability to the study of natural and social phenomena was just beginning to gain traction.

The Binomial Probability Distribution

One of the oldest developments in probability theory, dating back to the mid-17th century, is the concept of a frequency distribution used to communicate the likelihood of unique outcomes in games of chance. Consider the simplest example of any game of chance that involves, in some capacity, the flipping of a coin or the rolling of a die, and two possible outcomes of interest (e.g., head vs. tail, even number vs. odd number, etc.). Each flip or roll is an independent random event in the sense that there is tangible uncertainty about the outcome (uncertainty that can be experienced directly through observation), and the outcome of one flip or roll has no impact on the outcome of the next one. If the sides of the coin or die are equally balanced, then the probability of observing any possible outcome on a single flip or roll should be the same. These are some of the most elementary ideas about probability, ideas that quickly get more interesting in the context of a sequence of outcomes. Say that we play a game in which we get to flip a coin 10 times. Afterward, we count up the number of times the coin landed on the "heads" or "tails" sides, and we earn $1 for each head and lose $1 for each tail.
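To make the payoff structure of this game concrete, it can be simulated in a few lines. The sketch below is my own illustration (not from the text); it assumes a fair coin, and `play_game` is a hypothetical helper name:

```python
import random

random.seed(42)  # reproducible illustration

def play_game(n_flips: int = 10) -> int:
    """Play one round of the game: +$1 for each head, -$1 for each tail."""
    heads = sum(random.random() < 0.5 for _ in range(n_flips))
    tails = n_flips - heads
    return heads - tails  # net winnings in dollars

# Net winnings for five independent games: each is necessarily an even
# number between -$10 (all tails) and +$10 (all heads).
games = [play_game() for _ in range(5)]
print(games)
print(all(-10 <= w <= 10 and w % 2 == 0 for w in games))  # True
```

Because the winnings are determined entirely by the number of heads, the distribution of outcomes across many such games is exactly the distribution characterized in the next paragraph.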

22 What Is Measurement?

The range of possible outcomes we may observe in this situation can be characterized by a binomial probability distribution function:

$$P(X = k) = \frac{n!}{k!(n-k)!}\, p^{k} (1-p)^{n-k}. \qquad (A1.1)$$

The notation of the binomial distribution is easy enough to translate into our coin-flipping scenario if we go from right to left of Equation A1.1. Define a coin landing on the head side as a "success" and on the tail side as a "failure." Now define each flip of the coin as an independent event known as a trial. It follows that the term n represents the total number of trials, the term k represents the total number of successes out of n trials, and the term p is the probability of a success on each trial. In the example of our coin-flipping game, n = 10, p = 0.5, and k is unknown but could take on any one of 11 possible values from 0 to 10. (When n = 1, the binomial distribution reduces to a Bernoulli distribution.) Finally, the term $\binom{n}{k}$ (or, equivalently, $\frac{n!}{k!(n-k)!}$) indicates, for each possible value of k, the number of unique ways that k could result from a sequence of n trials. For example, when k = 5,

$$\frac{10!}{5!\,5!} = \frac{10 \cdot 9 \cdot 8 \cdot 7 \cdot 6}{5 \cdot 4 \cdot 3 \cdot 2 \cdot 1} = 252.$$

So there are 252 ways that 5 successes could be accumulated over 10 trials, and since the probability of any one of these sequences is $0.5^{10} = .0009766$, the sum is 0.246. A plot of the distribution of (hypothetical) results for all possible outcomes is shown in Figure 1.2. Crucially, the binomial distribution represents a thought experiment with respect to possible outcomes, since for any single game involving 10 rolls of a

FIGURE 1.2 Example of a Binomial Probability Density Histogram With Normal Curve Overlay.


die, what we observe is not a probability but a single number, the total number of heads. The application of the binomial formula makes it possible to predict what the distribution of outcomes will look like if the process (e.g., our coin-flipping game) could be repeated indefinitely under the same fixed conditions. Hence it is worthwhile to distinguish between the trials of a binomial procedure, n, which are fixed, and the replications of the procedure, R, which are neither a fixed nor a variable condition of the distribution. To the extent that a number of replications is specified at all, this will typically be some large number R used to simulate an asymptotic result as R approaches infinity.
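The arithmetic above, and the distinction between fixed trials n and replications R, can be checked with a few lines of code. The following is a minimal sketch (function and variable names are mine, not the book's) that computes the exact probabilities of Equation A1.1 and then approximates them with a finite number of replications:

```python
import math
import random

def binomial_pmf(k, n, p):
    """P(X = k): Equation A1.1, using the binomial coefficient n!/(k!(n-k)!)."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)

# 252 distinct sequences of 10 flips contain exactly 5 heads ...
print(math.comb(10, 5))                    # 252
# ... each occurring with probability 0.5**10, so P(X = 5) is about 0.246.
print(round(binomial_pmf(5, 10, 0.5), 3))  # 0.246

# R replications of the 10-flip game approximate the same distribution.
random.seed(1)
R = 100_000
counts = [sum(random.random() < 0.5 for _ in range(10)) for _ in range(R)]
print(counts.count(5) / R)                 # close to 0.246
```

As R grows, the simulated proportion converges to the exact binomial probability, which is precisely the asymptotic reading of the distribution described above.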

The Law of Errors: The Normal Distribution

The expected value, or mean, of a binomial distribution is np, while the variance is np(1 − p). Importantly, as n increases, the shape of a binomial distribution is also increasingly well approximated by a symmetrical bell-shaped curve. This curve is commonly referred to as the normal distribution, one of the most widely invoked mathematical expressions in modern statistics:

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^{2}}. \qquad (A1.2)$$

The expression in Equation A1.2 is the normal probability density function and differs most notably from the binomial distribution in its exponential form and in that it applies to a continuous variable, x, as opposed to a discrete variable, k. Just as the shape of the binomial distribution is characterized by two parameters, n and p, the shape of the normal distribution is characterized by its mean, μ, and variance, σ². The name "normal distribution" did not become conventional until the early 20th century, and Francis Galton was among three men whom Stigler (1999, 410–414) credits for both first using and popularizing the term (the others were Benjamin Peirce and Karl Pearson). During the 19th century, it was more commonly referred to as the law of errors, the law of deviations, or the Gaussian distribution. The latter is in honor of the mathematician Carl Friedrich Gauss, who derived the basic expression8 in 1809 near the end of his monograph Theory of Motion of the Celestial Bodies Moving in Conic Sections Around the Sun. For more on the context in which Gauss derived the law of errors, see Teets and Whitehead (1999) and Stahl (2006).
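The quality of the normal approximation to the binomial can be checked directly. This is an illustrative sketch (the helper names are mine), comparing the binomial probabilities for n = 100, p = 0.5 against the density of Equation A1.2 with mean np and variance np(1 − p):

```python
import math

def binomial_pmf(k, n, p):
    """Exact binomial probability, Equation A1.1."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)

def normal_pdf(x, mu, sigma):
    """The normal probability density function, Equation A1.2."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

n, p = 100, 0.5
mu, sigma = n * p, math.sqrt(n * p * (1 - p))  # mean np, sd sqrt(np(1 - p))
for k in (40, 50, 60):
    print(k, round(binomial_pmf(k, n, p), 4), round(normal_pdf(k, mu, sigma), 4))
```

For n this large, the two columns agree to about three decimal places, which is the sense in which the bell curve "approximates" the binomial histogram of Figure 1.2.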

The Central Limit Theorem

The link among the binomial distribution, Gauss's formula, and a law of errors is the central limit theorem, the original contribution of Pierre Laplace, first introduced in 1777 and then further refined in 1811. The crux of the central limit theorem is that if a sum is taken of a series of independent random


variables, the probability distribution function of the resulting sum across infinite replications will converge to the normal distribution as the series being summed gets increasingly large. Most powerfully, so long as the number of variables combined to form the sum is large, the sum will follow a normal distribution even if the individual variables that compose the sum do not. Given this, it becomes apparent that the ability to approximate a binomial distribution with a normal distribution is itself a special case of Laplace's central limit theorem, since an outcome, X = k, can be conceptualized as the sum of n independent random variables (e.g., 10 flips of a coin). Gauss's derivation of the normal distribution had followed at least a century of speculation and concern about the accuracy of observations being made by astronomers about the locations of celestial objects (i.e., stars, moons, planets). Prior to Gauss, Laplace had already proposed a number of functional "error curves," whereby the notion of an error was to be understood as the difference between the observation made by an individual about the location of a celestial object and the true position of that object. Note that in such contexts, the attribute of interest was distance, a continuous quantity. The connection of Laplace's central limit theorem to Gauss's proposed error curve provided a theoretical rationale for the conditions under which the curve would be expected to apply to natural and social phenomena. It required the availability of a variable that could be conceptualized as the result of a measurement procedure in which the "error" of measurement was itself the effect of many small independent random causes. Since the central limit theorem can be said to provide the "law" for the distribution of errors in Gauss's function, it is more accurate to refer to the law of errors as a synthesis of Gauss's and Laplace's contributions.9
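A quick simulation makes the theorem concrete. In this sketch (names are mine), each summand is Uniform(0, 1), which is far from normal on its own, yet the distribution of the sum behaves normally:

```python
import random
import statistics

random.seed(7)
n_terms, R = 50, 20_000

# Each replication: the sum of 50 independent Uniform(0, 1) variables.
sums = [sum(random.random() for _ in range(n_terms)) for _ in range(R)]

# A Uniform(0, 1) variable has mean 1/2 and variance 1/12, so the sum
# should have mean 50/2 = 25 and variance 50/12 (about 4.17).
print(round(statistics.mean(sums), 1))
print(round(statistics.variance(sums), 1))

# For a normal distribution, about 68% of values fall within one
# standard deviation of the mean.
m, sd = statistics.mean(sums), statistics.stdev(sums)
share = sum(abs(s - m) < sd for s in sums) / R
print(round(share, 2))  # near 0.68
```

The 68% check is one signature of normality; the same simulation with a larger n_terms only tightens the agreement, which is the convergence the theorem asserts.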

The Probable Error

The form of the normal distribution most common in the mid-19th century was

$$f(x) = \frac{1}{c\sqrt{\pi}}\, e^{-x^{2}/c^{2}}, \qquad (A1.3)$$

where the measurements x are already expressed as deviations from a population mean (hence μ = 0) and $\sigma^{2} = c^{2}/2$. It was conventional in this era for statisticians to characterize the dispersion of a normal distribution (c in Equation A1.3) with the probable error,10 computed as one half the difference in values from the 25th to the 75th percentiles of the distribution (i.e., half the interquartile range). In the present, this would be done by getting an estimate of the standard deviation, σ, computed as the root mean square of deviations. A probable error is smaller than a standard deviation since $\sigma = c/\sqrt{2}$. Even more confusingly, by


early in the 20th century, the standard deviation had replaced the probable error statistic described earlier, and the term probable error was now instead being used to characterize what we would now think of as a standard error (or the standard deviation of a sampling distribution). This was the way Spearman used the term in his writing discussed in Chapters 6 through 8.
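The relationships in this section can be made concrete with Python's statistics module. In this sketch (variable names are mine), the probable error of a normal distribution works out to a fixed fraction, roughly 0.6745, of the standard deviation, confirming that it is the smaller of the two dispersion measures:

```python
import math
import statistics

sigma = 2.0
nd = statistics.NormalDist(mu=0.0, sigma=sigma)

# Probable error: half the interquartile range (25th to 75th percentile).
probable_error = (nd.inv_cdf(0.75) - nd.inv_cdf(0.25)) / 2
print(round(probable_error / sigma, 4))  # 0.6745, so PE < sigma

# The 19th-century dispersion parameter c of Equation A1.3 satisfies
# sigma = c / sqrt(2), so PE is an even smaller fraction of c.
c = sigma * math.sqrt(2)
print(round(probable_error / c, 4))      # about 0.4769
```

The constant 0.6745 is why older tables report probable errors that look systematically smaller than modern standard deviations computed from the same data.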

Notes

1 We will encounter metrology in more detail in the next chapter. Briefly, however, the field of metrology is devoted to the scientific study of measurement and its applications in the physical sciences. The most important outward-facing product of metrology, along with the VIM, is the International System of Units.
2 In the physical sciences, it would be equally sensible to include events and phenomena along with objects as the "things" to which measurement applies. However, since this book is focused on the measurement of human attributes, "objects" will usually serve as a convenient shorthand in place of "objects, events, or phenomena."
3 There is more to this distinction in the context of measuring a psychological attribute than I am letting on at this point. For the reader for whom such matters are new, it is probably best to leave this for later. For others, I offer the additional rationale: If we proceed from the assumption that psychological attributes are reflective latent variables, then it will always be the case that what we know about them comes indirectly, either from the observations we make about their hypothesized effects on, say, test or survey item responses or, even less formally, from observations we make from our daily experiences. Because our observations about these effects are only indicative of order (e.g., some responses are deemed more or less correct; some people by their actions and statements appear to have more or less of the attribute), the structure of the underlying attribute should be considered an open question. Either an additional assumption or some form of empirical evidence is necessary if we choose to describe the attribute as quantitative. Therefore, initial speculations about a psychological attribute are always qualitative. Our starting point is to notice a particular attribute or, synonymously, a particular quality of an object.
In a best-case scenario, one can argue that measurement involves the transformation of qualitative information into a quantitative inference.
4 Believe it or not, there are—theoretically at least—conditions when the answer to this could be yes, according to the theory of additive conjoint measurement (Luce & Tukey, 1964; Krantz, Luce, Suppes, & Tversky, 1971).
5 One of the earliest uses of the term construct that I have found in connection to educational and psychological measurement is in Thurstone (1947, 53). However, the earliest comprehensive expositions can be found in MacCorquodale and Meehl (1948) and, most notably, Cronbach and Meehl (1955). For a rich account and analysis of this historical development through contemporary usage in the context of "construct validity," see Slaney (2017). For a dive into possible meanings of the term latent in the context of latent variable models, see Borsboom (2008). In this book, to the extent that I use the term latent, I only take it to mean "hidden" or "concealed." In this sense, a latent attribute is synonymous with what I defined earlier as an intensive attribute.
6 In answering this question, Borsboom et al. (2004, 1061) have proposed a competing definition of test validity: "A test is valid for measuring an attribute if and only if (a) the attribute exists and (b) variations in the attribute causally produce variations in the outcomes of the measurement procedure."


7 With the exception of Chapters 3 and 5, each chapter contains some technical material that will be easiest to follow for those who have taken introductory coursework in statistics and psychometrics. I provide a refresher on a few important statistical concepts in the appendix to this chapter. To the extent that there are mathematical equations to be found in this book, they are typically never more complicated than a series of linear equations to which algebraic reasoning has been applied.
8 The actual form of the expression Gauss derived was $f(x) = \frac{h}{\sqrt{\pi}} e^{-h^{2}x^{2}}$, where x already represents deviations (i.e., errors) from the true value (i.e., μ = 0) and $\sigma^{2} = \frac{1}{2h^{2}}$. Here h, which is the inverse of the dispersion of deviations, can be interpreted as the precision or accuracy of the observations around the mean. Stigler has argued that Gauss's original expression is both more readily interpretable and mathematically manipulable than the form that results from the substitution $h = \frac{1}{\sigma\sqrt{2}}$.
9 For more on the Gauss and Laplace synthesis, see chapter 4 of Stigler (1986).
10 Galton himself regarded the term probable error as both unfortunate and misleading (since the most probable "error" in using the mean as an estimate of the value of a variable with a normal distribution is zero; Galton, 1889, 57) and would often simply represent it with the symbol Q or refer to it as a deviation unit.

2 PSYCHOPHYSICAL MEASUREMENT Gustav Fechner and the Just Noticeable Difference

2.1 Overview

Well beyond the written historical record, human beings have surely always speculated about issues that are at the core of the activity of measurement. Perhaps nothing in this speculation has engendered greater debate than the measurability of latent human attributes. Happiness, sadness, anger, jealousy, laziness. Hunger, pain, endurance, strength, fatigue. Intelligence, wit, creativity, idiocy. Can such qualities be quantified? To be sure, in casual conversation, they are invariably a basis for comparison. One person seems happier than the other, or the same person seems happier than the day before. But from this apparent order can we discern magnitudes? Are there amounts of happiness? Is there a zero point at which all happiness is gone? Along the way to that zero point does happiness turn into sadness or anger? What would it mean to measure happiness? These are as much philosophical questions as they are scientific questions. In the Middle Ages, Nicole Oresme (1320–1382) speculated about the measurability of "inner qualities" and used a mathematical argument to suggest that some subsets of qualities might be amenable to quantitative treatment, provided that one is willing (in the words of Joel Michell, 1999, 55) to engage in the "act of constructive imagination." In this chapter, we shall explore one of the first great examples of constructive imagination taken to fruition, the program of psychophysical measurement envisioned and enacted by Gustav Fechner midway through the 19th century. Let us take as a starting point a willingness to hypothesize that a psychological attribute exists and that humans vary in the magnitude of the attribute that they possess. In other words, we hypothesize that the attribute can be located along a unidimensional continuum. Now this is a fairly audacious assumption

28 Psychophysical Measurement

to make. But having made it, we are still faced with a major challenge: the continuum does not come with a defined origin or unit, and in the absence of both, the paradigm for measurement we have internalized for many physical attributes seems unattainable. Before we go on to see how Fechner sought to overcome this problem, we need to take a moment to appreciate the critical role of units in the practice of measurement. To do this, we turn briefly to the field of metrology, devoted to the scientific study of measurement and its applications in the physical sciences. The domain of metrology is defined by the physical sciences such as astronomy, physics, and chemistry, in which the focus is on the study of natural objects and events that tend to be inanimate. For all intents and purposes, the field can be said to have come into existence in the mid- to late 19th century for the express purpose of establishing and maintaining international standards for units of physical measurement. Although the advantages of uniform standards for lengths and weights had long been appreciated, and numerous attempts had been made toward this end throughout the 18th century and the first half of the 19th century, it was only following the Metre Convention of 1875 ("the Treaty of the Metre") that the three organizations responsible for guiding metrological practice came into existence:

• The General Conference on Weights and Measures (CGPM)
• The International Committee on Weights and Measures (CIPM)
• The International Bureau of Weights and Measures (BIPM)

The treaty signed by 17 countries at the Metre Convention established the metric system of units, a product of the French Revolution that had first gained international attention in 1791, as an agreed-on basis for international commerce among participating countries. Beyond its responsibility to make and preserve standards for measurement units, the original work envisioned for the CIPM and the BIPM came in the design of instrumentation that was calibrated to these standards. Over time, however, the ongoing challenge taken up by metrologists employed by the BIPM (and elsewhere) was to discover methods for realizing standard units with respect to invariant laws of nature. This notion had been an aspiration for scientists since the 17th-century discovery by Galileo that the period of a pendulum (the time that it takes to swing from one end to the other) is a function of its length, not the distance from which it is released. This relationship can be expressed mathematically as $T = 2\pi\sqrt{L/g}$, where T represents the period of a pendulum's swing, expressed in some unit of time, L represents the length of the pendulum, and g represents the acceleration due to gravity. In 1644, the French mathematician Marin Mersenne discovered that at 45 degrees latitude, a pendulum of a certain length (39.1 inches in the English imperial units

Psychophysical Measurement

29

prevalent at the time) will complete its swing in precisely 1 second, and the first pendulum clock was built by the Dutch scientist Christiaan Huygens in 1657. This suggested an approach whereby either units of time could be defined relative to a fixed value of length or units of length could be defined relative to a fixed value of time. As an aside, it is important to appreciate that prior to Mersenne's discovery linking the swing of a pendulum directly to the duration of a second, the concept of a second as an equal interval reference unit for time was purely theoretical—or at least purely definitional. Only the hour of the day had an observable point of reference relative to the apparent movement of the sun registered on a sundial, and mechanical clocks were engineered so that they could approximate these measures. From this, the minute and the second were defined as smaller intervals within the hour, on a mathematical basis, before it was possible to measure them with any accuracy. A minute was 1/60 of an hour, a second was 1/60 of a minute. Prior to the pendulum, mechanical clocks could only provide an approximation of these seconds and minutes in the movement of their escapement mechanisms, and it is quite likely that some seconds and minutes were longer than others. With the advent of pendulum-driven clocks, the measurement of time was, for the first time, fully divorced from the light of the sun.

When metrology officially came into being, the two base units of the metric system, the meter and the kilogram, were not, in fact, defined relative to the oscillation of a seconds pendulum but instead according to two physical artifacts and prototypes, which were distributed among member countries. The intended rationale for the meter prototype had been a fraction (1/10,000,000 to be precise) of the distance from the equator to the North Pole along the meridian passing through Paris (with a kilogram then defined as the mass of a cubic decimeter of water).
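The pendulum relation quoted earlier, T = 2π√(L/g), can be checked against Mersenne's 39.1-inch length. This is a back-of-the-envelope sketch; the value g = 9.81 m/s² is my assumption, since the text gives no figure:

```python
import math

L = 39.1 * 0.0254   # Mersenne's length: 39.1 inches in meters (~0.993 m)
g = 9.81            # assumed acceleration due to gravity, m/s^2
T = 2 * math.pi * math.sqrt(L / g)

print(round(T, 2))      # about 2.0: one full back-and-forth oscillation
print(round(T / 2, 2))  # about 1.0: the one-way swing Mersenne timed
```

Note that with the conventional definition of the period as a full back-and-forth oscillation, T comes out near 2 seconds; the one-way swing from one end to the other, which is what Mersenne's observation concerned, is T/2, almost exactly 1 second.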
At this early stage, however, the artifacts themselves defined length and weight, since neither this distance along the meridian nor the length of a pendulum could be replicated with sufficient accuracy under different conditions to reproduce identical units. The success of metrology as a field has come in the progress made in realizing these and other units with ever-increasing accuracy over the ensuing decades. What began as a fairly modest system of metric units for length and weight would evolve during the 20th century into an International System of Units (known as the SI) first established in 1960. As initially conceptualized, there were three components to the SI: (1) base units—meter, kilogram, second, ampere, degree Kelvin (later known just as the kelvin), and candela (a seventh, the mole, was added in 1972); (2) derived units—units defined as products of powers of the base units; and (3) prefixes—the level or power at which the unit is to be expressed (e.g., the decameter, hectometer, and kilometer represent the meter multiplied by 10¹, 10², and 10³, respectively, while the decimeter, centimeter, and millimeter represent the meter multiplied by 10⁻¹, 10⁻², and 10⁻³, respectively).


However, if we jump ahead in time to May 20, 2019, 144 years after the Metre Convention, we find that the SI, while superficially retaining the same structure with respect to base units, derived units, and unit prefixes, has undergone a substantial redefinition. All the units of the SI, whether base or derived, can now be defined in terms of seven invariant numeric constants (shown in Table 2.1), each of which can be generated (or perhaps it is better to say engineered) in an experimental laboratory setting. For the first time in human history, none of the seven base units of the SI is defined with respect to a physical artifact.

Gustav Fechner's career overlapped with that of James Clerk Maxwell (1831–1879) and William Thomson (1824–1907, also known as Lord Kelvin), two mathematical physicists whose emphasis on the distinction between fundamental and derived measurement units anticipated the eventual structure of the SI. Maxwell and Thomson had become strong advocates for a coherent system of fundamental and derived units, in large part, because of the advances they were contributing, respectively, to the theoretical and empirical understanding of

TABLE 2.1 The SI Base Units and the Defining Constants Used to Realize Them

SI Base Units

Symbol   Name       Quantity
s        second     time
m        meter      length
kg       kilogram   mass
A        ampere     electric current
K        kelvin     thermodynamic temperature
mol      mole       amount of substance
cd       candela    luminous intensity

SI Defining Constants

Symbol   Name                                     Exact Value
Δν_Cs    hyperfine transition frequency of Cs     9 192 631 770 Hz
c        speed of light                           299 792 458 m/s
h        Planck constant                          6.626 070 15 × 10⁻³⁴ J·s
e        elementary charge                        1.602 176 634 × 10⁻¹⁹ C
k        Boltzmann constant                       1.380 649 × 10⁻²³ J/K
N_A      Avogadro constant                        6.022 140 76 × 10²³ mol⁻¹
K_cd     luminous efficacy of 540 THz radiation   683 lm/W

Source: https://en.wikipedia.org/wiki/International_System_of_Units.


electricity, magnetism, light (culminating in Maxwell's 1873 book A Treatise on Electricity and Magnetism), and thermodynamics (Thomson had introduced his absolute thermometric scale in 1848). The importance of a system of standard units to scientific advance was articulated by Thomson—by then Lord Kelvin—as part of an address ("On Electrical Units of Measurement") given to the British Institution of Civil Engineers in 1883. In this ambitious and wide-ranging talk, Kelvin acknowledges the proliferation of (sometimes competing) units for the measurement of electrical properties and the challenge of realizing these units with sufficient accuracy. He describes the value of an absolute system of measurement in which all units can be derived from a base consisting of units for length (centimeters), mass (grams), and time (seconds): the "cgs" system of units. Kelvin's vision, shared by Maxwell, was to move from units defined according to physical artifacts to units defined by constants of nature. In short, he was envisioning the SI of 2019. It is somewhat unfortunate given the scope of Kelvin's address that many are only familiar with the second of the two sentences Kelvin chose for his opening:

I often say to you that when you can measure what you are speaking about, and express it in numbers, you know something about it; but when you cannot measure it, when you cannot express it in numbers, your knowledge is of [a] meager and unsatisfactory kind; it may be the beginning of knowledge, but you have scarcely in your thoughts advanced to the state of Science, whatever the matter may be.
(Thomson [Lord Kelvin], 1889, 73–74)

The preceding sentence, frequently omitted, reads: "In physical science the first essential step in the direction of learning any subject is to find principles of numerical reckoning and practical methods for measuring some quality connected with it." The omission is unfortunate because, in the first place, it gives the impression that Kelvin's quote applies to measurement in any domain, and in the second place, it keeps the reader from wondering what Kelvin was driving at with his preconditions of "principles of numerical reckoning" coupled with "practical methods for measuring." The principles to which Kelvin referred represented the role of theory in the form of causal laws underlying the attribute of measurement (e.g., changes in thermal energy in the form of heat cause an observable expansion of mercury), and the practical methods represented the engineering challenge (e.g., thermometry) of designing instrumentation capable of transducing variability in the attribute. I provide this background to give the reader a sense of the tide in which Fechner was swimming when it came to the developing practices of measurement in the physical sciences during the mid- to late years of the 19th century. To the extent that there were controversies to be found about physical measurement, it was in competing proposals for standard units, the extent to which the technology of the


era could be used to harness the laws of nature in realizing the units, and the political challenge of fostering national and international consensus. There was, however, no real debate to be found among physicists as to the meaning of measurement. Indeed, Fechner defined the activity with words virtually identical to those used by Maxwell and Kelvin: "the measurement of a quantity consists in ascertaining how often a unit quantity of the same kind is contained in it" (Fechner, [1860] 1986, 38). Still, Fechner charted a new course relative to the mainstream of physical measurement in two related contentions. First, Fechner would argue that any distinction between fundamental and derived units of measurement was an illusion, that all measurement—even of the attributes of length, mass, and time—involved the search for some external standard with which comparisons could be made. In this, when we consider the most recent change to the SI (see Table 2.1), we can say with hindsight that Fechner was rather prescient. Second, Fechner believed that every process that happens in the world, whether physical or psychological in origin, can be reduced to atomic movements and that these movements could be understood as exchanges of kinetic energy. Therefore, Fechner could see no logical reason to restrict the concept of measurement to the domain of physical attributes. The challenge, as Fechner saw it, was to devise methods for expressing the amount of a psychological attribute with respect to something that was related to it in a spatiotemporal sense. But this was a challenge that could be approached in a manner that was analogous to the methods of physical measurement.
To the extent that readers are familiar with Fechner, it is most likely that their familiarity comes in the form of “Fechner’s Law,” which stipulates that psychological sensation is measurable as a logarithmic function of physical stimulus. We will see how this formulation was infuenced both by Fechner’s broader theory of measurement and by the empirical research fndings of Fechner’s colleague, Ernst Weber. In Section 2.3, we go into the details of one of the most well-known experimental methods that Fechner applied to produce a scale that was interpretable with respect to psychological units. To pull this of, Fechner made two critical assumptions: (1) that just noticeable diferences in physical magnitudes always invoked the same diference in psychological sensation and (2) that human errors in noticing diferences between distinct physical magnitudes could be modeled according to a normal distribution. In Section 2.4, we consider criticisms of Fechner’s psychophysical approach to measurement. We conclude with a brief discussion of Fechner’s legacy to measurement in the human sciences.

2.2 The Origins of Psychophysics

2.2.1 Fechner's Background

Gustav Theodor Fechner was born in 1801 in the village of Gross-Särchen in Lower Lusatia, close to the present-day borders between Germany and Poland


to the east and the Czech Republic to the south. Fechner's father, grandfather, and uncle were all pastors, and he was raised in a family in which there was some precedent for a tug-of-war between the spiritual and the scientific. To wit, Fechner's father was said to have provoked controversy by placing a lightning rod on the roof of his church. Fechner's father died when he was just 5 years old, and rather than follow in the family tradition, by the age of 16, he left to study medicine at the University of Leipzig, a city and institution with which he would be affiliated for the rest of his life. During his undergraduate studies, Fechner renounced religion and became an atheist, but he soon decided he was not well suited for a career in medicine ("in part I felt that I had absolutely no practical talent for that profession").1 He passed his baccalaureate examination at the University of Leipzig but stopped short of completing a doctorate degree in medicine. As part of his coursework at Leipzig, Fechner was exposed to a

FIGURE 2.1 Gustav Theodor Fechner (1801–1887).
Source: © Getty Images.


broad array of topics in the physical sciences, and while few of the lectures he attended were inspiring, a series of lectures on physiology by a young Ernst Weber captured his imagination and encouraged him to study mathematics.

During the 1820s, Fechner's interests ping-ponged between philosophy and physics. Inspired by reading Lorenz Oken's book Philosophy of Nature, he committed himself to the pursuit of an academic career in philosophy. He completed the equivalent of a doctoral degree in philosophy by 1823 yet, during the same time, made money by translating French handbooks on topics in physics and chemistry into German.2 By the mid-1820s, Fechner was writing both scientific articles and philosophical books and giving lectures in physics at the university. He became increasingly immersed in the study of physics, and after Georg Ohm established the mathematical relationship among electric current, resistance, and force that became known as Ohm's Law, Fechner performed a series of experimental tests of the law that led him to publish a paper in 1831 on the measurement of direct currents. This and other work helped him establish a reputation as a physicist, and by 1834, at the age of 33, he had been given an appointment at the University of Leipzig as a tenured professor of physics, just one year after he had married Clara Volkman, the sister of his colleague at Leipzig, the physiologist A. W. Volkman. Over the ensuing six years, Fechner kept up the research program of an aspiring experimental physicist while also finding the time to pursue his interests in philosophy, publishing The Little Book of Life After Death in 1836 under the pseudonym Dr. Mises. His career came crashing to a halt in 1840, when through some combination of mental exhaustion and physical injury, Fechner found himself unable to maintain either his professional duties or his personal familial commitments.
He resigned his chaired position in physics (although he maintained his university affiliation and pension) and became an invalid. He was depressed, ate next to nothing, refused to speak, and spent most of his time in a darkened room. He wore a mask covering his eyes when walking outdoors. Suddenly, in October 1843, for no apparent reason, Fechner recovered. He made no attempt to return to his academic appointment; instead, he turned almost exclusively toward the development of his philosophy of nature. He spent the next seven years outside the public eye conducting the experimental work that would culminate in Elemente der Psychophysik (Elements of Psychophysics),3 first published in 1860 when Fechner was 59 years old. It was in this year and from this unusual source that both theory and methods were introduced for the purpose of measuring a psychological attribute. Perhaps no English writer has captured the unusual career arc of Fechner more succinctly than Boring (1950), who put it this way:

This then was Fechner. He was for seven years a physiologist (1817–1824); for fifteen years a physicist (1824–1839); for a dozen years an invalid (1839 to about 1851); for fourteen years a psychophysicist (1851–1865); for eleven

Psychophysical Measurement

35

years an experimental estheticist (1865–1876); for at least two score years throughout this period, recurrently and persistently, a philosopher (1836–1879); and finally, during his last eleven years, an old man whose attention had been brought back by public acclaim and criticism to psychophysics (1876–1887)—all told three score years and ten of varied intellectual interest and endeavor. If he founded experimental psychology, he did it incidentally and involuntarily, and yet it is hard to see how the new psychology could have advanced as it did without Elemente der Psychophysik in 1860. (283)

2.2.2 Fechner's Conceptualization of Measurement

The details of Fechner's philosophy of science and how it can be situated in prevailing perspectives among his predecessors and contemporaries of the 19th century are outside the scope of this chapter and frankly beyond my own limited ability to understand and convey (for this see Heidelberger [2004] and the many references contained therein). Instead, I consider only those aspects of Fechner's philosophical worldview that are especially relevant to his theory of measurement. Central to Fechner's philosophical explorations was the belief that the activities of the body were inseparable from the processes of the mind and vice versa. Stated more broadly, the idea extended to how humans come to understand the natural world. Fechner articulated a "day view" of the world that involved the accumulation of qualitative experience and mental introspection which stood in contrast with a "night view" of the world that involved the discovery and expression of physical laws that govern the relationships between inanimate objects and events. The day view was holistic; the night view, reductionist. But each represents one side of the same coin. As the story goes, while contemplating the mind–body relationship on the morning of October 22, 1850 (Boring, 1950, 280), Fechner had the epiphany that a consequence of this worldview was that it must be possible to measure an increase in mental energy with respect to a corresponding increase in bodily energy. One expression of this epiphany comes early on in Elements (p. 23): The whole of nature is a single continuous system of component parts acting on one another, within which various partial systems create, use, and transmit to each other kinetic energy of different forms, while obeying general laws through which the connections are ruled and conserved.
Since in exact natural science all physical happenings, activities, and processes, whatever they may be called (not excluding the chemical, the imponderable, and the organic) may be reduced to movements, be they of large masses or of the smallest particles, we can also find for all of them a yardstick of their activity or strength in their kinetic energy, which can always be measured, if not always directly, then at least by its effects, and in any case in principle.


As a concrete example, imagine that you are playing in a game of basketball. A teammate passes the ball to you and you catch it. According to Fechner, there are two equivalent ways to conceptualize what has just transpired. The first way is to focus on the ball as the object (i.e., body) of measurement. Hence, you might seek to measure the force on the ball when it makes contact with your hands, and through the application of Newton's second law of motion, this could be estimated as a product of the ball's mass and acceleration. The second way is to focus on yourself as the object of measurement, and in this case, you seek to measure the psychological sensation produced when you catch a ball that has been thrown to you with a certain force. To Fechner, this sensation is no more and no less than a transformation of the kinetic energy of the ball into kinetic energy in the brain. The challenge is to figure out the functional form of the transformation. But where do we start? Fechner saw measurement as a matter of finding a unit or standard that could be used to count equalities in some target quantity. For this to be possible, three principles had to hold. First, it must be possible to conceptualize an attribute (e.g., psychological sensation) as something that increases or decreases in a continuous manner. Second, it must be possible to reliably produce and discern a difference within this continuum that can be compared to any other two magnitudes along the continuum. Third, it must be possible to discern when the magnitude on the continuum is zero. If all three principles are met, we find ourselves with the classical conception of measurement.
The problem, when it comes to the measurement of a psychological attribute, is that while it might be possible to convince ourselves that the first and third principles are at least logically plausible, the second principle seems unobtainable: How can we expect to reliably discern the same increment of a latent attribute, and even if we could, how can this be compared to some target magnitude of the same attribute? Fechner (1860), while recognizing the challenge, argued that even in physical measurement, it is never the case that a unit is a pure instance of the attribute being measured: Sensation does not divide into equal inches or degrees by itself, units that we can count and summate. Let us keep in mind, however, that the same problem arises for physical magnitudes. After all, do we count periods of time directly in terms of time, when measuring time, or spatial units directly in terms of space, when we measure space? Do we not rather employ an independent yardstick, a measuring rod, which for time does not consist of pure time, nor for space of pure space and for matter of pure matter alone? Measuring any of these quantities demands something else as well. Why should the case not be the same in the mental or psychological sphere? The fact that the psychic measure has always been sought in the sphere of the purely psychic may so far have been the main reason for our inability to find it. (47)


That is, even among the three physical attributes that Maxwell and Kelvin regarded as the foundation for a coherent system of units, no single attribute is measurable without taking the others into account. Even for the measurement of length, a choice must be made in the mass of the standard and the time when the length of some object is being compared to the standard. For time and mass, Fechner's point is even easier to appreciate. When measuring time by the hands of a clock, we use the clock as a substitute for time according to a measurement formula in the form T = nD, where D represents a unit of distance (e.g., either a minute or an hour on most clocks) and n represents the number of these units that have been passed. We interpret the product of number and unit as our measure of the time of the day. We use the force exerted by gravity to measure an object's mass as a function of weight. The standard units for the measurement of any physical attribute can be derived if the attribute can be shown to have a functional relationship with other attributes that have known units, or at least units that are spatiotemporally observable. In this sense, all physical measurement requires the specification of a "measurement formula," a formula that gets instantiated in our choice of instrumentation. Beyond this, Fechner argued that the act of measurement always involves an estimate based on the mental impression that is made on the measurer. Heidelberger (2004) describes this as follows: To measure means to discover that a standard (or a standard multiplied, or only part of it) is equal to what is being measured. Discovering that equality rests on the subjective condition that the standard and whatever is being measured seem equal to the observer. . . . The observer's subjective impressions play a fundamental part in observations of physical measurement: the observer cannot be eliminated.
(199)

Thus, Fechner reasoned that since physical quantities are ultimately understood as measures through a psychological interpretation, it must be possible to invert the relationship such that psychological quantities could be understood through a physical interpretation.

2.2.3 Weber's Law

Fechner (1860) defined psychophysics as "an exact theory of the functionally dependent relations of body and soul or, more generally, of the material and the mental, of the physical and the psychological worlds" (7). But on what theoretical basis could he define this functional dependency between the physical and the psychological? For this, Fechner turned to a collection of findings from relatively recent experiments that had been conducted in the 1830s and 1840s by Ernst Weber. Weber's primary focus in these experiments


was on the relationship between the sense of touch, weight, and temperature, but he also extended his experiments to include the ability of subjects to discriminate between the lengths of lines and the pitches of tones. His results were ultimately published in the 1846 book Der Tastsinn und das Gemeingefühl (The Sense of Touch and the Common Sense). Weber's pivotal finding had been that when subjects are exposed to physical stimuli of two different magnitudes, the increment between magnitudes that is "just noticeable" is typically a constant fraction of the base stimulus. For the moment, we leave aside the issue of how one would determine that two magnitudes were just noticeably different (we turn to this in the next section). There are several equivalent ways to express Weber's finding with a little bit of mathematical notation. Let X represent some physical quantity (e.g., weight) for which some finite set of magnitudes can be both observed and reproduced. Consider any pair of magnitudes, xa and xb, which are compared and found to be just noticeably different such that xb > xa. The first way to express Weber's finding is that for any two such magnitudes,

xb / xa = C,    (2.1)

which indicates that the ratio between two magnitudes that are just noticeably different is a numeric constant, C. If we subtract 1 from both sides of the equation, this can be reexpressed as

(xb − xa) / xa = C − 1.    (2.2)

A more general expression comes from defining ΔX = xb − xa as the change in any value of X that is necessary before a difference in magnitude will be just noticeable by a human subject and defining k = C − 1 as another numeric constant that is a simple transformation of C. Weber's finding, which has become known as "Weber's Law," then takes the form

jnd(X) = ΔX / X = k,    (2.3)

where jnd stands for a just noticeable difference. So, for example, imagine that I lift a coffee cup that weighs 750 grams. If I lift a second coffee cup, the cup needs to weigh 780 grams before I notice that the second cup is heavier, making the jnd 30 grams. It follows that k in the Weber equation would be computed as 30/750 = .04 (while C = 1.04).


Now, according to Weber's Law, if the same comparison were made between another two cups for which the base weight was twice as large, then the jnd should also be twice as large (i.e., .04(1500) = 60). To validate this, when one plots different values of x along a horizontal axis and the different values of the experimentally obtained k on the vertical axis, the resulting points should fall along a horizontal line. In his book, Weber found values of k equal to 1/40 for the discrimination of weight, between 1/50 and 1/100 for lines, and 1/160 for tones. Although the generalizability of Weber's Law is an open question, it does seem to rather nicely explain why I am far more likely to notice (and react angrily) when the cost of a latte increases from $4.00 to $4.50 than I am when the cost of a plane ticket increases from $400 to $420.
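Weber's Law is straightforward to check numerically. The sketch below (plain Python; the 750-gram cup and k = .04 come from the text, while the larger base weights are hypothetical extensions of that example) computes the Weber fraction k = ΔX/X for jnds observed at several base magnitudes; under the law, the recovered fraction is constant.

```python
# Weber's Law: the jnd is a constant fraction k of the base stimulus magnitude.
# The 750 g cup and k = .04 follow the text; the larger bases are illustrative.

def weber_fraction(base, jnd):
    """Return k = delta_X / X for one observed jnd."""
    return jnd / base

bases = [750, 1500, 3000]      # base weights in grams
jnds = [30, 60, 120]           # jnds implied by Weber's Law with k = .04

fractions = [weber_fraction(b, j) for b, j in zip(bases, jnds)]
print(fractions)               # a "horizontal line" of identical fractions: [0.04, 0.04, 0.04]
```

Plotting `bases` against `fractions` would give exactly the flat line described above.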

2.2.4 A Measurement Formula4

Fechner took the constant relationship between stimulus intensity and discriminatory perception described by Weber as a basis for the measurement of sensory intensity. His starting point was to relate specific ratios of observable physical magnitudes to differences in psychological sensation. Let X again represent a physical quantity (e.g., weight), while the variable Y represents a psychological attribute also presumed to be quantitative (i.e., the sensation of weight). Fechner specified as his "fundamental formula" the relationship

ΔY = c · ΔX / X,    (2.4)

which conveys Fechner's assumption that irrespective of the base value of the stimulus (X), all jnds (ΔX) elicit the same change in sensation (ΔY). The positive constant c provides some flexibility for a similarity transformation that places a jnd, expressed as a ratio of the known scale of the physical quantity, onto a new scale of sensation differences. Although Equation 2.4 established a basis for realizing the units that could comprise the measuring scale of a psychological attribute, it falls short of establishing a scale with an absolute zero from which units could be counted. To this end, Fechner's next move was to assume that ΔY and ΔX could be made infinitesimally small so that Equation 2.4 could be written as the differential equation dY = c · dX/X, which can then be integrated as

Y = ∫ c · dX / X,    (2.5)

which leads to

Y = c · ln X + C.    (2.6)


The formula relating stimulus magnitude to sensation magnitude now has two unknown constants, but this can be simplified by the assumption that there is a magnitude of physical stimulus xt that is too small to consciously perceive, so 0 = c · ln xt + C, and therefore,

C = −c · ln xt.    (2.7)

Substituting Equation 2.7 into 2.6 leads to

Y = c(ln X − ln xt).    (2.8)

Fechner expressed this in logarithms of base 10 as

Y = z log(X / xt).    (2.9)

Finally, if X can be measured at its absolute threshold value and set as the unit of measurement (such that, in effect, xt = 1), we are left with a further simplified form of Equation 2.9 relating the level of a physical stimulus magnitude, X, to a level of sensory intensity Y:

Y = z log X.    (2.10)
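The same logarithmic scale can be reached without calculus by literally counting jnds above the threshold: each jnd multiplies the stimulus by the constant ratio C = 1 + k, so the number of accumulated jnds grows logarithmically in X. The sketch below (plain Python; the Weber fraction k = .04 is illustrative, and the threshold is set as the unit, xt = 1) compares the counted scale with the measurement formula Y = z log X, taking z = 1/log(1 + k).

```python
import math

k = 0.04     # illustrative Weber fraction
x_t = 1.0    # threshold stimulus, taken as the unit of measurement (x_t = 1)

def count_jnds(x):
    """Count jnd steps above threshold; each step multiplies the stimulus by (1 + k)."""
    n, s = 0, x_t
    while s * (1 + k) <= x:
        s *= (1 + k)
        n += 1
    return n

# Fechner's measurement formula Y = z * log10(X), with z = 1 / log10(1 + k)
z = 1 / math.log10(1 + k)
for x in (10, 100, 1000):
    print(x, count_jnds(x), round(z * math.log10(x), 1))
```

For each stimulus level, the counted number of jnds agrees with the logarithmic formula to within one unit, which is the intuition behind treating Equation 2.4 as a differential equation.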

This equation, which Fechner himself referred to as his "measurement formula," eventually became known as "Fechner's Law." Fechner made a distinction between "inner" and "outer" psychophysics that has an important bearing on the interpretation of his measurement formula. Namely, Fechner believed that there were two stages involved in the mechanism through which a physical stimulus was associated with a psychological sensation. In a first stage, the stimulus, X, triggers some form of neural excitation, θ. We can express this as θ = f1(X). In a second stage, the neural excitation produces a specific intensity of perceived sensation, Y. We can express this as Y = f2(θ). Inner psychophysics is all about understanding these two functional relationships. However, we begin with no empirical insights about θ, since we only observe X directly and Y indirectly. Therefore, the measurement formula of Equation 2.10 is an instance of outer psychophysics. Yet, if we put the two stages together, we see that in Fechner's conceptualization, the model that combines inner and outer psychophysics would take the general form Y = f3(X) = f2{f1(X)}, leaving the interpretation of the logarithmic relationship in Equation 2.10 equivocal—it could arise from f1(X), with f2(θ) as an identity; from f2(θ), with f1(X) as an identity; or from two unknown functions that combine to be logarithmic. For sensation to be interpretable as a psychological attribute that is meaningfully distinct from associated physical


stimuli, Fechner would argue that the logarithmic relationship must come from f2(θ) and that the ultimate goal of psychophysics as a program of research was to gain insights into the workings of this "inner" mechanism (see Heidelberger, 2004, 205–207). In Section 2.4, we consider criticisms raised about the derivation of Fechner's Law. For the time being, let us accept it as a motivating theory for the measurement of a psychological attribute. Now we have to turn to the obvious problem: to validate this law, we need access to values of Y. But these are unknown. Fechner's strategy was to use experimental methods to estimate the values of jnds at different locations of a physical stimulus. Then, if it were possible to locate the absolute threshold of sensation with respect to the physical stimulus, and if, as had been assumed in his fundamental formula, all jnds triggered the same differences in sensation, it followed that a scale of sensation could be constructed by the counting of jnds above the absolute threshold. With this scale in hand, one could examine the empirical relationship between Y and X. Fechner (1860, 59–111) described three different experimental and statistical approaches for the estimation of jnds: (1) the method of just noticeable differences, (2) the method of right and wrong cases, and (3) the method of average error. All three approaches are meant to accomplish the same thing, but it was in his implementation and analysis of the method of right and wrong cases that he made what would become his most general contribution to psychological measurement through his use of the normal cumulative distribution function to represent the probability of a correct discrimination when comparing a pair of physical stimuli. This method came to be known as "The Method of Constant Stimulus" or simply "The Constant Method" (Brown & Thomson, 1921; Guilford, 1936).
As we will see in Chapter 9, this method was taken up by Thurstone and became the basis for his law of comparative judgment. To those familiar with contemporary psychometric modeling through the use of item response theory, Fechner's approach represented a first instance of what we might think of today as an item response function (Bock, 1997). To make it easier to see the connections between Fechner's psychophysics and how Thurstone later adapted it, in the next section, I describe5 Fechner's method of right and wrong cases using notation and explanation inspired by Bock and Jones (1968).

2.3 The Method of Right and Wrong Cases (the Constant Method)

2.3.1 An Illustration of the Experiment

In what follows, let the variable xj represent a physical quantity that can be manipulated to produce a finite set of n ordered values (j = 1, . . . , n) that


will serve as stimuli in a psychophysical experiment. Each value in the series has the same base level, x0, but differs by the increment, δ. Hence, x1 = x0, x2 = x0 + δ, . . . , xn = x0 + (n − 1)δ. The choice of δ and the range of ordered values from x1 to xn is at the discretion of the experimenter, but the idea is to choose a value of δ that is not so large that the difference between two values of x is immediately self-evident (e.g., coffee cups that differ by 2 kilograms) but not so small that xj + δ could not be accurately reproduced in an experimental setting. Now we define any specific pair of magnitudes that will be pulled from the series x1 to xn as {xc, xt}. The magnitude xc is selected as the "standard" or "control" stimulus, and it will typically represent the middle value in the ordered series of magnitude values. This value remains fixed in all pairwise comparisons. The magnitude xt is the "variable" or "test" stimulus and will vary in each pairwise comparison. The crux of the experiment is to present a subject (or subjects) with one of n possible pairings {xc, xt}, with each stimulus presented in sequence. The subject is then asked to judge which of the two stimuli had the greater magnitude.6 The results from these comparisons can be generated using either a single judgment or multiple judgment approach. In a single judgment experiment, each subject judges only one pair of stimuli. So, if there are n pairs of stimuli in total and N subjects judge each pair, a balanced experiment would require nN subjects to produce a total of N judgments per comparison. In a multiple judgment experiment, each subject judges all pairs of stimuli and may be asked to replicate each judgment m times, producing a total of mN judgments per comparison.7 We can now define N_tc to represent the total number of responses per unique {xc, xt} comparison, irrespective of whether the responses are generated from a single or multiple judgment experiment.
Let i index either the judgment of a unique subject (in the case of a single judgment design) or a unique judgment that could come from different subjects (in the case of a multiple judgment design). We apply the following scoring rule for each judgment: s_tci = 1 if the judgment is made that xt > xc, and s_tci = 0 if the judgment is made that xt < xc. The key summary statistic is the proportion of judgments for which s_tci = 1, computed as

p_tc = (Σ_{i=1}^{N_tc} s_tci) / N_tc.

Fechner's famous implementation of the experiment described above involved making comparisons between the weight of two containers. He recognized that


people would vary in their sensitivity to differences in weight, so to hold this constant in his experiments, he typically served as the sole subject and conducted multiple replications for each paired comparison (i.e., Fechner conducted a multiple judgment experiment in which N = 1 and m was typically 10 or greater). Fechner's empirical research using this method with lifted weights spanned 1855 to 1859, and he reportedly recorded a cumulative total of 67,072 pairwise comparisons! To make the method more concrete with a specific example, we can imagine having two ceramic coffee mugs, each weighing 760 grams. We can also imagine having a set of equally sized marbles, each weighing 10 grams. By filling a mug with zero to eight marbles, we can define an ordered series of stimulus magnitudes ranging from 760 to 840 grams. Applying the notation introduced above, x0 = 760, δ = 10, and n = 9. We place 4 marbles in one mug and designate this as our control stimulus value, xc = 800. The experimental factor of interest is the incremental difference in weight between the test stimulus, xt, and the control stimulus, xc. This factor has 9 different levels, characterized by the ordered vector {−40, −30, −20, −10, 0, +10, +20, +30, +40}. Table 2.2 presents hypothetical results.8 The fourth column of Table 2.2 shows p_tc, the proportion of judgments that xt > xc from a single subject (e.g., Fechner) who makes 20 judgments for each of n = 9 pairwise comparisons between coffee mugs. We will come to the meaning of the fifth column with the heading μ̂_tc shortly.
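The design just described can be simulated. In the sketch below (plain Python), judgments follow a latent-sensation model in the spirit of Section 2.3.2: each mug evokes a normally distributed sensation centered on its true weight, and the heavier-feeling mug is chosen. The noise level sigma = 15 grams is an assumption made purely for illustration.

```python
import random

random.seed(1)

x0, delta, n = 760, 10, 9                      # base weight (g), increment (g), stimuli
stimuli = [x0 + j * delta for j in range(n)]   # 760, 770, ..., 840
xc = stimuli[4]                                # control stimulus: middle value, 800 g

def judge(xt, xc, sigma=15.0):
    """One judgment: returns 1 if the test stimulus feels heavier than the control.
    Sensations are the physical weights plus independent normal noise (assumed)."""
    return 1 if random.gauss(xt, sigma) > random.gauss(xc, sigma) else 0

N = 20                                         # judgments per comparison, as in Table 2.2
for xt in stimuli:
    p_tc = sum(judge(xt, xc) for _ in range(N)) / N
    print(xt - xc, p_tc)
```

With only 20 judgments per pair the simulated proportions are noisy, which helps explain why Fechner favored very large numbers of replications.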

2.3.2

Applying the Law of Errors

Figure 2.2 provides a visualization of the relationship between the difference in weights for test and control coffee mugs (x-axis) and the proportion of times that the test mug has been judged heavier than the control mug (y-axis).

TABLE 2.2 Hypothetical Results From an Experiment Judging Weight Differences Using the Method of Constant Stimulus

xt     xc     xt − xc    p_tc    μ̂_tc
760    800    −40        .05     −2.33
770    800    −30        .10     −1.81
780    800    −20        .20     −1.19
790    800    −10        .30     −0.74
800    800      0        .45     −0.18
810    800     10        .60      0.36
820    800     20        .80      1.19
830    800     30        .90      1.81
840    800     40        .95      2.33

Note: The first three columns are expressed in grams. Column 4 represents the proportion of responses in which the weight in column 1 was judged heavier than the weight in column 2. Column 5 is based on a transformation of column 4 using the inverse of the cumulative normal distribution function.

FIGURE 2.2 Observed Results From Nine Comparisons of Weights Repeated 20 Times Each. (x-axis: difference between test and control stimulus; y-axis: proportion of judgments t > c.)

But from this it is not immediately evident how we should estimate an exact value for the jnd. By the setup of the experiment, it would at least appear we have assumed it to be somewhere between 0 and 40 grams in absolute value. But it is really only when the difference is 40 that we can be almost certain that the correct judgment will be reached. So there are two key questions: First, what value of p_tc should we choose as the basis for locating the threshold for a discriminatory judgment? Is it at .50, when there are equal odds of a correct judgment relative to an incorrect judgment? There are some good mathematical reasons to favor this, but it seems like a bad choice in the present context because with just two stimuli being compared, there will be a 50% chance of a correct judgment through random guessing. Perhaps it should sit at .75, halfway between guessing and certainty. As it turns out, the latter is the threshold that Fechner used. Second, with the threshold definition in hand, notice that no observed p_tc matches .75 exactly, but we can at least narrow down the location of the jnd to somewhere between 10 and 20 grams. If we take this one step further and are willing to do some linear interpolation, we can further estimate the specific value of the jnd relative to a base weight of 800 grams to be approximately 17.5 grams. This is not, however, the approach that Fechner took, and if we stopped here, we would miss one of Fechner's key methodological contributions, a contribution that was to a great extent independent of the validity of his proposed measurement formula for psychophysics. With this in mind, let's turn to the model Fechner employed so that he could estimate not only a jnd but also the sensitivity with which an individual could discriminate among stimuli.
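The linear interpolation step can be made explicit. Given the two bracketing rows of Table 2.2 (p_tc = .60 at +10 grams and .80 at +20 grams), a minimal sketch in Python recovers the approximately 17.5-gram jnd quoted above:

```python
def interp_jnd(x_lo, x_hi, p_lo, p_hi, p_target=0.75):
    """Linearly interpolate the stimulus difference at which p_tc reaches the threshold."""
    return x_lo + (x_hi - x_lo) * (p_target - p_lo) / (p_hi - p_lo)

# bracketing values from Table 2.2: p_tc = .60 at +10 g and .80 at +20 g
print(round(interp_jnd(10, 20, 0.60, 0.80), 1))   # 17.5
```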


For each observable physical stimulus magnitude, xj, assume there is an associated latent sensation intensity that is invoked, Yj. This variable can be expressed as the sum of two components, a fixed value, μj, and a random value, εj. Fechner assumed that this random component, or error, could be modeled as a realization from a normal distribution. Today such a move is so commonplace that few would think twice about it. But these sorts of applications of probability theory were not at all commonplace in the mid-19th century (e.g., see Stigler, 1986, 242–254). Figure 2.3 illustrates the implication of this assumption for the outcomes one would expect to observe if the same comparison between any pair of physical stimuli {xc, xt} could be replicated an infinite number of times. The left and right curves represent, respectively, the distributions of sensation intensities, Yc and Yt, evoked by the physical stimuli, xc and xt. Each distribution has a distinct mean, μc and μt, and a standard deviation, σc and σt. In Figure 2.3, these distributions have been plotted so that μt > μc and σc = σt. The greater the distance between μc and μt, the less likely we are to observe an incorrect discriminatory judgment. The spread of each distribution represents the sensitivity of an individual's sensation to physical magnitudes: the greater the sensitivity, the narrower the spread.

FIGURE 2.3 Theoretical Results From a Comparison of Weight With Magnitudes xc and xt Over Infinite Replications. (Two normal curves centered at μc and μt.) The scale of the x-axis is that of sensation intensity, not physical magnitude.


When the values of {xc, xt} are being compared, this stimulates a latent comparison that we can conceptualize as the difference between two random variables:

Y_tc = Yt − Yc = (μt − μc) + (εt − εc) = μ_tc + ε_tc.    (2.11)

The random component ε_tc can now explain why a subject might give inconsistent responses when comparing different pairs of physical magnitudes, or the same pair on multiple occasions. Consider Figure 2.4, which now depicts the theoretical distribution of Y_tc over infinite replications of the same comparison. When ε_tc is large and positive, it becomes more likely that a subject concludes that xt > xc, even if μ_tc is negative. When ε_tc is large and negative, it becomes more likely that a subject concludes that xt < xc, even when μ_tc is positive (note that μ_tc is positive in Figure 2.4). In summary then, a subject should conclude that xt > xc when Yt > Yc, but this depends on the magnitude of two factors: μ_tc, which is fixed, and ε_tc, which is random. Because it comes from the difference of two normally distributed random variables, Y_tc will also be normally distributed with an expected value of μ_tc and a variance, σ²_tc. To reiterate, Y_tc, μ_tc, and σ²_tc all correspond to differences between the variables Yt and Yc, μt and μc, and εt and εc. It is assumed that these

FIGURE 2.4 Theoretical Distribution of Differences in Sensation Intensity Over Replications. (A normal curve centered at μ_tc, with the region above 0 shaded.) The shaded area indicates the probability that xt is judged to be greater than xc.


differences in sensation intensity are evoked by the true difference in magnitude (xt − xc) when a subject is asked to compare the pair of physical stimuli. Given this, for any given comparison between xt and xc, the probability that xt > xc can be computed as the area under the normal curve, which is

P_tc = P(xt > xc) = P(s_tci = 1) = ∫₀^∞ [1 / (σ_tc √(2π))] exp{ −(1/2) [(y − μ_tc) / σ_tc]² } dy.    (2.12)
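Because Equation 2.12 is just the upper tail of a normal distribution, it can be evaluated with a standard cdf rather than by direct integration. A minimal sketch using Python's standard-library NormalDist (the parameter values are illustrative):

```python
from statistics import NormalDist

def p_tc(mu_tc, sigma_tc):
    """P(xt judged > xc): the area of N(mu_tc, sigma_tc) above zero (Equation 2.12)."""
    return 1 - NormalDist(mu_tc, sigma_tc).cdf(0)

print(p_tc(0.0, 1.0))    # 0.5: no mean difference in sensation puts judgments at chance
print(p_tc(1.0, 1.0))    # a positive mean difference makes a correct judgment more likely
print(p_tc(-1.0, 1.0))   # a negative one makes it less likely
```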

Equation 2.12 is a cumulative distribution function (cdf) and relates, for any given comparison of physical stimulus magnitudes, the probability of a response in which the test stimulus will be judged greater than the control stimulus to the parameters μ_tc and σ²_tc. If this theory of errors in discriminatory judgment is correct and if these parameters were known for all possible combinations of {xc, xt}, then we would expect to observe a smooth "S-shaped" curve—what Francis Galton first described as an "ogive"—connecting the points that were plotted in Figure 2.2. One of the things that this model brings into focus is a method—en route to the estimation of a jnd—of reexpressing known differences in magnitude from a physical scale in terms of what up to this point has only been a hypothetical psychological scale of sensation differences. The strategy goes like this:

1. Relate each {xc, xt} difference to the observed proportion p_tc.
2. Use p_tc as an estimate of P_tc from Equation 2.12.
3. Take the inverse of the cumulative distribution function to infer a scale value for μ_tc.

In employing this approach, and in keeping with the derivation of his measurement formula, Fechner assumed that the error variance σ²_tc remained constant irrespective of the pairing of {xc, xt}. As we will see in Chapter 9, Thurstone would later call this into question. Nonetheless, the assumption greatly simplifies matters, because if error variance remains constant across comparisons, then a scale of sensation differences need only be defined by locations. Taking advantage of the symmetry of the normal distribution, it can be shown9 that

μ̂_tc = Φ⁻¹(p_tc) √2 = z_tc √2.    (2.13)

Here, μ̂_tc is an estimate of the unobserved difference in sensation intensity, Φ⁻¹ is the inverse of the standard normal cumulative distribution function, and z_tc is a standard normal deviate. The term √2 is there to characterize the units in which these differences are to be understood and implies that σ_tc = √2. The last column
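The three-step strategy, with Equation 2.13 as its final step, can be applied directly to the observed proportions. A minimal sketch using Python's standard-library NormalDist for Φ⁻¹, with the p_tc values taken from Table 2.2:

```python
from math import sqrt
from statistics import NormalDist

def mu_hat(p_tc):
    """Equation 2.13: inverse standard normal cdf of p_tc, scaled by sqrt(2)."""
    return NormalDist().inv_cdf(p_tc) * sqrt(2)

# observed proportions from Table 2.2
props = [.05, .10, .20, .30, .45, .60, .80, .90, .95]
for p in props:
    print(p, round(mu_hat(p), 2))   # reproduces the last column of Table 2.2
```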


of Table 2.2 shows the values of $\hat{\mu}_{tc}$ estimated for each $\{x_c, x_t\}$. Consider the first row, in which the difference $x_t - x_c$ is associated with a $p_{tc}$ of .05. It follows that $z_{tc} = \Phi^{-1}(.05) = -1.64$, so $\hat{\mu}_{tc} = -1.64\sqrt{2} = -2.32$. That is, when the estimated probability of observing an error in judgment that $x_t > x_c$ is .05, we would predict that the mean sensation intensity of the test stimulus is 2.3 standard units less than the mean intensity of the control stimulus. Recall, however, that for Fechner, estimates of sensation differences $\hat{\mu}_{tc}$ were the means to the end of measuring sensory intensity levels specified in terms of jnd units. We still need an estimate for the jnd based on all the data from this experiment. One way to do this, consistent with Fechner's fundamental formula (Equation 2.4), is to express the expected value of the difference in sensory intensity as a linear function of the difference in associated stimulus magnitudes, such that

$$E(Y_{tc}) = \mu_{tc} = \alpha_c + \beta_c x_{tc}. \qquad (2.14)$$

The parameters $\alpha_c$ and $\beta_c$ relate observed differences in stimulus magnitudes to the differences in mean sensation intensity estimated by taking the inverse of the normal cdf. In an ideal experiment, in which all sources of confounding have been controlled, when $x_{tc} = 0$ it should also be the case that $E(Y_{tc}) = \mu_{tc} = 0$, which implies that $\alpha_c = 0$ as well. When it is not, this implies a source of bias. The slope parameter, $\beta_c$, is interpretable as the sensitivity of a subject (or subjects) to the stimuli being compared. If the increment between stimuli, $\delta$, is known to be small, a large $\beta_c$ is indicative either of a single subject that is sensitive to small differences in physical stimuli or of physical stimuli that are readily discriminated by some target population of subjects. Notice that Equation 2.14 can be directly related to Fechner's fundamental formula shown in Equation 2.4, with $\alpha_c = 0$ and $\beta_c = c/X$ (where $X$ will be a fixed value for a given implementation of the constant method). One way that estimates for $\alpha_c$ and $\beta_c$ can be generated is by finding the best-fitting line that relates the $n$ values of $\hat{\mu}_{tc}$ to the $n$ values of test vs. control stimulus pairings, $x_{tc}$. (Many other methods have been suggested; see Bock & Jones, 1968, for details.) Figure 2.5 illustrates this using the data shown in Table 2.2 and depicts the regression of $\hat{\mu}_{tc}$ on $x_{tc}$. The superimposed regression line fit by ordinary least squares suggests $\hat{\alpha}_c = -.06$ and $\hat{\beta}_c = .06$. With estimates for $\alpha_c$ and $\beta_c$ in hand, it becomes straightforward to find the jnd value associated with stimulus magnitude $x_c$. This is done by designating a conventional threshold for the jnd (also known as the difference limen). In Fechner's experiments, he set these at $p_{tc} = .75$ when $x_t > x_c$ or, conversely, at $p_{tc} = .25$ when $x_t < x_c$. The values of standard normal deviates associated with these two probabilities are $\pm.674$. To estimate the value of the physical stimulus

FIGURE 2.5 Results From Regressing Column 5 on Column 3 of Hypothetical Data in Table 2.1 (x-axis: Weight Differences in Grams, Observed; y-axis: Sense Differences, Latent).

associated with the upper difference limen, we can first apply Equation 2.13 such that $\hat{\mu}_{tc} = .674\sqrt{2} = .953$; and then, with this predicted value for the difference in sensory intensity in place, we can turn to Equation 2.14, and with some simple algebra,

$$\mathrm{jnd}(x_c) = \frac{\hat{\mu}_{tc} - \alpha_c}{\beta_c}. \qquad (2.15)$$

Applying this to the example data from Table 2.1, $\mathrm{jnd}(x_c)_{\text{upper}} = \frac{.953 + .06}{.06} = 17.22$, and $\mathrm{jnd}(x_c)_{\text{lower}} = \frac{-.953 + .06}{.06}$, so $\mathrm{jnd}(x_c)_{\text{lower}} = 16.89$ in absolute value. A single value for the $\mathrm{jnd}(x_c)$ in terms of the standard physical stimulus can be obtained by taking the mean of the absolute values of $\mathrm{jnd}(x_c)_{\text{upper}}$ and $\mathrm{jnd}(x_c)_{\text{lower}}$, which, in this case, is 17.05. From this one can conclude that a coffee mug would need to weigh 817.05 grams before it would be judged to be noticeably different from a coffee mug of 800 grams. Note that in this example the jnd is fairly close to the value of 17.50 we find from a simple linear interpolation of the points shown in Figure 2.2.

From here, there are two ways that one could go about finding jnd values associated with other stimulus values. If Weber's Law were known to hold for the physical stimulus under investigation, we can compute the Weber constant, $k = \frac{\mathrm{jnd}(x_c)}{x_c} = \frac{17.05}{800} = .021$. Now distinct jnds can be extrapolated for any given range of $X$. A better approach, because it would put the generality of Weber's Law to empirical test, would involve repeating the experiment multiple times using a new value for $x_c$ each time after adding or subtracting the jnd relative to the previous value of $x_c$. If significantly different values were to be found for $\alpha_c$, $\beta_c$, and $k$ across experiments, it would raise questions about the applicability of Weber's Law to the physical stimulus under investigation and, by extension, to the derivation and application of Fechner's Law as a measurement formula. Once a sequence of jnds has been established, the final result will be two variables, one that represents a measure of sensation intensity $Y$ in discrete jnd units and another that represents the measurement units of the original physical stimulus scale, $X$. The two sets of values can be plotted together. If the shape is logarithmic, this would be taken as support for Fechner's Law, and it would appear to suggest that a psychological attribute, weight sensation, can be derived from a physical stimulus, weight. A plot of the result one would expect to find if Weber's Law holds for weight magnitudes ranging from 1 to 25,000 grams (equivalently, .001 to 25 kilograms) is shown in Figure 2.6. Notice that once the weight increases to about 5 kilograms, it takes an increasingly large difference to register the same difference in sensation.

FIGURE 2.6 The Anticipated Result If Fechner's Law Holds.
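The whole constant-method pipeline in this section (proportions to normal deviates via Equation 2.13, the linear fit of Equation 2.14, the difference limens of Equation 2.15, and the Weber constant) can be sketched in plain Python. The stimulus differences and proportions below are hypothetical illustrations, not the values in Table 2.2:

```python
from math import sqrt
from statistics import NormalDist

inv_cdf = NormalDist().inv_cdf

# Hypothetical data: differences x_tc = x_t - x_c in grams, and observed
# proportions p_tc of trials on which the test stimulus was judged greater.
x = [-40.0, -20.0, -10.0, 10.0, 20.0, 40.0]
p = [0.05, 0.16, 0.27, 0.68, 0.84, 0.99]

# Equation 2.13: estimated sensation differences in sqrt(2) units.
mu = [inv_cdf(p_i) * sqrt(2) for p_i in p]

# Equation 2.14: ordinary least squares fit of mu on x.
n = len(x)
x_bar, mu_bar = sum(x) / n, sum(mu) / n
beta = sum((xi - x_bar) * (mi - mu_bar) for xi, mi in zip(x, mu)) / sum(
    (xi - x_bar) ** 2 for xi in x
)
alpha = mu_bar - beta * x_bar

# Equation 2.15: difference limens at p_tc = .75 and .25 (z = +/- .674).
mu_limen = 0.674 * sqrt(2)                   # about .953
jnd_upper = (mu_limen - alpha) / beta
jnd_lower = (-mu_limen - alpha) / beta
jnd = (abs(jnd_upper) + abs(jnd_lower)) / 2  # single jnd estimate

k = jnd / 800                                # Weber constant for x_c = 800 g
```

With the real Table 2.2 proportions in place of the hypothetical lists, the same steps would reproduce the estimates reported in the text.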

2.4 Criticisms

There are many avenues available to criticize Fechner's theory and method of psychophysical measurement, and all of them have been well traveled. It seems helpful to distinguish between two different classes of criticisms. The first class largely accepted (or remained agnostic to) the premise that psychological attributes exist and can be measured according to the indirect approach Fechner had introduced, but took issue with his choice of methods and the


generalizations they could afford. The second class largely granted (or remained agnostic to) the utility of Fechner's methods as a means of gathering interesting physiological data, but posed a more fundamental question: Is psychological sensation measurable? The argument that sensation, and, for that matter, any psychological attribute, is not measurable has been characterized as the quantity objection (Boring, 1950; Michell, 1999; Heidelberger, 2004). I summarize the crux of the quantity objection first and then turn briefly to a summary of criticisms related to the derivation of Fechner's Law and his empirical method of constructing a scale through the concatenation of experimentally estimated jnd units.

2.4.1 The Quantity Objection

In the United States, an early and oft-cited articulation of the quantity objection came from the father of American psychology, William James, in his assessment of Fechner's psychophysics: "Our feeling of pink is surely not a portion of our feeling of scarlet; nor does the light of an electric arc seem to contain that of a tallow-candle in itself" (James, 1890, as cited by Boring, 1950). The problem, first described in a critique by the French mathematician Jules Tannery in 1875, was that on logical grounds, psychological attributes lack the defining feature of classical measurement: additivity. The concept of additivity, which I introduced in Chapter 1, requires the ability to compose the magnitude of a quantity as a sum of parts. To establish additivity, it must be the case that an attribute can be decomposed and recomposed with parts that are homogeneous. The idea behind homogeneity is that the amount being added to create a sum is always of the same kind. Tannery provides two delightful examples, as conveyed by Heidelberger (2004), that illustrate the problem in the context of psychological attributes. Regarding additivity, Tannery considers the sensation of beauty:

We believe that [the sculpture] Venus de Milo is beautiful. Which spatial dimensions could be used to measure her beauty? Assuming that we were to find her arms, would the beauty of her arms be the same kind of beauty as that of the entire statue? Assuming that we re-attached the arms to the statue, would we have added one aesthetic sensation to another? Would the final sensation be of the same kind that we had for the original torso? (Heidelberger, 2004, 208–209)

Regarding the lack of homogeneity, Tannery considers the sensation of heat:

If you hold an object in your hand and the heat of that object increases, at some point the threshold of pain is reached. The original sensation (heat) is of [an] entirely different kind than the final sensation (pain).


Entirely different nerves are involved in those sensations. And what is true for these two extremely different sensations also holds for all those in between, albeit to a lesser degree. (Heidelberger, 2004, 209)

One of Fechner's core principles was that all measurement is premised on forming an equality between an attribute of interest and a practically convenient extensive standard. Since the two were seldom—if ever—completely identical, all measurement is to a greater or lesser extent based on an indirect comparison. A measurement formula establishes a functional relationship between the attribute of interest and the units defined to "count up" the magnitude of the attribute. However, critiques from Johannes von Kries in 1882 (Niall, 1995), Adolf Elsas in 1886, and Hermann Helmholtz in 1887 pointed to a seemingly crucial distinction: in the physical sciences, the onus is on the scientist to develop a causal relationship between the quantitative attribute and the observations being used to make an inference about its magnitude. We are willing to regard an increase in the volume of mercury within a thermometer as a basis for measuring temperature not because there is a causal arrow running from the thermometer (which we can see) to temperature (which we can only sense) but because we believe that the arrow points in the other direction. We invert the causal relationship to do measurement, but there is an implicit promissory note attached, one that reflects the ongoing efforts to uncover and understand the causal mechanism underlying the relationship. Ideally, as our theoretical understanding of the attribute increases, this gets reflected in the engineering of instrumentation with increasingly accurate units of measurement. From this vantage point, there is a glaring problem with Fechner's measurement formula: it is—at best—correlational, not causal.
That is, the psychological attribute of interest was sensation, but clearly, it would be silly to assert that sensation causes a physical stimulus. If anything, the claim must be that a physical stimulus causes psychological sensation. One can choose to define new units for the physical stimulus in terms of observed differences in sensation, as Fechner was doing, but the units would be completely arbitrary, and it represented a criticism that Boring (1921) would refer to as "the stimulus error"—the mistaken conclusion that a novel attribute was being measured that was distinct from its physical source. The quantity objection, as articulated by the mathematical physicists and philosophers of science who took stock of Fechner's ideas, reimposed the sharp divide between the physical and psychological that Fechner was attempting to break:

Mathematical psychology, psychophysics and physiological psychology—three absurd names! Mathematics cannot be applied any more than the concepts of movement and force can be applied; physics ends where


causality no longer rules; and physiology has no further purpose, once it has finished measuring an organism. . . . And sensation? It is not an object of scientific knowledge; it is not part of nature; it has no reality for the mathematical physicist; it cannot be treated mathematically as a quantum. We claim this with all severity. (Elsas, 1886, as cited by Heidelberger, 2004, 231)

The quantity objection poses two related challenges to psychophysical measurement. The first challenge, captured above, is about the structure of the attribute. Since psychological attributes, such as sensation, are latent intensities (i.e., intensive attributes), there is no obvious way to show that they are additive, as could be readily demonstrated for an extensive attribute. Intuitively, it seems easy enough to find compelling examples to dispel the notion that any given psychological attribute is plausibly additive and homogeneous (as James and Tannery demonstrate). However, the same logic could be taken to rule out the measurement of physical attributes that are also intensive (e.g., temperature, electric current). Unless we are prepared to conclude that no intensive attributes are ever measurable, the common challenge, whether an attribute is physical or psychological, is to find an indirect way to satisfy additivity or perhaps to formulate a new theory for what it means to speak of measuring an intensive attribute. The second challenge connected with the quantity objection is about causal theory. That is, the measurement of intensive physical attributes is premised on establishing a causal relationship between the target attribute and other extensive properties that can be measured in units of length, time, and mass. Volume is a function of length; velocity is a function of distance and time; force is a function of mass and acceleration; and so on.
But in psychophysics, there is no theory that relates sensation causally to spatiotemporal units of measurement, and it would seem to defy logic to specify one. How one chooses to address the quantity objection depends to a great extent on matters of metaphysics and epistemology. If one agrees with Elsas in taking what Heidelberger describes as the "mechanistic" view of science, then sensation is not an object of scientific knowledge. If science and mathematics can only apply to attributes that can be objectively observed and understood, then it is probably best for the psychophysicists to pack their bags and go home. But this was a position that was anathema to Fechner's perspective, which was that psychological sensation and physical stimulus were simply two different ways of understanding the same phenomena in the natural world. Subjectivity is not something that can be eliminated by removing the observer and their sensations from the process of measurement. Therefore, to argue that sensation is not an object of scientific knowledge would have been absurd to Fechner. Our psychological sensations do not cause physical stimuli, but they provide the motivation for deciding what physical attributes are worth measuring in the first place and how we choose to measure them. Beyond this, while Fechner's approach


to outer psychophysics was correlative (in that he regarded sensation and stimulus as two different measures of the same kinetic energy), if combined with his inner psychophysics it did have a causal form: X → θ → Y. The causal relationships involving θ (neural excitation) were inaccessible to Fechner at the time, but that might have been expected to change with advances in science. All this might have seemed a bit far-fetched to the mathematical physicists and aspiring metrologists of the mid- to late 19th century, as they were focused on understanding the natural world through the relationships of physical attributes and on establishing a coherent system of units to make this possible. And indeed, it still may seem far-fetched over 160 years later in the 21st century. But the line between far-fetched and completely impossible will depend on whether one views the study of θ → Y as a legitimate subject for scientific inquiry. In short, Fechner's psychophysics is at least theoretically reconcilable with the causal challenge of the quantity objection. Heidelberger's discussion of the way that the Austrian physicist Ernst Mach (1838–1916) incorporated Fechner's principles of measurement into his own theory of physical measurement suggests another possible escape route from the causal challenge. Mach, like Fechner, would argue that to the extent that measurement activities are part of our daily experiences, this happens primarily on the basis of our willingness to substitute the units on a scale we can observe for the unknowable units of a scale we cannot. This was not to say that causation plays no role in measurement but that whether we observe a change in psychological sensations (X) that appear qualitatively distinct or a change in units of temperature in degrees Celsius (Y) that appear quantitatively distinct, they are both plausible effects caused by a change in heat (Z) we will never directly observe.
When we choose to measure heat with a thermometer, we are essentially choosing to substitute the expansion of the volume of some substance for our own sensations because we have come to appreciate through experience that the two are adequately correlated, and we take each one as the effect of a change in heat. This does not mean that one is more "real" than the other, and indeed, there will be times when my mother will say she is cold even when a temperature scale suggests that she should feel warm. To Mach, there is nothing inherently superior to a measuring scale expressed in spatiotemporal units relative to a scale expressed in any other units—this is ultimately a matter of cultural convention since, no matter the choice, the units will be a reflection of an unobserved cause. This is not to say that anything goes, since the challenge of structure still needs to be met. For example, once we have decided to measure temperature according to the expansion of a substance, we still need to (a) choose a substance whose expansion captures the general effect of temperature on all types of matter, (b) determine a formulation that can return us a standard unit with respect to this expansion, and (c) define a rule that maps the expansion of the substance to a series of numbers. We seek to accomplish these three things in a way that


can indirectly satisfy the requirements of additivity even though temperature is an intensive attribute. But Mach's point was that "we are always only dealing with a temperature scale that can be safely and precisely produced, and, in general, compared, but we are never dealing with a temperature scale that is 'real' or 'natural'" (Mach, 1896, as cited by Heidelberger, 2004, 237). There are some good reasons to prefer measuring heat in terms of the temperature readings on a thermometer than in terms of the jnd units of sensation we could derive from psychophysical experiments. In Mach's view, most scientific measuring instruments can be viewed as extensions or improvements of our natural senses, and clearly, the use of a thermometer guarantees that when I speak of the temperature in Colorado and my mother speaks of the temperature in California, we can be sure we are talking about the same attribute. But from a logical perspective, when the attribute of measurement is a latent intensity, there is nothing that prevents us from choosing to measure in units of sensation instead of, say, degrees Fahrenheit. We prefer the thermometer because we trust that it can efficiently and accurately transduce changes in temperature in a way that is more objective and trustworthy than our own senses. The point remains that if we can convince ourselves that the quantity objection can be met when measuring temperature, it can also be met, in principle, when measuring sensation or any other psychological attribute.

2.4.2 Derivation of the Measurement Formula and Estimation of jnds

Apart from the quantity objection, the most common criticism of Fechner's approach can be traced back to many of the assumptions made in the derivation of his measurement formula. The starting point of Fechner's derivation, his "fundamental formula," involved equating a difference in sensation ($\Delta Y$) to any jnd established according to Weber's Law ($c \cdot \frac{\Delta X}{X}$). But the extent to which Weber's Law held for a variety of physical stimuli and the range of magnitudes to which it applied were open questions (Guilford, 1936). Nor was the functional form of the fundamental formula self-evident. Plateau (1872, as cited by Heidelberger [2004]) was the first to suggest that if Weber's law was to be equated to a change in sensation, the change should instead come in the form of the change in the ratio of sensation.¹⁰ To Cobb (1932), Fechner's fundamental formula seemed more a matter of faith than science. Another questionable step in Fechner's derivation was the assumption that the fundamental formula was differentiable. After all, by definition under Weber's Law, there will be a range of values between magnitudes of X that are not discerned, making any functional relationship between X and Y discontinuous. Stadler (1878, as cited by Heidelberger [2004]) was the first to raise this issue, and it was revisited again by Luce and Edwards (1958). Relatedly, consider the intended interpretation of Fechner's measurement formula in practice. Recall


Figure 2.6. The y-axis of this plot is expressed in jnd units. But what if we plug in a weight magnitude that produces a measure that is some proportion of the way between a count of jnds? Between any two centimeters, we have 10 millimeters, but the smaller units that exist between two jnd units can only be imagined. Finally, the simplification of Fechner's measurement formula from Equations 2.6 to 2.10 depends on the assumption that it is possible to identify the location of an absolute threshold of the stimulus that produces no sensation. Doing so could lead to situations in which positive stimulus values below the absolute threshold produce negative sensations, something Fechner would attempt to ascribe to subconscious sensations. As Thurstone ([1927] 1959, 55) would come to argue, "the truth is that no one has yet found a just noticeable difference." What Thurstone meant by this was that the location of a threshold for a jnd was, in the first place, largely arbitrary; in the second place, dependent on the continuity of the normal ogive; and, in the third place, such that different experiments would typically find different jnds associated with the same physical stimuli. In one fascinating example, Stigler (1992) tells the story of how Peirce and Jastrow (1885) conducted what appears to be one of the first instances of a randomized experiment, in which they demonstrated that it was possible to detect differences in weights that fell below the jnd thresholds Fechner had established in his earlier experiments. Other assumptions made in Fechner's experiments were called into question as well, and indeed, in his book, Fechner seemed more interested in establishing the principles behind his approach than he was in defending its generalizability, something that he had promised to explore more fully in a forthcoming book he apparently never completed (Stigler, 1986, 250).

2.5 Fechner's Legacy

Prior to Gustav Fechner's Elements of Psychophysics in 1860, the concept of measurement was broadly understood to consist of the estimate of a numerical value for an extensive attribute in standard units (e.g., space in meters, mass in grams, time in seconds) or the attempt to discover laws that allowed for the expression of intensive attributes in terms of products and powers of these base units. Measurement was the province of the physical sciences. The notion that psychological attributes might be measurable was a matter for philosophical discourse. To the extent that mathematics intersected with psychological practice, it did so within the context of studies of human physiology. By the turn of the 20th century, a field of quantitative psychology was beginning to take shape, and within this field, a broadened conceptualization of measurement was being embraced. Not coincidentally, this captured the attention of a number of prominent mathematicians and physicists, who began the attempt to formalize and


debate the boundaries for what does and does not constitute measurement (cf. the writings of Ernst Mach, Hermann Helmholtz, Otto Hölder, Bertrand Russell, and Norman Campbell). Fechner's psychophysics played a prominent role in prodding these developments. Fechner's naturalistic philosophy that rejected the mind–body dichotomy, and his theory that all measurement depends on the specification of a measurement formula, whether in the context of extensive or intensive attributes, led him to the conviction that a psychological attribute (sensation intensity) was measurable through its relationship to a physical attribute (stimulus magnitude). He also introduced experimental methods whereby this could be achieved through the transformation of empirical evidence of order (e.g., perceived differences in sensations) into a scale with an origin and units of sensation. Fechner's approach was, to a large extent, still quite conservative in the sense that Fechner was not trying to redefine the classical meaning of measurement. He took the measurement of a quantity to consist of finding out how often a unit quantity of the same kind is contained in it. The discovery of such a unit would still need to be derived from physical attributes with a spatiotemporal reference, and it required a theory (i.e., Weber's Law) to inform the anticipated functional relationship between sensation and stimulus. His approach to measurement aroused controversy among mathematicians, physicists, and philosophers of science, but neither was it fully embraced by psychologists of the era (James, 1890; Titchener, 1905; Brown & Thomson, 1921). In particular, his method of establishing a scale composed of concatenated jnd units seems to have been largely rejected by his contemporaries. To the extent that psychophysical measurement was practiced, its emphasis would become the measurement of sensation differences on an interval scale (i.e., $\hat{\mu}_{tc}$ from Equation 2.12).
Michell (1999) offers a critical appraisal of Fechner's legacy. Michell argues that Fechner failed to adequately address the quantity objection for two reasons. First, Fechner did not recognize that the question of whether a psychological attribute is a continuous quantity (what Michell calls the "scientific" task of measurement) is logically prior to the introduction of methods for establishing measurement units (what Michell calls the "instrumental" task of measurement). Second, Fechner did not devise or apply methods that were capable of turning the quantity objection into a matter that could be settled empirically. One such method was the Plateau–Delboeuf method of bisection (Brown & Thomson, 1921), the value of which Fechner may have misunderstood (cf. Michell, 1999; Fechner, [1887] 1987). When paired with the axioms for quantity later introduced by Hölder (1901),¹¹ the method provided an empirical path whereby the quantity objection could be overcome. Michell argues that Fechner was guilty of the fallacy of assuming that evidence indicating a psychological attribute is ordered is all that is needed to conclude that it is measurable on a quantitative continuum (Michell, 2006; 2012a). In this sense, Michell sees Fechner's legacy as establishing a dubious modus operandi, the idea that any


psychological attribute can be measured in units that arrive to us through some statistical smoke and mirrors in a way that obscures rather than addresses the quantity objection. Ultimately, the work of Fechner and Weber that laid the foundation for psychophysics had a clear influence on the thinking and practices of their German academic contemporaries, including Hermann Ebbinghaus (1850–1909), Hermann Helmholtz (1821–1894), and the father of experimental psychology, Wilhelm Wundt (1832–1920). It was Wundt who, in 1879, opened the first laboratory devoted to the study of experimental psychology at the University of Leipzig, and the doctoral degrees available through this laboratory attracted an international array of students, including one American (James Cattell) and one Brit (Charles Spearman) who will figure prominently in subsequent chapters of this book. Another of Wundt's graduate students was Edward Titchener, a British experimental psychologist who eventually joined the faculty at Cornell University in the United States near the turn of the 20th century. Titchener played a prominent role in introducing a new generation of American students to the canon of experimental psychology, first by translating two German textbooks by Külpe and Wundt into English and then by authoring three books on his own, with the last one, Experimental Psychology, first published in 1905, becoming a standard textbook for college and university instruction. It was in large part through Wundt and Titchener that the methods of psychophysical experimentation and the concept of discriminatory judgment as a basis for measurement entered the mainstream of academic instruction in the United States.
The aspect of Fechner's psychophysics that seems to have been most universally embraced was the set of methods of experimentation and statistical analysis he had used to locate jnds and estimate the sensitivity of human discrimination in a way that accounted for random errors in human judgment (see Section 2.3). In the detailed descriptions of his experimental procedures to be found in Elements, Fechner also showed a perspicacious understanding of the need to control for extraneous factors that could be confounded with sensory intensity. He had conducted multifactor versions of his weight comparison experiments in which he controlled for use of left or right hand to lift a container, duration of the experiment, motivation, time of day, order of containers lifted, and the like. Stigler (1986) describes Elements as one of the first formal treatments of methods for experimental design, predating R. A. Fisher's classic work in this area by some 75 years. Fechner's application of the normal distribution was also the beginning of a fairly long tradition in psychometrics of assuming that a human-generated phenomenon could be modeled using the normal distribution. The extent to which such a move is theoretically and empirically defensible would become a matter of debate (cf. Boring, 1920; Kelley, 1923). In the next chapter, we will see how Galton made use of the normal distribution to motivate measurement for the purpose of studying individual differences.


In an article commemorating the 100-year anniversary of Elements, Cyril Burt (1960) summarized Fechner's legacy as follows:

Fechner's true claim to greatness derives, not from his specific 'psychophysical methods' or from his cherished psychophysical law, but from the fact that he was the first to suggest the idea of mental measurement as a basic technique, applicable to the whole field of psychology, and at the same time to recognize that, in psychology in contrast to the more elementary sciences, this would entail the systematic use of statistical procedures. Before 1860 the idea of consistently applying experimental methods and numerical analysis to solve the problems of the psychologist was almost undreamt of; after 1860 they became an essential part of the science. (10)

2.6 Sources and Further Reading

The first volume of Fechner's Elements of Psychophysics and two shorter publications from 1882 and 1887 are the only works I could find that have been translated into English. The best resource I have found for arriving at a deeper understanding of Fechner's theory of measurement described in Section 2.2 is Michael Heidelberger's book Nature From Within (Chapter 6 in particular). Heidelberger provides invaluable secondary analysis of the historical reactions to Fechner's work originally published in German and French. Outside of Heidelberger's book, the next best biographical account of Fechner in English comes from E. G. Boring's A History of Experimental Psychology. Much more has been written in English regarding Weber's Law and Fechner's Law (described in Sections 2.2.2 and 2.2.3) and regarding the experimental and statistical methods of psychophysics that Fechner's work motivated. The best resources for this are part 1 of Brown and Thomson (1921), part 1 of Guilford (1936), and all of Bock and Jones (1968). For more on the history of metrology, see Crease (2011).

APPENDIX Technical Details of Applying the Cumulative Normal Distribution Function as Part of the Constant Method

With the observed proportions p_{tc} given as the result from carrying out a method of constant stimulus experiment, the task remains to get estimates for \mu_{tc} and \sigma_{tc}^2. To do this, first note that for a variable z with mean of 0 and variance of 1, the conventional form of a unit normal cdf is

P(Z < z) = \Phi(z) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{z} \exp\left[-\frac{1}{2}y^2\right] dy.

By symmetry of the normal distribution,

P(Z > z) = \Phi(-z) = \frac{1}{\sqrt{2\pi}} \int_{z}^{\infty} \exp\left[-\frac{1}{2}y^2\right] dy.  (A.1)

Given an estimate p for P(Z > z),

\Phi^{-1}(p) = -z.  (A.2)

Now in the context of the method of constant stimulus experiment, for any pair of stimuli,

z_{tc} = \frac{(Y_t - Y_c) - (\mu_t - \mu_c)}{\sqrt{\sigma_t^2 + \sigma_c^2}} = \frac{Y_{tc} - \mu_{tc}}{\sigma_{tc}}.  (A.3)

The expression can be further simplified in two ways. First, the expected proportion of correct judgments is always relative to an arbitrary fixed threshold,


typically set at Y_{tc} = 0. Second, we may assume, as Fechner assumed, that the variance of the error distribution that factors into sensory intensity comparisons is a constant. This assumption has some important implications. Recall that for any pair \{x_c, x_t\}, \sigma_{tc}^2 = \sigma_t^2 + \sigma_c^2. Now, consider two different pairs, \{x_c, x_1\} and \{x_c, x_2\}. The assumption of constant error variance implies that \sigma_{1c}^2 = \sigma_{2c}^2, which means that \sigma_1^2 + \sigma_c^2 = \sigma_2^2 + \sigma_c^2, and therefore, \sigma_1^2 = \sigma_2^2. Hence, for any pair of stimulus magnitudes, \{x_c, x_t\}, \sigma_t^2 = \sigma_c^2. We can specify a common unit12 for a sensory intensity difference scale by designating \sigma_t^2 = 1. It follows that \sigma_{tc} = \sqrt{2}, and therefore,

z_{tc} = \frac{(Y_t - Y_c) - (\mu_t - \mu_c)}{\sqrt{\sigma_t^2 + \sigma_c^2}} = \frac{Y_{tc} - \mu_{tc}}{\sigma_{tc}} = -\frac{\mu_{tc}}{\sqrt{2}}.  (A.4)

Because we only observe p_{tc}, the proportion of times that x_t is judged to be greater than x_c, \mu_{tc} is unknown. We can get an estimate for \mu_{tc} by getting an estimate for z_{tc}. Applying Equation A.2, we get \Phi^{-1}(p_{tc}) = -\hat{z}_{tc}. Therefore,

-\hat{z}_{tc} = \frac{\hat{\mu}_{tc}}{\sqrt{2}},  (A.5)

and so

\hat{\mu}_{tc} = \Phi^{-1}(p_{tc})\sqrt{2} = -\hat{z}_{tc}\sqrt{2}.  (A.6)
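The estimator in Equation A.6 is simple enough to sketch in a few lines of code. The following is my own illustration (not from the book), using Python’s standard-library NormalDist for the inverse normal cdf \Phi^{-1}; the function name mu_hat is mine:

```python
from statistics import NormalDist

SQRT2 = 2 ** 0.5

def mu_hat(p_tc: float) -> float:
    """Estimate mu_tc from the observed proportion p_tc of trials on which
    x_t was judged greater than x_c, per Equation A.6:
    mu_hat = Phi^{-1}(p_tc) * sqrt(2)."""
    return NormalDist().inv_cdf(p_tc) * SQRT2

# Indistinguishable stimuli: p_tc = 0.5 gives mu_hat = 0.
print(mu_hat(0.5))            # 0.0
# x_t judged greater on 84% of trials: a positive sensation difference.
print(round(mu_hat(0.84), 2))
```

In the constant method, this estimate would be computed once for each comparison stimulus x_t paired against the standard x_c.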

Notes
1 From Fechner’s handwritten biography from the Darmstadt Collection in the manuscript department of the National Library of Berlin, as transcribed and cited by Heidelberger (2004, 321).
2 According to Heidelberger (2004, 24) “between 1822 and 1838 he produced between fifteen hundred and two thousand printed pages of text yearly as a source of income.”
3 My reading of this book comes from an English translation of the first volume in 1966 by Helmut Adler. There are two other volumes that have not been translated to English. A second edition of all three volumes of the book was published in 1889, two years after Fechner’s death, and a third edition was released in 1907, with Wilhelm Wundt listed as the second author.
4 In this section, I largely follow the approach taken by Guilford (1936, 152–154), Boring (1950, 287–289), and Heidelberger (2004, 202–203) in their presentations of Fechner’s derivation.
5 The presentation of Fechner’s approach here represents a significant departure from the mathematical notation and narrative that can be found in the English translation of his original book (Fechner, 1860 [1966], 85–89). For an account of Fechner’s methods that more directly translates Fechner’s notation into more modern statistical notation and that delves into other details in his factorial designs, see Stigler (1986, 242–254).


6 In Fechner’s specific implementation he treated the order in which the two stimuli were presented as a potential source of bias to be explored in a factorial design, but he did not randomize the order of presentation because he was both the experimenter and the subject of the experiment. As such, he was aware of the actual magnitudes of the stimuli he was comparing. In more modern times this would be an unacceptable practice, but Fechner believed he could be objective in reporting his judgment with respect to the comparisons of sensation intensity evoked. Fechner also allowed for a middle category when he was uncertain which of the two stimuli was of greater magnitude. He allocated these judgments equally to both the x_t > x_c and x_t < x_c categories. Later implementations of the approach typically forced subjects into one of the two categories, and this is the version of the experiment I recount here.
7 As I discuss in what follows, both the single and multiple judgment designs could be used to motivate a statistical model for chance error in the recorded judgments. With that said, there are some tradeoffs between the two approaches with respect not only to the necessary sample size of subjects (a larger sample of subjects is needed in the single judgment design) but also to assumptions of independence between the judgments (a multiple judgment design makes assumptions of independence less tenable). I don’t go into this here, but see Bock and Jones (1968) for details.
8 While the conditions in this scenario are fictitious, the proportions shown as results approximate those from an unpublished experiment on loudness discrimination reported by Eugene Galanter in Luce and Galanter (1963, 196). Luce and Galanter use these same data to illustrate the fitting of a normal ogive response curve in what was known as the “Phi-Gamma hypothesis.”
9 See the Appendix for details.
10 We will see how this criticism would be revisited by Stevens in Chapter 10. Note that changes to the fundamental formula would lead to changes to the measurement formula, and this led Stevens to argue for a power law.
11 For English translations see Michell and Ernst (1996; 1997).
12 Another way to do this would be to fix \sigma_{tc}^2 = 1, in which case \sigma_t^2 = \sigma_c^2 = 1/2. This is the approach taken by Bock and Jones (1968).

3 WHENEVER YOU CAN, COUNT
Francis Galton and the Measurement of Individual Differences

3.1 Overview

In the psychophysics tradition initiated by Gustav Fechner, measurement was premised on the validity of laws that relate the functioning of human sensory attributes to associated physical attributes. The basis for the units of sensation in Fechner’s original work came from the observation of intraindividual variability, which Fechner modeled according to the law of errors (i.e., the inverse of the cumulative normal distribution). Although interindividual variability in sensory discrimination was recognized, psychophysicists showed little interest in describing this variability across some population of respondents or attempting to understand or explain the causes of this variability. In this chapter, we examine the roots of a distinct though complementary tradition for measurement in the human sciences, one premised on the view that the need for measurement is to explain the causes and consequences of individual differences. The origin of this tradition comes from Francis Galton (1889), who, in his most important book, would write:

It is difficult to understand why statisticians commonly limit their inquiries to Averages, and do not revel in more comprehensive views. Their souls seem as dull to the charm of variety as that of the native of one of our flat English counties, whose retrospect of Switzerland was that, if its mountains could be thrown into its lakes, two nuisances would be got rid of at once. An Average is but a solitary fact, whereas if a single other fact be added to it, an entire Normal Scheme, which nearly corresponds to the observed one, starts potentially into existence. Some people hate the very name of statistics, but I find them full of beauty and interest.

DOI: 10.1201/9780429275326-3


Whenever they are not brutalised, but delicately handled by the higher methods, and are warily interpreted, their power of dealing with complicated phenomena is extraordinary. They are the only tools by which an opening can be cut through the formidable thicket of difficulties that bars the path of those who pursue the Science of man. (62–63)

This passage, taken from Galton’s 1889 book Natural Inheritance, captures both Galton’s sense of humor and his evident fascination with variability and statistics. It was Galton, spurred by his interest in finding a mathematical law underlying the mechanism of human heredity, who was most responsible for inventing and popularizing practical, efficient, and ingenious methods for the measurement of human attributes. It was Galton, like Fechner, who became captivated by the potential of the normal distribution as a tool for psychological measurement. But Galton did far more than Fechner to promote the ubiquity of the normal distribution, and he was the first to argue that a standard deviation could be treated as a unit of measurement. It was Galton who discovered the phenomenon of regression to the mean and then generalized the phenomenon by expressing the regression relationship between any two standardized variables in terms of an index of correlation. Finally, it was Galton who embraced the utopian (or dystopian depending on one’s point of view) vision of a society in which all important human abilities would be measured through the periodic administration of surveys and tests. The value proposition for Galton was that only through measurement would it be possible to monitor societal progress and create incentives that would facilitate this progress. For the most part, Galton, in contrast to Fechner, was not someone who viewed the measurability of human attributes as a question of philosophical or scientific interest in and of itself.
Rather, Galton saw measurement as a means to an end, a requirement for scientific investigation of both natural and social phenomena. To the extent that a precondition for measurement was necessary, it was that the attribute in question be plausibly orderable (i.e., it must be possible through observation to judge that some people have more or less of an attribute than other people). To the extent that standard units for measurement were unavailable, they could be invented and justified by their practical utility. Galton’s favorite saying was “whenever you can, count,” and along these lines, he set out to show that in many cases, the counting up of frequencies in and of itself could form a basis for measurement. Not only did Galton endorse the quantitative imperative of the time, but more than any other figure of the 19th century and perhaps even beyond, he was at the root of it:

Psychometry, it is hardly necessary to say, means the art of imposing measurement and number upon operations of the mind, as in the practice of determining the reaction-time of different persons. I propose in this


memoir to give a new instance of psychometry, and a few of its results. They may not be of any very great novelty or importance, but they are at least definite, and admit of verification; therefore I trust it requires no apology for offering them to the readers of this Journal, who will be prepared to agree to the view, that until the phenomena of any branch of knowledge have been submitted to measurement and number, it cannot assume the status and dignity of a science. (Galton, 1879b, 148, emphasis added)

In this and the next chapter, we will come to understand how Galton’s approach to measurement paved the way for the quantitative study of individual differences as taken up at the turn of the 20th century by Karl Pearson and Charles Spearman in England and by James Cattell and Edward Thorndike in the United States. The next two sections of this chapter (3.2 and 3.3) provide context with respect to Galton’s background and three important ideas in the 19th-century context of mathematics and statistics that shaped his interests in interindividual differences and their measurement. Sections 3.4 and 3.5 then focus attention on Galton’s use of the normal distribution as a method of “relative” measurement and the distinctions he made between relative and absolute measurement. In the next chapter, we turn to Galton’s efforts to promote an all-encompassing instrumental approach to human measurement for social study through the development of anthropometric laboratories, and his discoveries of the statistical phenomena of regression and correlation. We then reflect on the overarching ambition that tied together Galton’s many inquiries and investigations over the roughly 40 years during the second half of his life, the study of eugenics. Galton’s discovery of regression and correlation is rightfully considered one of the great triumphs in the history of statistics (Stigler, 1999).
Unfortunately, this and Galton’s many other accomplishments cannot be conveniently divorced from their use in propping up eugenics as an objective science and the negative consequences of the social policies that came to be justified, in part, on this supposedly scientific basis.

3.2 Galton’s Background

3.2.1 The Polymath

Although it may seem that there is little that one can write about Galton that has not already been written elsewhere, I believe that his role in charting a course for measurement in the human sciences is, if anything, underappreciated. His fingerprints1 are seemingly everywhere. For someone in the 21st century hearing of Galton for the first time, an immediate point of contact might be the opening sentence of Galton’s biography on the internet encyclopedia Wikipedia:


Sir Francis Galton, Fellow of the Royal Society (16 February 1822–17 January 1911) was an English Victorian era statistician, progressive, polymath, sociologist, psychologist, anthropologist, eugenicist, tropical explorer, geographer, inventor, meteorologist, proto-geneticist, and psychometrician.

Of all these descriptors, polymath, which connotes a person of wide-ranging knowledge and learning, seems most apt. Galton’s life story has the makings of a Hollywood movie, although it might be hard to settle on the genre. Would the story emphasize the adventure of his exploration of uncharted territory in Africa between 1850 and 1852? This and other travel became the basis for books Galton wrote on geography and travel, led to honors and membership as part of England’s Royal Geographic Society, and established him as an elite member of intellectual high society. Surely a Galton movie would need to have at least some elements of a comedy. How else to incorporate tales of Galton’s many eccentricities? This was a man who surreptitiously (and creepily) set about rating the beauty of the women he encountered in different British cities by poking strategically positioned holes in a piece of paper hidden in his coat pocket.2 He would also devise a method to measure the boredom of an audience during a Royal Society lecture by the number of “fidgets” observed per minute (Galton, 1908, 278). Would the movie show Galton as an iconoclast and equal opportunity offender? Only Galton would publish an empirically driven argument that prayer had no effect on good health or life span (Galton, 1872; 1883). One would surely be tempted to pitch the heart of the movie as a story of discovery and triumph, given the accomplishments to which I have already alluded. Galton’s discoveries revolutionized the way that human data are gathered and analyzed for comparative purposes. To be sure, a Galton theatrical biopic could easily incorporate elements of adventure, comedy, irreverence, and discovery.
And yet, it is not at all clear that such a story should be given a happy ending, given Galton’s promotion of eugenics as not just a field of study but also a basis for social policy. Perhaps the movie would end on a note of irony, because for all his undeniable genius, Galton, although married, went through his own life childless. But we have gotten a bit ahead of ourselves.

3.2.2 Nature and Nurture

Francis Galton was born in Birmingham, England, in 1822, the youngest member of a wealthy and prominent family. His birth may have come as something of a surprise to his 38-year-old parents, given that their family already included six children ranging in age from 7 to 14 and had just recently moved to a three-story country manor (“The Larches”) that could comfortably accommodate them. The elder patriarchs of the Galton family were Francis’s paternal

FIGURE 3.1 Francis Galton (1822–1911). Source: © Getty Images.

grandfather, Samuel Galton, and maternal grandfather, Erasmus Darwin. Samuel Galton (1753–1832) was a Quaker who had successfully followed in his own father’s footsteps (also named Samuel Galton) at an early age to become the head of a business empire that initially began in gun manufacturing. In this capacity, Samuel Galton cultivated not only tremendous wealth but also a longstanding interest in science. In the pursuit of this interest, he would conduct, in his spare time, pioneering research on the concept of primary colors. This research led to his election as Fellow of the Royal Society in 1785 at the age of 32. Erasmus Darwin (1731–1803) was a physician of considerable fame and fortune who can be described, both literally and figuratively, as larger than life. On top of his active medical practice, he was a public intellectual, a progressive who argued in favor of the public education of women and the abolition of


slavery. In 1775 he became a founding member of the Lunar Society, a dinner club and informal society where public intellectuals would regularly gather, and a network that would eventually bring the Galton and Darwin families into regular contact. He was also a prolific writer, poet, and inventor. Erasmus Darwin’s personal appetites and proclivities were also legendary and perhaps a bit notorious, evident in his prodigious girth and the 14 children he fathered collectively from two marriages (Mary Howard and Elizabeth Pole) and one mistress in between the marriages (Mary Parker). The fifth child of his first marriage to Mary Howard was Robert Darwin, a successful botanist in his own right but historically most famous as the father of Charles Darwin, born in 1809. The first child of Erasmus Darwin’s second marriage to Elizabeth Pole was a daughter named Violetta, who would eventually marry Samuel Galton’s eldest son and heir to the family business, Samuel Tertius. By the time Francis was born to Tertius and Violetta, the Galton family business had transitioned from gun manufacturing to banking. For more extensive details and commentary on Francis Galton’s genealogy, see Pearson (1914) and Burt (1962).

Francis Galton was the prince of the family, doted on by his four older sisters. His sister Adèle assumed responsibility for his early education through the age of 5, which, to Galton’s wry recollection in his autobiography, consisted primarily in memorizing the rudiments of Latin grammar and reciting English verse on command, something he apparently mastered with little difficulty. By the age of 6, he was enrolled in a small school in Birmingham and could entertain guests at the Galton home by reading Shakespeare and solving arithmetic problems involving long division.
With his two older brothers choosing careers in farming and the military, Galton’s parents invested in Francis the ambition that he would study medicine in a university setting, graduate with honors, and follow in the footsteps of his grandfather Erasmus Darwin to enjoy a long and fruitful career as a prominent doctor and public intellectual. Boarding schools in France and England followed, and by 1840, at the age of 18, he had already completed two of the four years of study required for a medical degree (the first year at Birmingham’s General Hospital and the second at King’s College in London). At this point, to his father’s chagrin, and at least in part upon the advice of his half cousin Charles, he chose to put his medical studies on a temporary hiatus to pursue a Bachelor of Arts with a specialization in mathematics at Cambridge University. (For more on the relationship between Francis Galton and Charles Darwin, see Fancher, 2009.)

3.2.3 A Brush With Failure

It was during his college years that Galton’s career trajectory took an abrupt turn as a consequence of two unexpected events. The first event was the difficulty he experienced in completing his degree at Cambridge with honors.


The obstacle to this was no trivial matter, as it required finishing among the top 100 students who took the multiday Mathematics Tripos, one of the world’s most challenging examinations (to be described in more detail in Section 3.3). Upon finishing his third year at Cambridge, Galton chose not to sit for the Tripos and ultimately graduated with an ordinary (“poll”) degree in 1843. By 1844, he had returned to his medical studies without much enthusiasm when a second unexpected event took place: the death of his father, Samuel Tertius, with whom he maintained a close and regular correspondence. The resulting inheritance left him a wealthy man, and freed of family obligation, he decided to quit his medical studies altogether. Very little is known of Galton’s life during what Pearson would describe as his “fallow years”3 between 1844 and 1850, but at this formative stage in life, there was little outward sign that Galton would become anything other than a man of leisure in the British countryside, a role that his two brothers would quickly embrace.

Fancher (1985b) argues that Galton’s childhood upbringing combined with his relative failure to excel in his study of mathematics while at Cambridge played a pivotal role in biasing him toward a largely hereditarian explanation for the variability in individual differences he would soon set out to investigate. If Galton’s accounts of his time at Cambridge, based on his letters to his father, can be trusted, it seems clear that he invested his full efforts into his studies. These letters feature one amusing account of Galton’s attempts to stay alert when studying late into the night with the aid of his invention, the “Gumption-Reviver Machine,” a device he invented to drip water onto his head at a gradually increasing hourly rate. Still, he had struggled to keep pace with the intense demands, and this seems to have been at the root of several physical breakdowns.
Did this make Galton more keenly aware of the limits for what could be achieved through effort and hard work? More prone to place weight on the need for a preponderance of natural talent? Fancher wonders whether Galton was describing himself in the opening chapter of his 1869 book Hereditary Genius:

The eager boy, when he first goes to school and confronts intellectual difficulties, is astonished at his progress. He glories in his newly developed mental grip and growing capacity for application, and, it may be, fondly believes it to be within his reach to become one of the heroes who have left their mark upon the history of the world. The years go by; he competes in the examinations of school and college, over and over again with his fellows, and soon finds his place among them. He knows that he can beat such and such of his competitors; that there are some with whom he runs on equal terms, and others whose intellectual feats he cannot approach. (1869, 56–57)


On the other hand, it seems notable that Galton, in letters written to his father, never used a lack of natural talent as an excuse for his decision not to attempt the Tripos. Instead, he seems to have blamed this on the more specialized preparation in mathematics many of his peers had been given while he had been studying medicine and on the bad luck of his recurring bouts of bad health (Pearson, 1914). All that can be said for sure is that Francis Galton, of whom feats of academic success had been an expectation since childhood, fell short of meeting these expectations at Cambridge. The foremost founding father of applied statistics and psychometrics struggled in his study of mathematics.

Between 1850 and 1865, Galton would transform himself from former academic prodigy to explorer, geographer, and meteorologist. The fields of geography and meteorology appear to have appealed to an obsessive tendency to measure. As part of his exploration of southwestern Africa in 1848 (the northern territory of the present-day country of Namibia), he had made regular use of a sextant, a watch, and a compass to situate the locations he encountered with respect to latitude, longitude, and distance. Upon his return to England, he turned these measurements into the first detailed map of the region. He would also write a highly successful book, The Art of Travel. By the early 1860s, he became intensely interested in the prediction of weather patterns in Europe and, to this end, gathered daily data on temperatures at three time points during the month of December from the five European countries that made this available to him. Galton (1863) used these observations to infer the presence of what he would describe as an “anti-cyclone” and, through this, established a meteorological principle that informs weather prediction to the present day.
Up to the early 1860s, however, Galton’s interest in measurement and its applications was restricted to physical objects and related natural phenomena. All of this should give the reader some inkling of Galton’s cultural milieu and some key professional milestones leading up to the accomplishments for which he is most famous, accomplishments that are the subject of this and the following chapter. But it doesn’t fully render the personality that, despite his apparent detachment and lack of empathy for those less fortunate, might make Galton a sympathetic figure. A story told by Galton’s niece Millicent Lethbridge about a trip she took with her “Uncle Frank” in France (found both in the biographies by Pearson, 1930, and Bulmer, 2003) beautifully captures this dichotomy:

I have an amusing recollection of a little trip to Auvergne which he and I took together in the summer of 1904. . . . The heat was terrific, and I felt utterly exhausted, but seeing him perfectly brisk and full of energy in spite of his 82 years, dared not, for very shame, confess to my miserable condition. I recollect one terrible train-journey, when, smothered with dust and panting with heat, I had to bear his reproachful looks for drawing a curtain forward to ward off a little of the blazing sun in which


he was reveling. He drew out a small thermometer which registered 94°, observing: “Yes, only 94°. Are you aware that when the temperature of the air exceeds that of blood-heat, it is apt to be trying?” I could quite believe it!—By and by he asked me whether it would not be pleasant to wash our face and hands? I certainly thought so, but did not see how it was to be done. Then, with perfect simplicity and sublime disregard of appearances and of the astounded looks of the other occupants of our compartment, a very much “got-up” Frenchman and two fashionably dressed Frenchwomen, he proceeded to twist his newspaper into the shape of a washhand-basin, produced an infinitesimally small bit of soap, and poured some water out of a medicine bottle, and we performed our ablutions—I fear I was too self-conscious to enjoy the proceeding, but it never seemed to occur to him that he was doing anything unusual! (A letter from Millicent Lethbridge, daughter of Francis Galton’s sister Adèle, as quoted by Bulmer, 2003, 56–57, and Pearson, 1930, 447)

In summary, Francis Galton was a man of biases and blind spots that affected the topics he chose to study, and these set constraints around the interpretations and inferences he was prepared to make on the basis of these studies. This may explain a proclivity to jump to a number of ill-founded conclusions. But he was also intensely curious, creative, and persistent. Outwardly shy, he was known to his friends, family, and many acquaintances for his generous and even-tempered disposition.

3.2.4 Heredity and Individual Differences

In On the Origin of Species, Charles Darwin had introduced and meticulously documented the concept of biological variability induced by natural selection. Although the concept evoked resentment and derision among the many intellectuals associated with the Church of England, in his half cousin Francis, Darwin had an immediate convert. Galton had long admired Darwin and, throughout his life, maintained a regular correspondence with him, one that would become more intense in their later years. Galton (1908) would recall the immediate impact of Darwin’s book:

The publication in 1859 of the Origin of Species by Charles Darwin made a marked epoch in my own mental development, as in that of human thought generally. Its effect was to demolish a multitude of dogmatic barriers in a single stroke, and to arouse a spirit of rebellion against all ancient authorities whose positive and unauthenticated statements were contradicted by modern science. . . . I felt little difficulty in connection with the Origin of Species, but devoured its contents and assimilated them as fast as they were devoured, a fact which perhaps may be ascribed to


an hereditary bent of mind that both its illustrious author and myself have inherited from our common grandfather, Dr. Erasmus Darwin. (287–288)

An implication of Darwin’s theory was that progress among the human species was more than the result of countless games of chance with God in the role of the card dealer. Nature makes implicit choices, and over time, these choices govern the evolution of different animal species. Darwin, of course, believed evolutionary change transpired gradually over many thousands of years, and his evidence related to such changes had been specific to physical attributes. Galton was hardly alone in wondering whether such changes could be hastened if selective choices were to be made in the process of human procreation. Where Galton distinguished himself from the outset was in his conviction that it was not just physical attributes that were heritable and subject to evolutionary pressure but all variety of intellectual and moral attributes as well. That the physical characteristics of children were, to a large extent, inherited from their parents was common knowledge at the time, even if the mechanism of heredity transfer was not.4 But psychological characteristics? Contemporary views about theories of the mind had been strongly influenced by John Locke (1632–1704) and one of Galton’s most famous senior contemporaries, the philosopher and political economist John Stuart Mill (1806–1873). In the 1843 book A System of Logic, Mill had presented a theory of associationism, which held that from birth through adulthood the mind is almost entirely shaped by the experiences and associations made through the environment to which a person is exposed (Fancher, 1985b).
Galton (1869) took issue with an extreme rendition of this theory:

I have no patience with the hypothesis occasionally expressed, and often implied, especially in tales written to teach children to be good, that babies are born pretty much alike, and that the sole agencies in creating differences between boy and boy, and man and man, are steady application and moral effort. It is in the most unqualified manner that I object to pretensions of natural equality. The experiences of the nursery, the school, the University, and of professional careers, are a chain of proofs to the contrary. (14)

Galton’s studies related to the workings of human heredity and his study of anthropometrics cover a 25-year span between 1864 and 1889, while he was between the ages of 42 and 67. His findings along the way would be announced by presentations at the Royal Society and other professional gatherings, and these, in turn, would be disseminated either as part of the written proceedings or published in popular magazines and journals (the journal Nature was one of Galton’s most frequent outlets). The chief results of his studies can be traced

Whenever You Can, Count

73

through the four books he published during this period: Hereditary Genius (1869), English Men of Science (1874), Inquiries Into Human Faculties (1883), and Natural Inheritance (1889). His use of statistical methods to support these inquiries laid a methodological foundation for the study of individual differences in the social, behavioral, and biological sciences still firmly in place well into the 21st century.5

3.3 Three Influences on Galton's Thinking

3.3.1 Quetelet's Social Physics

The Belgian astronomer Adolphe Quetelet (1796–1874) was the first to suggest that the normal distribution could be applied to explain deviations in the physical characteristics of humans. Quetelet had first taken this approach and presented his findings in the 1835 book A Treatise on Man and the Development of His Faculties. Most famously, he had collected and reported upon the chest circumferences of roughly 6,000 Scottish military conscripts and the heights of roughly 100,000 French military conscripts. In tabulating and plotting these data, Quetelet would argue that the observed values generally followed the values that would be predicted by the normal distribution.

As Bulmer (2003) points out, Quetelet offered two different ways of interpreting a mean in the presence of deviations around that mean. Under the first interpretation, the mean is in fact the true value of interest, that is, the actual location of a celestial object at a fixed point in time. Variability around this true value is the result of the errors made when taking repeated measurements. Under a second interpretation, the values being summarized pertain to unique objects, so there is no true value of interest, only a descriptive mathematical fact that arises from computing an arithmetic mean.6 Because Quetelet assumed that a law of errors only applied under the first of these two interpretations, the repeated-measures context, the symmetric bell-shaped deviations surrounding this average were interpreted by Quetelet as the result of nature aiming for a true value of some physical characteristic of the "average man" but, by chance, missing the mark. Quetelet coined the term "social physics" for his approach of using probability theory, in general, and the law of errors, in particular, to make inferences about social phenomena, and it was an idea that Galton would employ in his own investigations, particularly with regard to the workings of human heredity.
Galton read about these ideas in Quetelet's 1849 publication Letters on Probabilities. Much like Quetelet, Galton was struck by what he came to view as the ubiquity of the normal distribution, and Galton did even more than Quetelet to popularize its application. However, while Galton would seize on Quetelet's general finding regarding the distribution of physical characteristics, he came to a different conclusion about the parameter of greatest interest. For Galton, it was not the mean but the deviations that would be critical to the quantitative study of heritability.

3.3.2 The Quincunx

Galton’s talents were most visible in his ability to take a complicated idea and use analogy and metaphor to communicate the essence of the idea to a broader audience. There is probably no better example of this than the “apparatus” he designed to “illustrate the principle of the law of error or dispersion” (Stigler, 1986, 277). The core of this apparatus, depicted as a graphic illustration in Figure 3.2, was a series of rows composed of equally spaced pins, with each successive row shifted to the right or to the left so that if a small metal ball were to be dropped into the frst row from above, it would have to strike a pin in each successive row while completing its downward descent. These rows of pins were mounted upright on a wooden board and enclosed within glass. On the top end of the rows of pins was a funnel; on the bottom end, a series of parallel compartments of equal height. The fnal ingredient to the apparatus was a collection of “shot”—metal balls with a circumference small enough to allow each ball to ft between the space between any two pins in the same row. When the apparatus was turned upside down, all the shot would accumulate above the funnel. Once turned right side up, the shot would descend through the rows of pins, with each metal ball eventually landing within one of the parallel

FIGURE 3.2

The First of Three Quincunx Illustrations Galton Included in Natural

Inheritance. Source: Galton (1889, 63).

Whenever You Can, Count

75

compartments at the bottom. By the time the shot had all been collected, its distribution took on the telltale shape of the normal curve. This was Galton’s quincunx, named after a Roman coin with a design in which four points make the shape of a square with a ffth point in the center of the square. This basic kernel of fve points resembled the core of any three of Galton’s rows of shifted pins. The same apparatus still exists with diferent names today, often going by the name of a “Galton Board” or a “bean machine.” What Galton had constructed was a means of simulating the results from replications of a binomial probability distribution (see the Appendix of Chapter 1 for details). Consider a single steel ball dropped into the top row of the quincunx. As it strikes a pin in this row, there is an equal chance the ball will fall to the left or the right of that pin before striking a new pin in the next row. This same chance experiment will be repeated for each successive row as the ball strikes a pin and falls to the left or right. If we count each movement of the ball to the right as akin to a “success” and each movement to the left as akin to a “failure,” then the fnal position of the ball in a compartment will depend on the total number of successes observed relative to failures. The number of rows of the quincunx is akin to the number of trials (n), and the number of balls is akin to the number of replications (R) of the experiment. It is, however, one thing to see this worked out by mathematical formula but quite another to see it demonstrated in action. By releasing shot into the quincunx, Galton was simulating replications of the same experiment in real time and, in so doing, made a relatively abstract thought experiment tangible both to himself and to a live audience. 
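The quincunx's logic can be simulated directly: each ball's final compartment is simply the number of rightward deflections in n independent left-or-right chances, that is, a draw from a binomial distribution. A minimal sketch in Python (the row count, ball count, and seed are illustrative choices, not Galton's):

```python
import random
from collections import Counter

def quincunx(n_rows: int, n_balls: int, seed: int = 0) -> Counter:
    """Simulate Galton's quincunx: each ball deflects left or right with
    equal probability at every row of pins; its final compartment is the
    total number of rightward deflections (a Binomial(n, 1/2) draw)."""
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(n_balls):
        rights = sum(rng.random() < 0.5 for _ in range(n_rows))
        counts[rights] += 1
    return counts

counts = quincunx(n_rows=10, n_balls=10_000)
# The accumulated "shot" approximates the symmetric binomial
# distribution, whose shape approaches the normal curve.
for compartment in range(11):
    print(compartment, "#" * (counts[compartment] // 100))
```

Printing one row of `#` marks per compartment mimics the physical pile of shot: tall in the middle compartments, tapering toward the extremes.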
As the process illustrated by the quincunx can be cast as a special case of the central limit theorem, the same result, a normal distribution, can be predicted for any scenario whether the underlying variable is discrete or continuous, with distributions that are skewed or uniform. Galton would come to argue that even without knowing the exact mechanism that led the traits of parents to be passed on to their children, if any given trait could be conceptualized as the sum of a large number of independent random events, it follows that when this process is replicated amongst humans in some defined population, a normal distribution will result. Although in this chapter we only focus on Galton's use of the quincunx as part of his theoretical justification for the ubiquity of the normal distribution, as Stigler (1986) emphasizes, it was even more important to Galton in discovering the concept of regression. For this he developed a special two-stage version of the quincunx, discussed in the next chapter.
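The claim that a sum of many independent random contributions is approximately normal regardless of the shape of each contribution is the central limit theorem at work, and it is easy to check numerically. A sketch (the exponential "contribution" distribution and the sample sizes are arbitrary illustrative choices, not anything from Galton):

```python
import random
import statistics

def trait_value(rng, n_events=100):
    # A hypothetical trait formed as the sum of many independent,
    # strongly skewed random contributions (exponential draws).
    return sum(rng.expovariate(1.0) for _ in range(n_events))

rng = random.Random(1)
population = [trait_value(rng) for _ in range(20_000)]

mean = statistics.fmean(population)
sd = statistics.pstdev(population)
# A normal distribution puts about 68% of its mass within one
# standard deviation of the mean; the simulated sums come close,
# even though each individual contribution is far from normal.
share = sum(abs(x - mean) <= sd for x in population) / len(population)
print(round(share, 3))
```

The same experiment with uniform or two-valued contributions gives the same qualitative result, which is precisely the generality Galton was appealing to.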

3.3.3 The Cambridge Mathematics Tripos

There is considerable evidence that Galton regarded the academic examinations in both public schools and universities as instruments for the measurement of mental ability. In his writings, largely targeted to a contemporary British readership, Galton often assumes that the nature of these sorts of examinations would have been common knowledge. One of the only examples Galton describes in any detail is the Mathematics Tripos taken by undergraduate students at Cambridge University. The name "Tripos" was a legacy of the oral examinations from the Middle Ages that culminated with a ceremony in which questions were posed to a student representative while the questioner was seated on a three-legged stool or tripod (Ball, 1889). Because these questions required "wrangling" to answer satisfactorily, the top performers were given the title of Wranglers (Bushell, 1960). Although its origin can be dated back to the early 18th century when it was known as the Senate House Examination, by 1824, the Mathematics Tripos had become the defining institutional event for the students who entered Cambridge to study mathematics. As Cambridge was the premier academic institution of the Western world at the time, earning the distinction of Wrangler through one's performance on the Tripos was an immediate ticket to professional prominence (Gascoigne, 1984; Ball, 1889).7

The Mathematics Tripos of Galton's era in the mid- to late 19th century consisted of a total of eight papers over 8 days: a first set of three papers followed by a short break of a few days and then a second set of five papers. Students could only qualify to write the second set of papers if their performance on the first set met some minimal threshold determined by the examiners. Within both sets of papers, the sequence was intended to become increasingly difficult, culminating in the final "problem" paper on each set of days, intentionally written by the examiners to demonstrate their own "inventive capacity."8 In total, the papers took 45 hours to complete, and they were considered such a feat of endurance that preparation included not just the study of books and lecture notes but also physical training in the form of daily walks or hikes.
Within each Tripos paper, students were generally expected to recall and reproduce "bookwork" in the form of famous mathematical theorems and propositions and then to answer questions that tested their comprehension of the bookwork. The exception to this was the problem paper, for which no direct connection to bookwork was likely. The Tripos results were used to group students into ordered classes of distinction. Among the roughly 400 to 450 Cambridge students completing a 3-year bachelor's degree, only a subset would sit for the Mathematics Tripos (recall that Galton, notably, did not). Among those taking the Tripos, only the top 100 would earn honors, and among these, a further distinction was made between the top 40, given the title of Wranglers, and the remaining 60, given the title of Senior or Junior Optimes. The highest scorer was given the title of Senior or First Wrangler, while the lowest scoring Junior Optime was given a ceremonial wooden spoon. (That's right, a wooden spoon.) The first 100 students at each year's graduation ceremony would be introduced according to their "order of merit" as determined by the Tripos, with that year's remaining graduates earning ordinary poll9 degrees. As the order of merit for the first 100 students was a closely kept secret and the stakes were enormous, the ceremony was typically a raucous event. The practice of introducing graduates according to the order of merit was discontinued in 1909, when a number of other reforms to the Tripos were also introduced.

Four Cambridge professors proctored the Tripos. Two served as examiners and two as moderators, with the moderators of one year becoming the examiners of the next. The examiners of each year, who would include famous mathematicians and scientists such as James Clerk Maxwell, Isaac Todhunter, and Arthur Cayley, were responsible for assigning scores or "marks" to each student's written responses, and these marks were at the sole discretion of the examiners. In his first attempt to make a case for the heritability of psychological attributes, Galton (1869) would use the Tripos as a motivating example, writing in Hereditary Genius that

[t]here can hardly be a surer evidence of the enormous difference between the intellectual capacity of men, than the prodigious differences in the numbers of marks obtained by those who gain mathematical honours at Cambridge. . . . The fairness and thoroughness of Cambridge examinations have never had a breath of suspicion cast upon them. (16–17)

It is only thanks to Galton's (1869) inquiries that we have some sense for the distribution of Tripos scores among those who earned honors:

Unfortunately for my purposes, the marks are not published. They are not even assigned on a uniform system, since each examiner is permitted to employ his own scale of marks; but whatever scale he uses, the results as to proportional merit are the same. I am indebted to a Cambridge examiner for a copy of his marks in respect to two examinations, in which the scales of marks were so alike as to make it easy, by a slight proportional adjustment, to compare the two together.
(18)

Galton would go on to show that these scores ranged from a low of 300 marks to a high of 7,500, with a maximum total marks obtainable of 17,000. This puts the difficulty of the Tripos in fairly stark relief: the highest scoring student in the 2 years for which Galton had data earned just 44% of the possible marks. I discuss Galton's interpretation of these results in the next section.

Cambridge students planning to specialize in mathematics were expected to hire a former Wrangler to coach them through the Tripos, and in this sense, to the extent that students received personal instruction in mathematics, it was not from their professors but from their coaches (Macfarlane, 1916). The coaches with a track record of students attaining Wrangler status were in high demand.


One of the most famous of these during Galton's time was the mathematician Edward John Routh, who coached 600 students between 1855 and 1888. During this 33-year span, 28 of the 33 students earning the title of Senior Wrangler had been coached by Routh. Equally famous was the mathematician and geologist William Hopkins, the man who had coached Routh, known as the "Senior Wrangler maker." In all, the Cambridge coaches were as famous and well-known as the university professors, and getting access to the best of them was itself a competitive enterprise and provided some indication of status. Not coincidentally, Galton's father had arranged for him to be coached by William Hopkins (Galton, 1908, 64).

Finally, because it is so rare to be able to point to the accomplishments of women in an era in which relatively few educational and professional opportunities were afforded to them, it is interesting to note that it was during Galton's lifetime that two women, Charlotte Scott and Philippa Fawcett, not only sat for the Tripos but placed eighth and first in 1880 and 1890, respectively. Scott had been the first woman to receive permission to take the Tripos, and the results for both Scott and Fawcett were presented publicly only after the order of merit had been read for the men. The boisterous occasion when it became apparent that Fawcett had scored above the male Senior Wrangler is documented in a letter written by her second cousin:

[T]he gallery was crowded with girls and a few men. . . . The floor was thronged by undergraduates. . . . All the men's names were read first, the Senior Wrangler was much cheered. . . . At last the man who had been reading shouted 'Women'. . . . At last he read Philippa's name, and announced she was 'above the Senior Wrangler'. There was great and prolonged cheering; many of the men turned towards Philippa, who was sitting in the gallery with Miss Clough, and raised their hats.
When the examiner went on with the other names there were cries of ‘Read Miss Fawcett’s name again’ but no attention was paid to this. I don’t think any other women’s names were heard, for the men were making such a tremendous noise . . . (as quoted in Series, 1997/1998)

3.4 The Concept of Relative Measurement

3.4.1 Use of the Normal Distribution in Hereditary Genius

Galton’s frst foray into the measurement of psychological attributes took place as part of his attempt to bolster the thesis frst introduced in his 1865 essay Hereditary Talent and Character. Galton’s thesis, in a nutshell, was as follows. All the distinguishable human attributes and qualities that collectively make it more likely to enjoy success as a productive member of society are

Whenever You Can, Count

79

heritable. This includes not only human physical characteristics such as height, weight, strength, and endurance but also psychological traits including, in Galton’s words, intellectual capacity, a love of mental work, a strong purpose, and considerable ambition. Given this, in Galton’s view, it was a moral responsibility to learn more about the mechanism by which these traits are inherited and to support policies that encourage procreation among those with greater amounts of talent and character and discourage procreation amount those with lesser amounts. In short, the essay was Galton’s mission statement for eugenics. We will postpone a discussion of the moral problems with the utopia that Galton envisioned in this 1865 essay for the end of the next chapter. Even without the beneft of historical hindsight, these problems should have been apparent, and it says a lot about the insulated position Galton enjoyed as a member of England’s intellectual and cultural elite that he saw only the possible benefts without any thorough consideration or accounting for the possible costs. Putting aside the ethical implications of Galton’s thesis, as a matter of scientifc inquiry the essay left much to be desired, as it consisted almost entirely of anecdotal arguments. One large impediment to Galton’s ability to ofer empirical support for his thesis was related to the measurability of psychological attributes. To demonstrate that psychological attributes are heritable in the same sense that physical attributes are heritable, at a minimum, one must be able to show that psychological attributes have some intergenerational association. But how would psychological attributes be measured? The strategy Galton enacted in Hereditary Talent and Character was relatively crude but resourceful. He reasoned that there must be some subset of professionals in any feld that is widely recognized as successful on the basis of their demonstrated abilities. 
Galton would look within such lists for evidence of intergenerational associations. The principal work Galton selected to this end was a reference book called A Million of Facts.10 After constraining his search to 1453 through 1853, Galton found a total of 605 "men of genius" and then set out to determine how many of them shared a common kinship. In total, he found family connections for 102 of the 605, or about 1 in 6. He took the same approach with six other biographical dictionaries,11 finding ratios of 1 in 11, 1 in 3.5, 1 in 6, 1 in 10, 1 in 3, and 1 in 4 (Galton, 1865, 163). Galton made no attempt to validate the criteria the authors of his biographical dictionaries used to include one man over another, no attempt to rank the men who appeared in these dictionaries by eminence, and no attempt to characterize the number of men by profession who would (or should) have been eligible to appear in these biographical dictionaries. In Galton's view, it had only been necessary to establish that "men of genius" in any profession were exceptionally able and relatively rare, and if granted, he could argue that any two biographers would be likely to come up with similar lists. But even Galton could appreciate that he was missing a baseline for the degree of variability to be expected in the mental abilities of some well-defined population. Without this, he had no way to compare the likelihoods he had found for observing a member of some profession on a list of eminence to the likelihood to be expected by chance alone. This was the methodological hole out of which Galton was trying to climb in the years following the publication of his essay, and in his follow-up to the essay, the 1869 book Hereditary Genius, two of the key sources he would turn to for support were the data from two competitive academic examinations and the normal distribution.

Through his connections to Cambridge, Galton had been able to acquire four sets of scores for the top 100 students taking the Mathematics Tripos in different years (200 students in total, in which each set had been marked by a common examiner). Galton's objective was to use these data to demonstrate that (1) even in an instance with a highly self-selected sample of students, there was variability in mathematical ability and (2) there was a substantial and obvious distinction between each year's Senior Wrangler (the student who would be considered the most "eminent") and all other students. In Figure 3.3, I have created a histogram that shows the distribution of the 200 Tripos scores that Galton had collected. The data met both of Galton's objectives rather swimmingly.12 Beyond the evident variability in the marks among examinees, Galton (1869) took particular note of the gap between the Senior Wranglers and the rest of the distribution:

FIGURE 3.3 The Distribution of the Students Earning Honors Over 2 Years of the Cambridge Tripos. (Histogram of frequency against Tripos marks, 0 to 8,000.)

The lowest man in the list of honours gains less than 300 marks; the lowest wrangler gains about 1,500 marks; and the senior wrangler, in one of the lists now before me, gained more than 7,500 marks. Consequently, the lowest wrangler has more than five times the merit of the lowest junior optime, and less than one-fifth the merit of the senior wrangler. . . . I have received from another examiner the marks of a year in which the senior wrangler was conspicuously eminent. He obtained 9,422 marks, whilst the second in the same year—whose merits were by no means inferior to those of second wranglers in general—obtained only 5,642. The man at the bottom of the same honour list had only 309 marks, or one-thirtieth the number of the senior wrangler. . . . Now, I have discussed with practised examiners the question of how far the numbers of marks may be considered as proportionate to the mathematical power of the candidate, and am assured they are strictly proportionate as regards the lower places, but do not afford full justice to the highest. In other words, the senior wranglers above mentioned had more than thirty, or thirty-two times the ability of the lowest men on the lists of honours. They would be able to grapple with problems more than thirty-two times as difficult; or when dealing with subjects of the same difficulty, but intelligible to all, would comprehend them more rapidly in perhaps the square root of that proportion. (18–20)

Galton would go on to point out that if the gap between a Senior Wrangler and lowest Junior Optime was large, the distance between a Senior Wrangler and the typical man of the same age not attending Cambridge was almost unfathomable. But he still lacked a way to attach a number to the probability that a student, admitted to Cambridge and intent on specializing in math, achieves "eminence" by, for example, achieving Wrangler status after sitting for the Tripos. For this he would take inspiration from Quetelet's applications of the normal distribution and a new set of examination results, those he was able to gather for 73 young men who had applied for admission into the Royal Military College at Sandhurst in December of 1868. This represented a relatively well-defined, albeit small, population, without the same self-selection evident among the Cambridge students choosing to sit for the Tripos. The marks Galton examined ranged from a low of 1,600 to a high of 6,500, with an approximate mean of 3,000.
Galton used the range from the mean to the maximum value to get an approximate value for the standard deviation one would expect to observe if the marks were normally distributed. Using these two values, he compared, for each class interval of 700 examination marks, the number of candidates observed against the number predicted.13 These results are reproduced in Figure 3.4. Aside from the truncation of the scores at the bottom of the distribution, Galton (1869) concluded that the actual score distribution was well approximated by the normal distribution, and here, he made a generalization that went well beyond any that had been advanced by Quetelet:

There is, therefore, little room for doubt, if everybody in England had to work up some subject and then to pass before examiners who employed similar figures of merit, that their marks would be found to range, according to the law of deviation from an average, just as rigorously as the heights of French conscripts, or the circumferences of the chests of Scotch soldiers. (33)


FIGURE 3.4 Observed and Expected Distribution of Examination Marks for Candidates Applying to the Royal Military College at Sandhurst, December 1868.

Source: Galton (1869, 33).
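The text does not spell out Galton's exact computation, but one plausible reconstruction is to treat the observed maximum mark as the expected largest of n draws from a normal distribution, solve for the implied standard deviation, and then tabulate the expected count in each 700-mark class interval. A hedged sketch in Python, using only the figures quoted above (n = 73, mean of about 3,000, maximum of 6,500); the quantile rule n/(n + 1) for the expected maximum is an assumption of this reconstruction, not something stated in the source:

```python
from statistics import NormalDist

n = 73           # Sandhurst candidates, December 1868
mean = 3000      # approximate mean mark
max_mark = 6500  # highest observed mark

# Approximate the expected standardized maximum of n normal draws by
# the quantile at probability n / (n + 1), then back out the SD that
# would place the observed maximum at that quantile.
z_max = NormalDist().inv_cdf(n / (n + 1))
sigma = (max_mark - mean) / z_max

# Expected candidate counts in 700-mark class intervals, mirroring
# Galton's observed-versus-predicted comparison (Figure 3.4).
dist = NormalDist(mean, sigma)
for lo in range(1600, 6500, 700):
    expected = n * (dist.cdf(lo + 700) - dist.cdf(lo))
    print(f"{lo}-{lo + 700}: {expected:.1f}")
```

Whatever rule Galton actually used, the point of the exercise is the same as his: with only a mean and a maximum in hand, an assumed normal shape is enough to generate a full set of predicted bin counts to set against the observed ones.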

There was, of course, plenty of room to doubt. As Karl Pearson (1924) would note, these results from Quetelet are

from our present standpoint not very convincing; but supposing they do show that physical measurements may be approximately described in this manner, it does not follow that psychical measurements will also follow this distribution. The only real evidence Galton gives on this point is to show the marks obtained by 72 Civil Service Candidates in fact and in theory. Tested by modern methods the theory fits the facts to the extent that if the theory were true one sample in six would give results more divergent from the theory than the observed facts are. It cannot therefore be said that Galton demonstrates that intellectual ability is distributed according to the normal law of deviations. We are not even certain of that today. (90)

Galton's work in Hereditary Genius using data from the Cambridge Tripos and the Royal Military College admissions exam represents the only time he would collect and analyze the results from something that resembled what today might be characterized as tests of achievement and aptitude, respectively. It is also the only place where we can find an example in which Galton distinguishes, even if only obliquely, between a psychological attribute of interest (i.e., the "mathematical power" of the candidate) and the procedure of administering an examination to return a measure of this attribute in the form of a numeric score. How much Galton appreciated the importance of this distinction is an open question. It seems that he did not, for if he had, then he might have recognized from the start that if one accepts the premise that mental ability is quantitative, or at least that it can be ordered, then there are two different distributional shapes of interest: the frequency distribution of mental ability in a target population (latent) and the distribution of scores elicited by a test instrument (observable). If anything, the results from these instruments should have impressed on Galton the arbitrary nature of an examination "mark" as a unit of measurement, irrespective of the apparent conformity of a collection of marks to the normal distribution14 (Boring, 1920).

Although Galton had clearly jumped to some premature conclusions about the ubiquity of the normal distribution, much of the methodological argument Galton was trying to advance did not require an exact match between observed and theoretical distributions of mental attributes. As with his Hereditary Talent and Character essay, the crux of Hereditary Genius was to show that across a variety of professions (e.g., judges, writers, scientists, poets, musicians, painters, clergymen, wrestlers), if one could locate a list of the members of the profession broadly recognized as eminent, the chance of locating members from the same line of family kinship on that list was far greater than one would predict if eminence in each generation was like a random draw from a normal distribution. In other words, if Galton could be granted the premise that the mental ability for any given profession could be divided into 14 "equal interval" classes (or "grades"), as depicted in Figure 3.5, with eight grades above the average (A, B, C, D, E, F, G, X) and eight grades below (a, b, c, d, e, f, g, x), then if the distribution was even just approximately normal in shape, those who were eminent would be found in the top three classes (F, G, and X), and it would follow that the probability of observing someone in one of these classes by chance would be both predictable and very low.

FIGURE 3.5 The Normal Distribution Generalization in Hereditary Genius. Galton used this to establish the extremely low probability of observing a person of eminence by chance. Source: Galton (1869, 34).

Galton's table, "Classification of Men According to Their Natural Gifts," lists grades of natural ability separated by equal intervals. Its clearly recoverable columns are:

  Grade (below / above average)    Proportion, one in    In each million of the same age
  a / A                            4                     256,791
  b / B                            6                     162,279
  c / C                            16                    63,563
  d / D                            64                    15,696
  e / E                            413                   2,423
  f / F                            4,300                 233
  g / G                            79,000                14
  x / X (all grades beyond g / G)  1,000,000             1
  All grades on one side of the average: 500,000; both sides: 1,000,000.

The proportions of men living at different ages are calculated from the proportions that are true for England and Wales. (Census 1861, Appendix, p. 107.) Example.—The class F contains 1 in every 4,300 men. In other words, there are 233 of that class in each million of men. The same is true of class f. In the whole United Kingdom there are 590 men of class F (and the same number of f) between the ages of 20 and 30; 450 between the ages of 30 and 40; and so on.

Galton could now explain with a sense of precision that had been missing in his 1865 essay, "When I speak of an eminent man, I mean one who has achieved a position that is attained by only 250 persons in each million of men, or by one person in each 4,000" (Galton, 1869, 10). Even if the true probability were much higher (e.g., 1 in 1,000, as Pearson would later argue), if the chances of being rated as eminent were independent of kinship, the probability of observing two or more members from the same family would still be predicted to be near zero. Galton (1869) would describe the generalization afforded him by the normal distribution as follows:

It will, I trust, be clearly understood that the numbers of men in the several classes in my table depend on no uncertain hypothesis. They are determined by the assured law of deviations from an average. It is an absolute fact that if we pick out of each million the one man who is naturally the ablest, and also the one man who is the most stupid, and divide the remaining 999,998 men into fourteen classes, the average ability in each being separated from that of its neighbours by equal grades, then the numbers in each of those classes will, on the average of many millions, be as is stated in the table. The table may be applied to special, just as truly as to general ability. It would be true for every examination that brought out natural gifts, whether held in painting, in music, or in statesmanship.
The proportions between the diferent classes would be identical in all these cases, although the classes would be made up of diferent individuals, according as the examination difered in its purport. (34) Just as in his essay that preceded it, in Hereditary Genius, Galton’s argument was one with serious faws. As Fancher (1985b) notes, even if one accepted his line of reasoning that all human attributes should follow a normal distribution, the fnding of an association between a man’s professional reputation and kinship could be taken as support of a common environmental cause just as easily as a hereditary one.15 Galton also recognized that much depended on the assumption that his lists of eminent men were formed objectively and that eminence was strongly associated with ability. Beyond this, Galton’s rhetoric to the efect that the grades of ability defned according to deviations from a hypothetical mean represented equal intervals was little more than wishful thinking. Still, like Fechner, from a methodological perspective, Galton’s earliest work was already

Whenever You Can, Count

85

notable for its attempt to connect secondary data, as imperfect as it was, to probability theory and, in doing so, adapt Quetelet's methods and the law of errors to the study of individual differences.

3.4.2 A Statistical Scale for Intercomparisons

If Galton’s use of the normal distribution had been relatively ill formed in Hereditary Genius, with only the weakest of justifications for its applicability to psychological attributes, by the time of his publication of Natural Inheritance 20 years later, his use of the normal distribution had become more sophisticated, even if his justification was still largely based on analogy. In fact, Chapters 4 and 5 of Natural Inheritance were, in effect, a miniature textbook for the descriptive application of statistics to the study of individual differences, possibly the first of its kind, targeted to a broad audience. In these chapters, Galton uses common language and minimal mathematics to present graphical and tabular techniques for characterizing the central tendency and variability of a frequency distribution and for comparing groups using the area of an empirically derived cumulative distribution function. It was Galton who invented the terms percentile to describe this area (Galton, 1885b, 276) and ogive to describe the curve that mapped these cumulative percentiles either to the original units of measurement or to a statistical scale defined by standard deviation units (i.e., what was then known as the probable error, see Appendix of Chapter 1; Galton, 1875). Galton had first presented an elaborated approach to measurement using the normal distribution in an 1875 article titled “On Statistics by Intercomparison With Remarks on the Law of Frequency of Error”:

The process of obtaining mean values etc. now consists in measuring each individual with a standard that bears a scale of equal divisions, and afterwards in performing certain arithmetical operations upon the mass of figures derived from these numerous measurements.
I wish to point out that, in order to procure a specimen having, in one sense, the mean value of the quality we are investigating, we do not require any one of the appliances just mentioned: that is, we do not require (1) independent measurements, nor arithmetical operations; we are able to dispense with standards of reference, in the common application of the phrase, being able to create and afterwards indirectly to define them; and (2) it will be explained how a rough division of our standard into a scale of degrees may not infrequently be effected. Therefore it is theoretically possible, in a great degree, to replace the ordinary process of obtaining statistics by another, much simpler in conception, more convenient in certain cases, and of incomparably wider applicability. Nothing more is required for the due performance of this process than to be able to say which of two
objects, placed side by side, or known by description, has the larger share of the quality we are dealing with. (34)

In other words, if a variable were known—or could be assumed—to be normally distributed, Galton was pointing out that the order (rank) of any person in a distribution could be converted into a percentile estimate (i.e., the proportion of people with lower values of the variable), and this percentile could be located on a scale of standard deviation units. These units created a de facto statistical scale even when the only data available came from an ordering of subjects with respect to the attribute of interest by some external observer. In situations in which “absolute” measurement on a ratio scale was possible, it would only be necessary to take the values at the median and the 25th or 75th percentile in order to convert the statistical scale back into the units of the original scale. In situations in which no absolute measurement was possible, “relative” measurement in standard deviation units would be the next best thing and would suffice for most practical purposes. (An illustration of Galton’s approach with simulated data is provided in the Appendix to this chapter.) In summary then, Galton viewed his “method of intercomparison” as a solution to the problem of how to establish a basis for the comparison of individual differences when measurements were produced from procedures that lacked common units. Instead of focusing on the absolute magnitude of a psychological attribute for a given person, one could instead focus on relative deviations of these magnitudes for a given person from the population average:

A knowledge of the distribution of any quality enables us to ascertain the Rank that each man holds among his fellows, in respect to that quality. This is a valuable piece of knowledge in this struggling and competitive world, where success is to the foremost, and failure to the hindmost, irrespective of absolute efficiency.
A blurred vision would be above all price to an individual man in a nation of blind men, though it would hardly enable him to earn his bread elsewhere. (Galton, 1889, 36)

This was Galton’s concept of relative measurement. Placed into the context of Galton’s research into heredity that led to his discoveries of regression and correlation, even if a generation of British fathers and their sons were each given entirely different examinations of mathematical ability, it was only necessary to assume that the scores on both could be approximated by a normal distribution and to show that the deviations in one population (e.g., fathers) predicted deviations in the other (e.g., sons).
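The fathers-and-sons point can be made concrete with a short simulation. The sketch below uses entirely hypothetical data (the exam scales, sample size, and heritability coefficient are all assumptions for illustration, not anything Galton reported): two exams with arbitrary, incommensurable units, each a monotone function of a shared latent ability, still yield correlated deviations once ranks are converted to standard deviation units.

```python
import random
from statistics import NormalDist

random.seed(3)
N = 2000

# Hypothetical data: fathers and sons share a heritable latent ability;
# each generation sits a different exam with its own arbitrary scale.
ability_f = [random.gauss(0, 1) for _ in range(N)]
ability_s = [0.5 * a + random.gauss(0, 0.87) for a in ability_f]
exam_f = [100 + 30 * a + 5 * a**3 for a in ability_f]  # exam A, nonlinear units
exam_s = [500 + 50 * a for a in ability_s]             # exam B, different units

def rank_to_z(scores):
    """Galton-style relative measurement: rank -> percentile -> z-score."""
    nd = NormalDist()
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    z = [0.0] * len(scores)
    for rank, i in enumerate(order, start=1):
        z[i] = nd.inv_cdf((rank - 0.5) / len(scores))
    return z

def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

r = pearson(rank_to_z(exam_f), rank_to_z(exam_s))
print(f"correlation of relative deviations: {r:.2f}")
```

Because the rank-to-z conversion discards the raw units entirely, the resulting correlation reflects only the relative standings in each generation, which is exactly what Galton's relative measurement required.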
What progress had Galton made in his justification for the normality assumption by 1889? His theoretical justification, which was completely absent in Hereditary Genius, now drew on the central limit theorem and simulations using his quincunx. The idea was that if a human attribute could be conceptualized as approximating the sum of numerous independent causes, with each cause having the same probability of being realized, then a normal distribution would be a predictable outcome. His empirical justification now came from primary data he had collected on visitors to an anthropometric laboratory he had established in London in 1884. (The formation of this laboratory and the data Galton collected in it are discussed in more detail in the next chapter.) In Natural Inheritance, he would present the frequency distributions, disaggregated by sex, for measures of height (both standing and sitting), wingspan of the arms, weight (in ordinary indoor street clothes), breathing capacity, strength (of pull and of squeeze), speed of a punch, and visual acuity. The values, expressed in both original and standard deviation units, were presented in cross-tabulations across the deciles of each frequency distribution (and for the 5th and 95th percentiles). His sample sizes by sex ranged from a low of 212 for breathing capacity to a high of 1,013 for height. Galton (1889, 201, Table 3.3) computed, for each percentile, the mean across each variable and then compared this observed value to the value that would be predicted according to the formula for the normal distribution.
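In modern notation, the predicted values in such a comparison come from the inverse of the normal cumulative distribution function. A minimal sketch follows (it borrows the height mean and standard deviation reported in this chapter's Appendix; treating those as the population parameters is an assumption, and this is not Galton's original tabulation):

```python
from statistics import NormalDist

# Normal distribution fitted to the height statistics quoted in the
# Appendix of this chapter (mean 68.1 in., SD 2.5 in.).
height = NormalDist(mu=68.1, sigma=2.5)

# Predicted heights at the 5th, 95th, and decile percentiles, the
# points at which Galton tabulated each anthropometric variable.
percentiles = [5, 10, 20, 30, 40, 50, 60, 70, 80, 90, 95]
predicted = {p: height.inv_cdf(p / 100) for p in percentiles}

for p, v in predicted.items():
    print(f"{p:>2}th percentile: {v:5.1f} in.")
```

Comparing a column of such predicted values against the observed percentile means is, in effect, what Galton's Table 3.3 does.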
To Galton (1889), the closeness of the match vindicated the faith he had expressed in the generality of Quetelet’s findings in Hereditary Genius:

I confess to having been amazed at the extraordinary coincidence between the two bottom lines of Table 3.3, considering the great variety of faculties contained in the 18 Schemes; namely three kinds of linear measurement, besides one of weight, one of capacity, two of strength, one of vision, and one of swiftness. It is obvious that weight cannot really vary at the same rate as height, even allowing for the fact that tall men are often lanky, but the theoretical impossibility is of the less practical importance, as the variations in weight are small compared to the weight itself. . . . Although the several series in Table 3.3 run fairly well together, I should not have dared hope that their irregularities would have balanced one another so beautifully as they have done. It has been objected to some of my former work, especially in Hereditary Genius, that I pushed the applications of the Law of Frequency of Error somewhat too far. I may have done so, rather by incautious phrases than in reality; but I am sure that, with the evidence now before us, the applicability of that law is more than justified within the reasonable limits asked for in the present book. I am satisfied to claim that the Normal Curve is a fair average representation of the Observed Curves during nine-tenths of their course;
that is, for so much of them lies between the grades of 5 degrees and 95 degrees. In particular, the agreement of the Curve of Stature with the Normal Curve is very fair, and forms a mainstay of my inquiry into the laws of Natural Inheritance. (56–57)

Galton’s conclusion specific to height (“the agreement of the Curve of Stature with the Normal Curve is very fair”) has since been replicated many times over in the context of national surveys, and it was his analysis of familial relationships with respect to height that led to his discoveries of regression and correlation. Galton presumed that what he had found for physical attributes must also apply to psychological ones. Yet Galton had gathered no new empirical evidence in this regard beyond the distribution of examination marks he had presented 20 years earlier. What about Galton’s theoretical argument, which hinged on the application of the central limit theorem? The latter is a matter of mathematics, and its results follow deductively any time its assumptions have been met. The two most obvious problems were the assumptions of (1) independent random events (e.g., the length of each bone that is summed to define a person’s stature) and (2) the total number of random events (e.g., the total number of bones). Galton would argue that even when the total number of chance events was relatively small (i.e., 17), the cumulative distribution function for sums using a binomial expansion was almost indistinguishable from that of the normal ogive (Galton, 1875; 1889). His quincunx, which approximated this scenario, provided a visual demonstration that this was the case. In contrast, the possible consequences of a lack of true independence among these events were something Galton recognized but never fully addressed.
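Galton's claim about 17 chance events is easy to check numerically. The sketch below (a modern recomputation, not his original tabulation) compares the exact cumulative distribution of a sum of 17 equiprobable binary events, the scenario his quincunx approximated, with the matching normal ogive:

```python
import math
from statistics import NormalDist

n, p = 17, 0.5  # 17 chance events, each with the same probability

def binom_cdf(k):
    """P(X <= k) for X ~ Binomial(n, p), computed exactly."""
    return sum(math.comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

# Normal ogive with matching mean and SD (continuity-corrected).
approx = NormalDist(mu=n * p, sigma=math.sqrt(n * p * (1 - p)))

max_gap = max(abs(binom_cdf(k) - approx.cdf(k + 0.5)) for k in range(n + 1))
print(f"largest CDF discrepancy: {max_gap:.4f}")
```

The largest discrepancy between the two curves is on the order of a few thousandths, which is why the binomial staircase and the normal ogive were, for Galton's visual purposes, indistinguishable.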
For example, among the many reasons Galton (1889) would come to focus predominant attention on the heritability of height (i.e., stature) in his empirical work was the fact that with a little imagination one could construct a scenario in which human height could be subject to the central limit theorem:

[S]ome of its merits are obvious enough, such as the ease and frequency with which it may be measured, its practical constancy during thirty-five or forty years of middle life, its comparatively small dependence upon differences of bringing up, and its inconsiderable influence on the rate of mortality. Other advantages which are not equally obvious are equally great. One of these is due to the fact that human stature is not a simple element, but a sum of the accumulated lengths or thicknesses of more than a hundred bodily parts, each so distinct from the rest as to have earned a name by which it can be specified. The list includes about fifty separate bones, situated in the skull, the spine, the pelvis, the two legs, and in the two ankles and feet. The bones in both the lower limbs have to be counted, because the Stature depends upon their average length. . . .
The larger the number of these variable elements, the more nearly does the variability of their sum assume a “Normal” character, though the approximation increases only as the square root of the number. The beautiful regularity . . . is due to the number of variable and quasi-independent elements of which Stature is the sum. (83–84)

To the extent Galton was describing a statistical model for a person’s adult height, it was a linear model with a systematic component (e.g., the stature of a person’s most proximal ancestors) and a random component (the sum of many small factors that lead to certain bones being slightly longer or shorter than average). But a theory that these small factors combine independently (or “quasi-independently”) is no more compelling a priori than one that conceptualizes these factors to be strongly interdependent. And it is relatively straightforward to demonstrate, through computer simulation, that strong violations of the independence assumption can produce skewed or even bimodal distributions. As Boring (1920) would later note, there was a tendency for both Quetelet and Galton to deduce the conditions for the central limit theorem from the empirical result that their data showed a relatively good fit to the normal distribution. Along these lines, with one notable exception, Galton would come to view any empirical departures from a normal distribution as evidence of some form of self-selection or censorship of measurements taken from some target population or of some problem with the way that the data were gathered. A key problem is that nature does not usually reveal to us in advance whether some attribute is measurable as a continuous quantity or the fundamental unit in which it is to be measured. Boring (1920) gives the example of size.
One may choose to measure the size of an object in units of length or volume, but if one of these units can be shown to follow a normal distribution, the other (by definition, because length and volume are nonlinearly related), will not. Galton, and many who took up the methods he introduced, assumed implicitly that psychological attributes are continuous quantities and that they have a normal distribution for the population of interest. But neither assumption is necessarily true or easy to evaluate, irrespective of the convenience and practical utility such assumptions would seem to afford.
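The earlier claim that violations of independence can produce non-normal sums is simple to demonstrate with a toy simulation. In the sketch below (an illustrative dependence model chosen for this chapter, not anything Galton or Boring proposed), each binary "cause" either flips its own coin or, with some probability, copies a single shared factor; when the shared factor dominates, the sum of the causes splits into two clumps instead of a bell curve:

```python
import random
from statistics import mean

random.seed(7)
N_FACTORS, N_PEOPLE = 40, 5000

def trait(shared_weight):
    """Sum of N_FACTORS binary 'causes' per person. With probability
    shared_weight a cause copies one shared per-person coin flip;
    otherwise it is an independent coin flip."""
    values = []
    for _ in range(N_PEOPLE):
        common = random.random() < 0.5
        total = 0
        for _ in range(N_FACTORS):
            if random.random() < shared_weight:
                total += common
            else:
                total += random.random() < 0.5
        values.append(total)
    return values

independent = trait(0.0)  # Galton's assumption: a bell-shaped sum
dependent = trait(0.9)    # strongly dependent causes: two clumps

def mass_near_mean(xs, width=4):
    """Share of people whose trait value lies within `width` of the mean."""
    m = mean(xs)
    return sum(abs(x - m) <= width for x in xs) / len(xs)

print(mass_near_mean(independent), mass_near_mean(dependent))
```

With independent causes, most of the distribution sits near the mean, as the normal law predicts; with strongly dependent causes, almost none of it does, because the sums pile up in two widely separated modes.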

3.5 Galton’s Conceptualization of Measurement

In between the publication of Hereditary Genius and Natural Inheritance, Galton devoted considerable thought and attention to the kinds of mental processes and personality characteristics that were good candidates for measurement. He did this as part of a variety of exploratory “inquiries,” the results of which are summarized in Galton’s books English Men of Science: Their Nature and Nurture
(1874) and Inquiries into Human Faculty and its Development (1883), as well as in a series of publications on the subject of anthropometrics (Galton, 1877, 1885a, 1885b). These inquiries and the analyses found therein show much of the same evidence of creativity and ingenuity that characterized most of Galton’s career. For example, Galton was one of the first to use the method of composite photography to investigate whether a person’s facial features could be associated with criminal tendency (he found no clear association), and he was the first to introduce comparisons between twins as a method to assess the relative importance of hereditary and environmental influences (he found environmental influence to be of lesser importance). What is much harder to find in Galton’s writings during this period, or the one that followed up to his death, is a coherent account as to what he viewed as the necessary and sufficient conditions for measurement, and what he saw as the distinguishing features between the measurement of physical and psychological attributes. Although, as we have seen, Galton did draw a contrast between the practices of his relative measurement and “absolute” or “exact” or “actual” measurement, this contrast never seems to have been fully formed. In Galton’s 1908 autobiography, where we might expect to find his mature reflections on the measurement of human attributes, Galton (1908) describes his vision for widespread anthropometric laboratories in which “Human Faculty might be measured so far as possible”:

that its measurements should effectively “sample” a man with reasonable completeness. It should measure absolutely where it was possible, otherwise relatively among his class fellows, the quality of each selected faculty.
The next step would be to estimate the combined effect of these separately measured faculties in any given proportion, and ultimately to ascertain the degree with which the measurement of sample faculties in youth justifies a prophecy of future success in life, using the word “success” in its most liberal meaning. (267)

What Galton meant by relative measurement was clear—this was his method of intercomparison, which relied on transformations of ranks into deviation units under the assumption that the attribute of interest was normally distributed. The only prerequisite for this was the ability of some qualified external observer to make judgments about order:

We can lay down the ogive of any quality, physical or mental, whenever we are capable of judging which of any two members of the group we are engaged upon has the larger amount of that quality. I have called this the method of statistics by intercomparison. There is no bodily or mental attribute in any race16 of individuals that can be so dealt with, whether
our judgment in comparing them is guided by common-sense observation or by actual measurement, which cannot be gripped and consolidated into an ogive with a smooth outline, and thenceforward be treated in discussion as a single object. (Galton, 1883, 52)

Galton’s use of the qualifier actual before measurement is interesting, as it suggests a recognition that his method of intercomparison was not measurement in the way that mathematical physicists of the time might have understood the word. Did Galton view actual measurement as something constrained by the availability of standard units defined by invariant physical laws? Or did Galton see a potential to expand Fechner’s program of research in psychophysics to cover a wider range of psychological attributes beyond those of sensation? There is a hint of this line of thought in the outset of an address Galton gave to the Anthropological Department of the British Association in 1877:

[I]t has of late years become possible to pursue an inquiry into certain fundamental qualities of the mind by the aid of exact measurements. Most of you are aware of the recent progress of what has been termed Psychophysics, or the science of subjecting mental processes to physical measurements and to physical laws. I do not now propose to speak of the laws that have been deduced,17 such as that which is known by the name of Fechner, and its numerous offshoots, including the law of fatigue, but I will briefly allude to a few instances of measurement of mental processes, merely to recall them to your memory. They will sh[o]w what I desire to lay stress upon, that the very foundations of the differences between the mental qualities of man and man admit of being gauged by a scale of inches and a clock.
(4)

What Galton had in mind with respect to actual instances of the measurement of mental processes were essentially tests of reaction time and discrimination, and these were indeed measures Galton “gauged by a scale of inches and a clock” so to speak. In the same address, he would use thermometry as a paradigm for the measurement of psychological attributes:

Wherever we are able to perceive differences by intercomparison, we may reasonably hope that we may at some future time succeed in submitting these differences to measurement. The history of science is the history of such triumphs. I will ask your attention to a very notable instance of this, namely, that of the establishment of the thermometer. You are aware that the possibility of making a standard thermometric scale wholly depends
upon that of determining two fixed points of temperature, the interval between them being graduated into a scale of equal parts. (Galton, 1877, 7)

On a generous reading, premised on his desired parallel with thermometry, one might argue that Galton viewed “absolute” or “actual” or “exact” measurement as the prospective end “triumph” of an ongoing program of research into the measurement of psychological attributes. In such a reading, Galton would have been implying at least three stages of research activities. The first stage would involve (a) attempts by external observers to rank subjects with respect to a well-defined attribute of interest, (b) conversion of the ranks to percentiles or standard deviation units, and (c) efforts to show that what was at this point solely a relative measure would correlate with other measurable human attributes and outcomes of importance. A second stage would involve the development of instrumentation to replace this method of relative measurement with absolute measurement. The outcome of the second stage would likely be, at least at first, some set of numeric values expressed on an arbitrary scale. In a third stage, the objective would be to search for laws connected to the attribute of interest that could establish meaningful reference points along this scale akin to the boiling and freezing points of water. The intervals of a scale for exact or absolute measurement might then be established relative to some division of the distance between these two fixed points. It is worth noting that even in this generous reading of Galton’s meaning, such an approach would have already represented a marked departure from Fechner’s psychophysical program of measurement.
However, if this was Galton’s path from relative to absolute measurement, it leaves many open questions, not the least of which is what one should make of a scale that results at the end of the second stage just described, when an instrument has been devised that produces numeric values with equivocal properties. Galton’s writings are somewhat ambiguous on this point. For example, in an essay on prospects for the measurement of character, Galton (1884) refers to the art of measuring intellectual abilities as “highly developed,” using the metaphor of a ruler to describe the approach taken by examiners to measure the “intellectual performances of the candidates whom they examine” (179). But surely by then Galton had to have been well aware that the scores from academic examinations bore little resemblance to the scale of a thermometer. After all, 15 years earlier, in Hereditary Genius, he had emphasized that adjacent differences in Tripos examination marks among students at different locations of the distribution did not convey the same meaning about differences in students’ mathematical ability. And in Natural Inheritance, Galton gives the example of a fairly crude 5-point scale that had been used to rate former medical students at St. Bartholomew’s Hospital in terms of their professional success following their graduation (the categories of the rating scale were distinguished, considerable, moderate, very limited, failures). In this case, Galton (1889, 47–48) refers to the
resulting measure as no more than an inexact measure (“observations that are barely exact enough to be called measures”) and then shows how such values can be transformed into percentiles or standard units to become a relative measure, provided that they show an approximate fit to the normal distribution. My hunch is that Galton, despite some of his inconsistent terminology and rhetoric, did not view academic examinations as an example of absolute measurement but instead used the qualifiers absolute, exact, or actual to refer to instances in which an attribute was measurable in physical units. A major basis for this hunch is that in none of the published cases in which Galton designed and implemented measuring instruments for psychological attributes did he attempt to establish a scale with units that fell somewhere in between the extremes of the standard deviation units of a relative “statistical” scale and the standard physical units of an “absolute” scale of measurement. This argues against the generous reading of Galton’s thermometry paradigm described earlier. So why, then, did Galton speak so glowingly about the “highly developed” art of measuring intellectual faculties? This is where our dive into the details of the Cambridge Tripos may pay us some dividends. Recall that the Tripos was a comprehensive, multiday examination designed for the sole purpose of producing a valid ranking of students with respect to their mathematical ability. The evidence used for these rankings came in the form of students’ written responses to the same set of prompts. In addition, the observers responsible for scoring these responses were the kinds of professionals (distinguished professors of mathematics) for whom Galton had the utmost respect as external arbiters.
If a prerequisite for Galton’s relative measurement was the availability of valid judgments about order, then the rankings that resulted from any academic examination, to the extent it approximated the Tripos ideal, would have been far more developed than, say, the rankings that derived from the subjective judgments of an author assembling lists of eminent professionals (i.e., what Galton had used in Hereditary Genius). Whether it was to be engaged in Galton’s newly invented relative sense or in the absolute sense that was the hallmark of the physical sciences, for Galton, the measurement of human attributes was first and foremost an instrumental challenge, and the validation of the measures was to be made on the practical grounds of whether they were predictive of later success. It was a challenge that required the ingenuity and persistence to design, construct, and refine instrumentation that could be applied to large numbers of people at scale, in a short amount of time, at minimal cost. For at least the last 30 years of his life, Galton repeatedly used his wealth, reputation, and professional connections to advance this argument in influential public settings. Consider, as an example, an excerpt from Galton’s letter to the editor, published in the journal Nature in 1880:

The observation I desire to make is that every hospital fulfils two purposes, the primary one of relieving the sick, and the secondary one of advancing pathology, so every school might be made not only to fulfil the primary
purpose of educating boys, but also that of advancing many branches of anthropology. The object of schools should be not only to educate, but also to promote directly and indirectly the science of education. It is astonishing how little has been done by the schoolmasters of our great public schools in this direction, notwithstanding their enviable opportunities. I know absolutely of no work written by one of them in which his experiences are classified in the same scientific spirit as hospital cases are by a physician, or as other facts are by the scientific man in whose special line of inquiry they lie. Yet the routine of school work is a daily course of examination. There, if anywhere, the art of putting questions and the practice of answering them is developed to its highest known perfection. In no other place are persons so incessantly and for so long a time under close inspection. Nowhere else are the conditions of antecedents, age, and the present occupation so alike as in the boys of the same form. Schools are almost ideally perfect places for statistical inquiries. . . . If a schoolmaster were now and then found capable and willing to codify in a scientific manner his large experiences of boys, to compare their various moral and intellectual qualities, to classify their natural temperaments, and generally to describe them as a naturalist would describe the fauna of some new land, what excellent psychological work might be accomplished? But all these great opportunities lie neglected. The masters come and go, their experiences are lost, or almost so, and the incidents on which they were founded are forgotten, instead of being stored and rendered accessible to their successors; thus our great schools are like mediaeval hospitals, where case-taking was unknown, where pathological collections were never dreamt of, and where in consequence the art of healing made slow and uncertain advance.
Some schoolmaster may put the inquiry: What are the subjects fitted for investigation in schools? I can only reply: Take any book that bears on psychology, select any subject concerning the intellect, emotions, or senses in which you may feel an interest; think how a knowledge of it might best be advanced either by statistical questioning or by any other kind of observation, consult with others, plan carefully a mode of procedure that shall be as simple as the case admits, then take the inquiry in hand and carry it through. (10)

Here we see Galton’s prophecy for the 20th-century development of the subfield that would come to be known as educational measurement. In the next chapter, we examine Galton’s vision and proof of concept for the implementation of this instrumental approach to the measurement of both physical and psychological attributes as part of a network of anthropometric laboratories. Galton’s message was simple. Define the attribute; create the instrument; make the comparisons. Above all else, whenever you can, count.

APPENDIX An Illustration of Galton’s Method of Intercomparison (Relative Measurement)

To better appreciate Galton’s approach, consider an example of a physical attribute, such as the height of a population of adult men between the ages of 23 and 51, that is a known quantity measurable with respect to the standard unit of inches. Galton collected these measurements from visitors to his Anthropometric Laboratory in 1884, and he found that the mean and standard deviation of height were 68.1 and 2.5 inches, respectively. Now, let’s imagine that we have 20 men taken from this same population, and although we would like to make inferences about their heights, we do not have a stadiometer or even a tape measure available. What we do instead is to order the men from smallest to largest by visual inspection. If height follows a normal distribution, then these ranks can be used to estimate each man’s relative location in the population in standard deviation units. If the population mean and the standard deviation are known, these relative locations can then be converted into estimates of actual height, even though we have not taken absolute measurements of height from any of the 20 men. Figure 3.6 shows the results from simulating 20 height measurements according to this scenario. The first column in the table shows the true heights of these 20 men sorted from shortest to tallest, and the second shows these heights transformed into standard deviation units (i.e., z-scores). For a scenario in which Galton’s method would be applied, these values would be unknown. In fact, only column 3, the rank of each man by height, is observed. In column 4, these ranks are converted into percentiles by dividing each by the total sample size (i.e., 20) and making a small adjustment18 to keep them bounded between 0 and 100. Now, this is where the assumption of normality comes into play. If height is normally distributed, then each man’s percentile location in the height distribution can be associated with that man’s location in the distribution expressed in z-score units.
FIGURE 3.6 Simulated Height Data.

In modern terms, we can find this by inverting the standard normal ogive. The standard normal ogive gives the cumulative percentile of a distribution for any given z-score; by inverting it, we find the z-score for any given cumulative percentile. The result is shown in column 5. Finally, if the mean and the standard deviation of the population are known, we can use these to transform the z-scores into measurements of heights in inches; this is shown in column 6. Note that the values in columns 5 and 6 are estimates. Because the values were simulated, the estimated values can be compared with the true values in columns 1 and 2. Figure 3.7 shows the plot of the estimated z-score values (x-axis) and true z-score values (y-axis) with a line of perfect agreement superimposed. Although the estimates are not perfect, they are very close, with a correlation of .964. The estimates would be even closer with a larger sample size. Galton invented the term ogive for his preferred way of displaying descriptive results from a normally distributed variable. Figure 3.8 compares this empirical relationship between the estimates from columns 4 and 5 with the theoretical relationship we would expect based on a theoretical normal ogive. Again,


FIGURE 3.7 Comparison of Estimated and Actual z-Scores, r = .964.

FIGURE 3.8 Comparison of Theoretical and Empirical Ogives Based on Simulated Data.

we can see that the empirical estimates are closely approximated by the theoretical ones, as we would expect since the data were simulated from a normal distribution. The problem with this approach in the context of psychological attributes is the assumption of equal intervals. For the attribute of height, a standard deviation unit is 2.5 inches. If we compare the heights of two men who are below average in height and two men who are above, and if in both cases the difference is .5 standard deviation units, we can be sure that both differences convey the same information about magnitude. The same need not be the case


when comparisons are being made in standard deviation units for, say, pairs of examination scores that are above and below the mean.

Notes

1 I couldn’t resist the pun given that it was Galton (1892) who came up with both the idea and the method of using fingerprints as a method of identification.

2 “I may here speak of some attempts by myself, made hitherto in too desultory a way, to obtain materials for a ‘Beauty Map’ of the British Isles. Whenever I have occasion to classify the persons I meet into three classes, ‘good, medium, bad,’ I use a needle mounted as a pricker, wherewith to prick holes, unseen, in a piece of paper, torn rudely into a cross with a long leg. I use its upper end for the ‘good,’ the cross-arm for ‘medium,’ the lower end for ‘bad.’ The prick-holes keep distinct, and are easily read off at leisure. The object, place, and date are written on the paper. I used this plan for my beauty data, classifying the girls I passed in streets or elsewhere as attractive, indifferent or repellent. Of course this was a purely individual estimate, but it was consistent, judging from the conformity of different attempts in the same population. I found London to rank highest for beauty; Aberdeen lowest” (Galton, 1908, 324–325).

3 In a timeline Pearson created to demarcate major events in Galton’s life, he gives the period from 1844 to 1849 the simple label “hunting and shooting,” using the same description Galton would provide in his own autobiography.

4 Modern biology and the Mendelian model of inheritance were just around the corner but would not receive widespread attention until the turn of the 20th century.

5 One might mistakenly assume that studies pertaining to heredity occupied all of Galton’s time during this period. Not so. Galton continued to publish studies and inventions specific to his long-standing interests in geography, anthropology, meteorology, and travel throughout this same period.
6 In this sense, we might say that Quetelet was the first to explicate the mathematical concept of a “true score” that would eventually become the basis for the classical theory of measurement reliability. See Chapter 6.

7 The Mathematics Tripos still exists, making it one of the world’s oldest and arguably most famous long-standing high-stakes written examinations. As of 2019, the Tripos consisted of four 3-hour-long papers taken at the culmination of each of 3 years as an undergraduate. See www.maths.cam.ac.uk/undergrad.

8 Forsythe (1935, 176) writes, “Some of these might be ‘doable’: most of them were not. They represented the inventive possibilities of the utmost range of the setter’s fancy, without regard to teaching, or books, or suitability, perhaps fairly described in Pope’s words ‘Tricks to shew the stretch of human brain, Mere curious pleasure or ingenious pain.’”

9 The word poll was a derivative from the Greek expression for mob, hoi polloi.

10 In his essay, Galton mistakenly attributed the authorship of this book to a Sir Thomas Phillips. The actual name of the author was Sir Richard Phillips, and the full title of his book was A million of facts connected with the studies, pursuits and interests of mankind, serving as a common-place book of useful reference on all subjects of research and curiosity: collected from the most respectable modern authorities. Even Galton was not convinced that he had found an entirely credible source, writing, “I do not mean to say that Sir Thomas Phillips’s selection is the best that could be made, for he was a somewhat cro[t]chety writer.”


11 An in-press untitled dictionary he attributed to Charles Hole, Edward Walford’s 1862 book Men of the Time, Michael Bryan’s 1849 Dictionary of Painters and Engravers, and F. J. Fétis’s Biographie Universelle des Musiciens.

12 Galton took these results one step further to argue that they supported his thesis that differences in ability were primarily caused by differences in an inherited talent for mathematics, as opposed to differences in environmental conditions. By Galton’s reasoning, the students taking the Tripos had all received similar instruction while at Cambridge and all had the same incentive to study and prepare for the exam. Hence, the most plausible explanation for the variability in performance was inherited differences in their mathematical ability. But here Galton ignored the very environmental excuses he had himself provided for his poorer than expected performance on the “Little Go” exam in his second year at Cambridge and for his decision not to sit for the Tripos (Pearson, 1914). He also ignores differences in access to the best Tripos coaches, another plausible environmental explanation for some of the observed variability.

13 To compute frequencies for the class interval shown in each row “according to theory,” Galton estimated the mean score to be 3,000, and if the scores were normally distributed, then roughly 50% of the distribution fell between 3,000 and the maximum score of 6,500, which implies an SD of 3,500/3 = 1,167. Hence, given that the area under the normal curve for scores [...]

TABLE 5.1 [table columns not recoverable]

Note: B = basics of communication; LC = concrete language; LA = abstract language and reasoning; S = sensory; M = memory; I = imagery.

Mental Tests and Measuring Scales


would be asked, “Very well, then tell me what it is?” In the more difficult tests, the child would be

• given the word obéissance and asked to find one or more words that rhymed with it (test 24);
• asked to answer a series of sentence completion tasks9 within a short paragraph such as “The weather is clear, the sky is ______” (test 25);
• asked to create a sentence that included three given nouns, “Paris, river, fortune” (test 26); and
• asked to decide what should be done in a variety of social situations (test 27).

This first iteration of the Binet–Simon scale depicted in Table 5.1 was publicly introduced in 1905 in L’Année Psychologique across three separate articles (Binet & Simon, 1905a/1916, 1905b/1916, 1905c/1916). The first article, “Upon the Necessity of Establishing a Scientific Diagnosis of Inferior States of Intelligence,” provided the practical impetus for the development of Binet’s psychological method and reviewed the historical literature regarding preexisting diagnostic approaches (the previously mentioned medical and pedagogical methods). The second article, “New Methods for the Diagnosis of the Intellectual Level of Subnormals,” described the content of the actual tests to be used and the procedures that were recommended for their administration. It was in the third article, “Application of the New Methods to the Diagnosis of the Intellectual Level Among Normal and Subnormal Children in Institutions and in the Primary Schools,” that Binet presented the empirical results from applying the psychological method and in the process introduced the innovation that turned the Binet–Simon tests into the Binet–Simon measuring scale: the link between test performance and age. Through his network of connections at La Société, Binet had been able to get access to a small sample of children from working-class Parisian households who were attending public schools. The results reported in Binet and Simon (1905c/1916) came from 50 children evenly split into groups of 10 for the ages of 3, 5, 7, 9, and 11. The distribution of performance across tests in each group provided a normative baseline for comparison with the performance of any single child of the same age. If the tests had been successfully designed, and under the premise that intelligence is in a state of development during the years of primary schooling, Binet reasoned that it should be the case that age and the ease of successfully completing a test should be positively associated.
If so, to establish a scale in temporal units, each test could be “located” according to the lowest age group likely to pass it. Binet’s estimates for the age norm to be associated with each test are shown in the last column of Table 5.1. With this in hand, the measuring procedure to be employed was as follows: Consider the child’s intellectual level unknown. If a test has been failed that is located at or below the child’s chronological age, it provides evidence of a deficit in the


child’s intellectual level. If a child passes the tests that have been located above the child’s chronological age, it provides evidence of an advance in the child’s intellectual level. In both instances, deficit or advance is always relative to the normative performance of children at the same age. The diagnosis of a child’s intelligence still required a subjective judgment based on the pattern of performance across tests but one that was premised on the cumulation of normative comparisons. The details of the methods used to select both tests and samples of “normal” children were typically a bit sketchy in Binet’s writing.10 In assessing Binet’s influence in the years after his death, one reviewer noted that “[i]n studying his papers on measuring the intelligence of children the reader is oftentimes at a loss to know just what experimental work was done or how the resulting judgments were arrived at” (Bell, 1916, 612). In any case, for the 1905 age scale, the results could only be taken with a grain of salt given the small samples of children. Binet and Simon were in the early stages of “groping about” for a workable procedure, and this was largely still the case by the time of the scale’s third iteration in 1911. Surely Binet knew it was an exaggeration to refer to what he and Simon had created as a “measuring scale.” But just as surely, he believed they were onto something. What they had invented was in some sense a more efficient combination of the medical and pedagogical methods of classification. It was a diagnostic examination, but the basis for diagnosis did not rest solely with the skill and proclivities of the clinician but also in the quality of the evidence elicited by the tests and the normative contrast.
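The location-and-comparison logic just described can be sketched in Python. This is a hypothetical illustration rather than Binet’s actual procedure: the 0.5 pass-rate threshold, the function names, and the sample data are all assumptions, since Binet’s placements rested partly on judgment.

```python
def locate_test(pass_rates, threshold=0.5):
    """Locate a test at the lowest age group meeting the pass threshold.

    pass_rates maps age -> proportion of the normative age group passing
    the test. The 0.5 threshold is an assumption.
    """
    for age in sorted(pass_rates):
        if pass_rates[age] >= threshold:
            return age
    return None  # no age group in the sample reliably passes the test

def classify(child_age, results, locations):
    """Flag deficits and advances relative to age-located tests.

    results maps test -> True (passed) / False (failed);
    locations maps test -> the age at which the test is located.
    Failing a test at or below one's age signals a deficit; passing a
    test above one's age signals an advance.
    """
    deficits = [t for t, passed in results.items()
                if not passed and locations[t] <= child_age]
    advances = [t for t, passed in results.items()
                if passed and locations[t] > child_age]
    return deficits, advances

# Hypothetical pass rates for one test across normative age groups.
print(locate_test({5: 0.1, 7: 0.3, 9: 0.7, 11: 0.9}))  # 9
```

As in Binet’s procedure, the final diagnosis would still weigh these flags qualitatively rather than summing them mechanically.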
Beyond the specifics of this pilot procedure, Binet and Simon were also providing a proof of concept for a principle of mental test construction: that the difficulty of a test should be predictable according to two different criteria, the complexity of the mental processes the test has been designed to elicit, and the age of the child taking it. The complexity intended for the test served as the hypothesis, and the age at which a child was able to complete the test became an empirical test of the hypothesis. That Binet had recognized the need to use a child’s age as an external developmental criterion may seem obvious in hindsight, but in the context of diagnosing mental disabilities in school settings, where children had been grouped by grade, it represented a major breakthrough. Why? Because the ages of children in the same grade could vary considerably (students were held back or advanced across grades on the basis of their performance, not their age), and in a previous study, Binet had discovered that differences in the ages of students within the same grade could confound real differences evident in the ability of students to memorize lines of written prose. Once he controlled for age, much larger differences in memory across grades became apparent (see Wolf, 1973, 165–167).


We now have seen two of the three key elements of Binet’s approach: the tests that were to be administered, and the frame of reference, age, that was to be used as the basis for classifications. What we are missing is the element in between the test and the classification: the method of scoring test responses. Here, Binet’s focus was not so much on whether a child could answer a test question “correctly” (although many questions did indeed have correct answers) but in analyzing the types of errors that students made when they encountered a challenging task. It was in such encounters that Binet looked for evidence of what it was that caused a child to come to an ill-conceived judgment. For example, there were 25 unique questions nested within test 27, some of which were designed to be easy to answer (When one is sleepy, what should one do?) and some which were designed to be hard (What should one do when one has committed a wrong act which is irreparable?). Four questions written to be of “intermediate” difficulty included

1. When one has need of good advice—what must one do?
2. Before making a decision about a very important affair—what must one do?
3. When anyone has offended you and asks you to excuse him—what ought you do?
4. When one asks your opinion of someone whom you know only a little—what ought you say?

The answers to these questions were intentionally designed to be culturally dependent to reflect Binet’s sense of the social environment to which a Parisian child would be expected to adapt. It is in this sense important to appreciate that culture and intelligence were intertwined in the Binet–Simon scale—just as they would remain intertwined when Terman would eventually create the Stanford–Binet revision for use in the United States. The qualitative analysis of the answers children gave to test questions was a defining feature of Binet’s approach to the “scoring” of responses. Of special interest was the difference in the amount and the kind of errors children made to the same test across age groupings. Binet focused on silences (nonresponses) and answers that were perceived to indicate clear evidence that a child was unable to comprehend the gist of the question (Binet referred to these as “absurd” replies). When posing the four questions listed earlier from test 27 to his sample of “normal” Parisian children, he found that 7-year-old children were likely to supply the occasional absurd reply (e.g., “I would do nothing”) or remain silent. In contrast, silences were rare among 9- and 11-year-old children, and absurd replies almost nonexistent. Table 5.2 presents excerpts related to Binet’s proof of concept diagnoses of three children (Martin, Raynaud, and Ernest). The three children had been sent to Binet without any other information other than their ages, which were 12, 11,

TABLE 5.2 Excerpts From Binet’s Diagnosis of Three Children Using the 1905 Binet–Simon Scale

Martin
His memory for numbers has two characteristics: in appearance it is normal, because he succeeds in repeating exactly, a series of 5 numbers, as do certain children of eleven years [test 19]; he is therefore from this point of view almost normal, slightly inferior, however, because the normal repeats 6. But it is characteristic of him that he judges very poorly the corrections of his reproductions. We require him to say, “That is right” when the repetition has been correct, and “That is not right” when it has been incorrect. In the first place—a fact that is important—he does not submit to this convention, and he must be reminded of it about 8 times before he begins to give the signal spontaneously—and again he often fails. These lapses prove to us the difficulty he has in learning. In the second place we learn he is truly an optimist. He believes that he has correctly replied in many cases where he has deceived himself. Six times he declares “That is right,” when he was wrong and only 4 times did he admit he was wrong, and in 3 of these he had said nothing at all. These are indeed characteristic errors, where the absence of attention borders closely an absence of judgment. Such curious cases require careful study. We suppose that by a strong appeal to the attention, by long training one might succeed in arousing this lagged judgment. But that would no longer be an examination, it would be education. (Binet & Simon, 1905c/1916, 173–174)

Raynaud
He has a poor memory, this is one of his weak points . . . he has an excellent sensorial intelligence. Without doubt it is this which should be cultivated in him rather than yoking him to abstract notions . . . the sensorial intelligence of Raynaud is better than that of Martin, and the proof is the skill with which he arranges the weights [test 22], and draws the cut in the paper [test 29]. If Martin is a low grade moron type, Raynaud represents a type of intellect a little higher. (Binet & Simon, 1905c/1916, 179)

Ernest
He is in the same class as Raynaud only slightly less marked. For abstract questions [test 27] he has 5 absurdities, 1 silence, and 10 replies marked 3 or better. This is a little better than Raynaud gave us. But one sees at the same time that this is the level of children of seven years; and moreover normal children of seven years are more prudent; when they do not understand, they keep silence and here is truly a condition where silence is golden. (Binet & Simon, 1905c/1916, 180)

and 11, respectively. Several hallmarks of Binet’s approach become visible in these excerpts:

• Although he would conclude that all three children had intellectual disabilities, the disabilities were given important qualitative distinctions, some of which (in the case of Martin and Raynaud) led Binet to offer unique suggestions about the interventions to which each child should be exposed.
• The tests were not used at this juncture to generate a total numeric score but were analyzed individually for patterns within and across Binet’s three different test categories (memory, sensory, abstract language).
• Binet’s qualitative focus on differences in errors of judgment is easy enough to spot; what he found most revealing of Martin’s disability was test 19, in which the child struggled to repeat back disordered sequences of six numbers. The problem was not so much that Martin had failed to memorize the numbers but that he was consistently unable to recognize when his repeated sequence was wrong. Raynaud was, in one sense, quite similar to Martin in that he struggled with tests of memory, but unlike Martin he did well on two of the more difficult sensory tests (tests 22 and 29).
• Binet uses the results from his normative sample as a basis of comparison for each child. All three children are found to struggle with the abstract questions of test 27, answering with multiple absurd replies typical of the responses of 7-year-olds in the normative sample.

5.3.3 The 1908 and 1911 Revisions

By the time Binet and Simon had completed a revision to the scale in 1908, the total number of tests had more than doubled from 30 to 64. Of the original 30 tests, only seven had been dropped; the rest were kept the same or modified. A common modification Binet and Simon adopted was to take what had been a single test and to expand it into multiple tests by varying a feature of the test’s difficulty. For example, two of the memory tests from the 1905 scale (test 11, involving memory of numeric digits, and test 15, involving the memory of a sentence) were split into different locations along the 1908 age scale by varying the number of digits and syllables in the sequences a child was asked to repeat. Half of the 64 tests on the 1908 scale were new, and a majority of these (22 out of 32) fell into two new categories with respect to the unique mental ability that they were targeting: basic numeracy and acquired knowledge. The seven new numeracy tests, ranging in location from ages 5 to 9, required a child to count coins (up to 4 at age 5; up to 13 at age 7; up to 9 using different denominations of coins at age 8), count the fingers on each hand (age 7), count backward from 20 to 0 (age 8), make change from coins (age 9), and distinguish between nine different varieties of French coins. Many of the 15 new tests of acquired knowledge focused on the recognition of basic facts central to a child’s identity (asking a child to provide their family name, age, and sex) or facts related to an understanding of time (asking a child to distinguish between morning and evening, to identify the date, and to name, in order, the days of the week and the months of the year). One new test required a child to make an aesthetic judgment about beauty when presented with three pairs of women’s heads drawn on pieces of paper.
Binet had also introduced four new tests between the ages of 7 and 9 in which a child was asked to copy a written sentence (age 7), write from dictation (age 8), and retain two (age 8) and six (age 9) memories after reading a short passage. He included these tests because in their focus on written


communication, they were best positioned to distinguish the threshold between “imbecile” and “moron” categorizations among institutionalized adults. But he would eventually remove them from his 1911 revision because children in school settings were likely to have practiced these kinds of tasks as part of daily instruction. More generally, this distinction between psychological intelligence, on one hand, and knowledge that has been attained through scholastic or cultural exposure, on the other, posed a thorny issue. Binet recognized that success on his tests could depend on a variety of factors: The result depends: first on the intelligence pure and simple; second, on extra-scholastic acquisition capable of being gained precociously; third, on scholastic acquisitions made at a fixed date; fourth, on acquisitions related to language and vocabulary, which are at once scholastic and extra-scholastic, depending partly on the school and partly on the family circumstances. (Binet & Simon, 1908/1916, 259) Still, he hoped that the use of his collection of tests would make it possible to “free a beautiful native intelligence from the trammels of the school” (Binet & Simon, 1908/1916, 259). Of course, success on almost all Binet’s tests could be influenced by scholastic and extra-scholastic exposure. But this in itself was not an argument for removing a test; the key was whether the influence was due to differences in scholastic and extra-scholastic exposure to which the typical French child could be expected to have access. For example, even though children clearly receive instruction and practice counting from the earliest grades of school, Binet would argue that all French children, irrespective of socioeconomic advantage, would have had opportunities to learn to count.
By the time of Binet’s 1911 revision, the loose connections that had been present for the 1905 scale between groups of tests and ages spaced by two years had now been replaced by collections of five tests per unique age. This finer gradation of tests by age had been established from a larger norming sample of 203 Parisian children between the ages of 3 and 13 in the 1908 revision and then rearranged somewhat on the basis of empirical results from other investigators and from Binet himself by 1911.11 Changes to the scale for the 1911 revision generally focused on modifications of the locations of tests on the upper end of the age scale. Tests that had been previously associated with ages 11, 12, and 13 were shifted to ages 12, 15, and a new “adult” category (presumably associated with ages 16 and higher). Binet also added two new tests to flesh out this adult category. These changes left the scale with discontinuities at ages 11, 13, and 14. Figure 5.2 depicts the evolution of the Binet–Simon scale from 1905 to 1911.

FIGURE 5.2 Changes to the Composition of the Binet–Simon Measuring Scales from 1905 to 1911. [The 1905, 1908, and 1911 age-scale grids are not recoverable from the source.]

Note: Each test required one or more of four abilities: comprehension, direction, invention, and censure. Distinguishing requirements of tests beyond these abilities were B = basic communication; LC = understanding of concrete language; LA = understanding of abstract language and reasoning; M = memory; S = sensory discrimination; MI = mental imagery; N = numeracy; A = acquired cultural knowledge.


As had been true in 1905, the upper end of the 1908 and 1911 Binet–Simon scales continued to be primarily defined by tests of abstract verbal reasoning. An example of a new test along these lines that Binet and Simon first included for the 1908 revision asked a child to critique five sentences read aloud. The child was warned in advance to “listen attentively and tell me every time what there is that is silly.” An example of a sentence that would follow: “Somebody used to say: If in a moment of despair I should commit suicide, I should choose Friday, because Friday is an unlucky day and it would bring me ill luck.”12 Binet had initially found that no children at age 9 could detect the absurdities in three of the four sentences, a quarter could do so at age 10, and half could manage at age 11. The test was located at age 10 on his scale because this was the earliest age at which at least some children in the normative sample could complete the test successfully. The battery of tests that comprised the final 1911 scale is depicted in Table 5.3. A design principle of Binet and Simon that became all the more apparent

TABLE 5.3 The 1911 Binet–Simon Measuring Scale of Intelligence

Age 3
1. Points to nose, eyes, and mouth
2. Repeats two digits
3. Enumerates objects in a picture
4. Gives family name
5. Repeats a sentence of six syllables

Age 4
6. Identifies his or her sex
7. Names key, knife, and penny
8. Repeats three digits
9. Compares two lines

Age 5
10. Compares two weights
11. Copies a square
12. Repeats a sentence of ten syllables
13. Counts four pennies
14. Unites the halves of a divided rectangle

Age 6
15. Distinguishes between morning and afternoon
16. Defines familiar words in terms of use
17. Copies a diamond
18. Counts thirteen pennies
19. Distinguishes pictures of ugly and pretty faces

Age 7
20. Shows right hand and left ear
21. Describes a picture
22. Executes three instructions, given simultaneously
23. Counts the value of six coins, three of which are the same
24. Names four cardinal colors

Age 8
25. Compares two objects from memory
26. Counts from 20 to 0
27. Notes omissions from pictures
28. Gives day and date
29. Repeats five digits

Age 9
30. Gives change from 20 sous
31. Defines familiar words in terms of superior use
32. Recognizes all the pieces of money
33. Names the months of the year in order
34. Answers easy abstract questions

Age 10
35. Arranges five blocks in order of weight
36. Copies drawings from memory
37. Criticizes absurd statements
38. Answers difficult abstract questions
39. Can write two sentences that contain three given words

Age 12
40. Resists suggestion while comparing lines
41. Composes one sentence containing three given words
42. Names 60 words in three minutes
43. Defines certain abstract words
44. Discovers the sense of a disarranged sentence

Age 15
45. Repeats seven digits
46. Finds three rhymes for a given word
47. Repeats a sentence of 26 syllables
48. Interprets pictures
49. Solves a problem composed of several facts

Adult
50. Solves the paper-cutting task
51. Rearranges a triangle in imagination
52. Gives differences between pairs of abstract terms
53. Gives three differences between a president and a king
54. Gives main idea of paragraph read aloud

by the time of the 1908 revision was what Spearman would later name as the “hotchpot” (or hodgepodge) approach. That is, the idea was to choose tests such that they sample as broadly as possible from a variety of specific abilities, which alone or in combination could explain individual differences in intelligence. Some tests should focus on the ability to draw on short-term memory to complete a plan; others should focus on comprehension of the challenge posed by a novel task and the invention of an appropriate response; yet others should require mental imagery and manual dexterity. By the 1908 revision, one could identify at least eight different categories of abilities Binet’s tests


were targeted to elicit (see Figure 5.2), and in almost every case, these abilities interacted with one or more of comprehension, invention, direction, and censure. The second design principle Binet and Simon continued to emphasize was that the tests must sample not just from a range of intellectual abilities but should also allow for a differentiation in a child’s level of sophistication with each ability. This required tests that distinguished between different degrees of ability, something that would only be possible if the tester had, or could establish, a theory for how the different abilities that underlie intelligence vary in sophistication. This was something Binet was able to accomplish by manipulating features of some tests to make them easier or harder to complete, and/or by distinguishing between the features of a more or less sophisticated response. One of the best examples of the latter was a test Binet and Simon had introduced for the 1908 scale in which a child was presented with three different paintings, each containing persons and a theme, and asked, “What is this?” or “Tell me what you see here.” Figure 5.3 depicts the first painting used for this test as part of the 1908 scale.

FIGURE 5.3 Test on the 1908 Binet–Simon Scale Used to Distinguish Children at Ages 3, 7, and 12.


Binet and Simon distinguished between three hierarchically ordered categories of responses to this prompt:

• those that solely involved the recognition and identification of objects within the painting (typical of 3-year-old children);
• those that involved not only recognition and identification of objects but also a description of a relationship between the objects (typical of 7-year-old children); and
• those that offered an interpretation that went beyond what was directly observable in the painting (typical of 12-year-old children in 1908 but modified to 15 years in the 1911 revision).

Hence, this same test could be located at three different ages on the scale as a function of the quality of a child’s response. The design of tests such as these represents one of the first examples of an empirically derived hypothesis of cognitive development, and it predated the work of Jean Piaget by almost 20 years (see Piaget, 1926, 1928). Yet, for all the thought given to the range and depth of intellectual abilities to be represented by his tests, when one considers its subsequent influence, the most consequential change introduced in the 1908 and 1911 revisions was a method for quantifying a child’s general intellectual level. With a more finely graded age scale now in place, the method was best characterized by its simplicity, as in general, it involved nothing more than a counting of tests that had been passed. By the 1908 revision, Binet had made dichotomous scoring rules for a correct test response more explicit (although he continued to recommend that the interviewer record additional codes to indicate distinct features of incorrect responses). The procedure for estimating a child’s intellectual level was to begin by administering a child the tests located at the child’s present chronological age. From there, the interviewer would administer easier or harder tests until an age could be found where the child could pass all the associated tests. Next, the child was given a credit of 1/5 of a year for each additional test at a higher age that had been completed successfully (since there were generally five tests at each age interval). The resulting intellectual level was a number constructed to resemble a unit of time, and it could then be compared to a child’s chronological age in actual units of time. The difference, rounded to the nearest integer, was to be taken as a measure of the amount of a child’s intellectual retardation or advance.
As an example, an 8-year-old child who passed all the tests through age 6, along with two at age 7 and another two at age 8, would have an intellectual level of 6.8, which would correspond to a retardation of −1. In his normative samples, Binet found differences of ±1 to be fairly common. Differences of ±2 were rare, found among only 7% of the 1908 norming sample. So, as a rule of thumb, Binet suggested that children with intellectual levels two or more years below their chronological age would be those for whom an intervention would be most warranted.
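The arithmetic of Binet’s 1908 scoring rule is simple enough to sketch in code. This is a minimal illustration, not Binet’s own notation: the function names and parameters are our assumptions, while the 1/5-year credit per extra test and the rounded difference from chronological age follow the procedure just described.

```python
def intellectual_level(basal_age, extra_tests_passed, tests_per_age=5):
    """Binet's 1908 rule: the highest age at which every test is passed,
    plus 1/5 of a year for each test passed beyond that age."""
    return basal_age + extra_tests_passed / tests_per_age


def retardation_or_advance(level, chronological_age):
    """Difference from chronological age, rounded to the nearest integer;
    negative values indicate retardation, positive values an advance."""
    return round(level - chronological_age)


# The 8-year-old from the example: passes everything through age 6,
# plus two tests at age 7 and two at age 8.
level = intellectual_level(basal_age=6, extra_tests_passed=4)
print(level)                             # 6.8
print(retardation_or_advance(level, 8))  # -1
```

Note that the rounding step is what makes a child with level 6.8 and age 8 a case of “retardation −1” rather than −1.2, matching Binet’s practice of reporting whole years.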

162 Mental Tests and Measuring Scales

With the 1908 and 1911 revisions to the scale, what had originally been developed for the specific purpose of diagnosing intellectual disabilities among children attending public schools was now evolving into a tool that Binet viewed as one with the potential for broader utility:

Our principal conclusion is that we actually possess an instrument which allows us to measure the intellectual development of young children whose age is included between three and twelve years. The method appears to us practical, convenient and rapid. (Binet & Simon, 1908/1916, 261)

In other words, Binet came to appreciate that his method could apply beyond the context of diagnosing learning disabilities, and sure enough, this led Terman to focus attention on the use of mental tests to identify children at the other end of the continuum, those who displayed evidence of “genius” by outperforming their chronological age.

5.3.4 The Role of Education

A defining aspect of Binet’s research between the first and third iterations of the Binet–Simon scale, and an aspect that most distinguished Binet from hereditarians such as Galton and Terman, was the connections he drew between the study of individual differences, human intelligence, and education. In particular, Binet held the conviction that a child’s intelligence was malleable through effective instruction. This perspective, already evident in his publications related to the Binet–Simon scale, was on clearest display in Binet’s engaging 1909 book Modern Ideas About Children. Because it is quite hard to find an English translation of this book, it is worth sharing some extensive excerpts of Binet (1909/1975) in his own words:

After the evil comes the remedy. After identifying all types of intellectual defects, let us pass on to their treatment. We shall suppose, in order to lay bare the problem, that one of our students has definitely been found to suffer from a distressing inability to understand what is going on in class. His judgment and his imagination are equally poor and, if he is not mentally deficient, he is at least considerably retarded in his educational development. What shall we do with him? What can we do for him? (105)

In Binet’s (1909/1975) strongest and oft-quoted statement about the potential for education to influence intellectual development, he wrote:

I have often observed, to my regret, that a widespread prejudice exists with regard to the educability of intelligence. The familiar proverb, “When one is stupid, it is for a long time” seems to be accepted indiscriminately by teachers with a stunted critical judgment. These teachers lose interest in students with low intelligence. Their lack of sympathy and respect is illustrated by their unrestrained comments in the presence of children: “This child will never achieve anything. . . . He is poorly endowed. . . . He is not intelligent at all.” I have heard such rash statements too often. . . . A few modern philosophers seem to lend their moral support to these deplorable verdicts when they assert that an individual’s intelligence must be a fixed quantity, a quantity which cannot be increased. We must protest and react against this brutal pessimism. We shall attempt to prove that it is without foundation. (105–106)

Binet sought out this proof within a laboratory school he had founded in Paris the same year that the original version of the Binet–Simon scale was first published. It was the first of its kind in Europe, and the impetus for founding the school had come from Binet’s collaborations with members of La Société. What interested Binet the most was to see what could be accomplished when a child’s instruction could be “cut to their individual measures” (Wolf, 1973, 300). To study this, Binet and his colleagues admitted a cohort of children who had been diagnosed with an intellectual disability. Now, if a person’s general intellectual level can be understood as the sum of different parts depending on the requirements of an intellectual task, then surely, Binet reasoned, efforts can be devoted to bolstering the parts to increase the whole.
Binet put this idea into action in his lab school by engaging children in an ongoing program of “mental orthopedics.” To increase the capacity to pay attention and exert self-control, children would play a game in which they would be told to freeze in position in the pose of a statue until told to stop; to improve memory, they would be shown a collection of objects and then attempt to recall as many as possible; to improve will power, they would see how long they could squeeze a dynamometer without letting go; to improve speed and manual dexterity, the children would practice making as many dots on a piece of paper as they could in 10 seconds; and so on:

With practice, training, and above all, method, we manage to increase our attention, our memory, our judgment and literally to become more intelligent than we were before. Improvement goes on until we reach our limit. (Binet, 1909/1975, 107)

Teachers of these special classes did more than just practice mental orthopedics with the children; they were expected to meet the children where they were and to adjust their teaching to each child’s level of understanding. To this end, the teachers in the school followed what Binet considered the “greatest principle of pedagogy: proceed from the easy to the difficult.” In Modern Ideas About Children, Binet anticipates the work of both Piaget and Vygotsky in his emphasis on finding a child’s stage, or level, of development. Effective instruction would then provide students with just enough challenge to be stimulating but not so much that it becomes discouraging. In the views he expressed about teaching, Binet showed himself to be a progressive educational reformer in the mold of John Dewey, and in Modern Ideas (1909/1975), he bemoaned the traditional and passive university-inspired approach of lecture-based instruction because it did not give students opportunities to exercise judgment and to learn to adapt:

Above all, the students must be active. Teaching is bad if it leaves the student inactive and inert. Instruction must be a chain of intelligent reflexes, starting from the teacher, involving the student and coming back to the teacher. . . . Philosophically speaking, all intellectual life consists of acts of adaptation. And instruction consists of making a child perform acts of adaptation, easy ones at first, then more and more complex and perfect ones. . . . Walk into a classroom. If you see all the children motionless, listening effortlessly to a fidgety teacher engaged in long-winded discourse, or if you see children copying, writing the course the teacher is dictating to them, tell yourself that is bad pedagogy. (115)

According to Binet, after one year enrolled in the special classes at the laboratory school, the children had, on average, made the progress of two academic years in the span of one. By developing attention and self-control, and in response to individualized and active instruction, they were in a position to learn, showing growth instead of falling further and further behind. And it was the diagnostic application of the Binet–Simon scale that set the wheels in motion.
This was the idealized use case, one that stands in contrast to the use of mental tests to sort students into fixed curricular tracks and vocational aspirations.

5.4 Binet’s Conceptualization of Measurement

By the time of his 1908 revision, Binet can be said to have invented a measurement procedure—the administration of age-graded tests in a standardized interview setting—that could be used to generate a numeric value interpretable as a child’s general level of intelligence. Although Binet’s mental tests were, on the whole, quite different from the Galton–Cattell mental tests, they had one important and notable similarity: the results from the tests were expressed on a scale with a unit that had a physical referent. And not just any physical referent but time, the units of which—seconds, minutes, hours, days, months, and years—govern the functioning of human societies. Whether one was a doctor, a teacher, or a parent, the difference in mental functioning between a 5-year-old child and a 7-year-old child took on an intuitive meaning in the way that a difference between 40% and 60% of test questions answered correctly did not. Most human adults would conclude that a difference of two years represents a significant magnitude. In an educational context, we can think of all the learning and maturation that is possible over a two-year span. A difference of 20% in test questions answered correctly does much less to stir the imagination or prompt a call for action. But was the sense of magnitude suggested by these temporal units real or just an illusion? In his writing, Binet acknowledged that his measuring scale of intelligence differed from physical measurement in an important way:

This scale properly speaking does not permit the measure of the intelligence, because intellectual qualities are not super-posable, and therefore cannot be measured as linear surfaces are measured, but are on the contrary, a classification, a hierarchy among diverse intelligences; and for the necessities of practice this classification is equivalent to a measure. (Binet & Simon, 1905b/1916, 40–41)

Binet (1909/1975) expanded on this in Modern Ideas About Children:

Just what does the measurement of intelligence consist of? As in relation to instruction and physical development, the word “measurement” is not used here in a mathematical sense: it does not indicate the number of times a quantity is contained in another. For us the idea of measurement is closer to the ideal of hierarchical classification. The more intelligent of two children is the one whose performance is better on a certain kind of test. Moreover, taking into consideration the averages obtained in the testing of children of various ages, the measurement is established as a function of mental development.
And for intelligence, as for instruction or physical development, we measure in terms of the retardation or advance a given child has in relation to other children his age. (102)

Binet was acknowledging that intelligence was not additive in the same sense as an extensive attribute is additive. Beyond this, there is much that can be unpacked from his assertion that what his scale could produce was a “hierarchy among diverse intelligences” (see Michell, 2012b). The gist, however, was that Binet’s intellectual levels comprised, at best, an order, but the unit could not be given a consistent interpretation when comparing differences along the scale. A difference in intellectual levels between 5 and 6 was not intended to be commensurate with a difference in intellectual levels between 9 and 10.


In fact, Binet had found that even children who scored at the same intellectual level were invariably different in the combinations of tests completed successfully that put them at that level. It was possible that some of the abilities that influenced performance on Binet’s hodgepodge of tests were quantitative, in which case some differences in intellectual levels might have been attributable to differences in the magnitude of a single quantity (e.g., capacity of short-term memory); however, it was equally possible that the same differences could be explained by the presence or absence of many qualitatively distinct abilities of unknown structure. Children could be ordered by their success on the Binet–Simon tests, but it was a heterogeneous order, and it was an order that would be found to violate basic rules of transitivity. Binet acknowledged as much when he wrote:

it is evident that in whatever order we place the tests, we shall never be able to find any single test of such a nature that when this one has been passed, all the previous ones will also be successful, and all the following ones failures. . . . This order of tests might be established for one child in particular, but the same order would not be satisfactory for a second or a third. (Binet & Simon, 1908/1916, 242)

Still, for all his cautions that what he was doing was not measurement in a traditional sense, Binet was nevertheless convinced that there were real distinctions to be made between children who could and could not pass certain collections of tests. In locating these results relative to the integer demarcations of age, was the diagnosis of a mental disability better than many of the likely alternatives, such as a diagnosis based on abnormality in a child’s physical appearance or on a child’s performance on a school-based examination?
To the extent that a central purpose of measurement is the reduction of uncertainty about an attribute relative to alternative approaches, Binet clearly believed that this purpose was being accomplished. In presenting the 1908 revision, Binet shared the case of an 11-year-old child, Germaine, who had been denied access to a public school. Binet had been able to show through the application of his measuring scale that Germaine’s intellectual level was only one year behind the norm for 11-year-olds, a finding that was in stark contrast to evidence from the school-based examination she had taken. Binet’s method suggested that Germaine could catch up if provided with the right intervention. Although Binet did not argue that his measuring scale was a replacement for the judgments of teachers gathered over a much longer period, by 1911, the evidence from several studies convinced him that the basis for such judgments was frequently haphazard and that a teacher’s diagnosis of an individual child could easily be biased by any number of curricular and extracurricular factors. If faced with the need to make a yes-or-no decision as to whether a child required specialized educational services, Binet felt justified that while he may have only been offering “measurement,” it was fine to leave off the quotes:

“The Measurement of Intelligence” is, perhaps, the most oft repeated expression in psychology during these last few years. Some psychologists affirm that intelligence can be measured; others declare that it is impossible to measure intelligence. But there are still others, better informed, who ignore these theoretical discussions and apply themselves to the actual solving of the problem. (Binet & Simon, 1908/1916, 182)

As a measure, the classification into an intellectual level that resulted from the application of the Binet–Simon scale was intended to be a convenient starting point, as Binet was always most attuned to the qualitative insights that lay beneath the intellectual level designation. Instead of means, standard deviations, and correlation coefficients, Binet would frequently produce multipage descriptive profiles based not just on a child’s unique pattern of responses but also on observations jotted down during the course of the interview (recall the examples from Figure 5.1). All this was intended to provide the nuance necessary in contextualizing the comparison between a child’s intellectual level and chronological age. Binet was emphatic that any quantification would be worthless without an accompanying qualitative interpretation, writing that

in spite of the system of annotation which we have devised, we think it is the duty of the experimenter to judge, weigh and examine the replies. Our method is not an automatic weighing machine like those in railway stations, which register automatically the weight of a person, without his intervention or assistance. . . . The results of our examination have no value if deprived of all comment; they need to be interpreted.
(Binet & Simon, 1908/1916, 222, 239)

In all, Binet’s rationale for describing his procedure as an application of a measuring scale was filled with interesting contradictions. On one hand, he was well aware that his procedure only produced a rough classification, a classification with respect to many different abilities that underlie intelligence. On the other hand, it is rather fascinating that the main connection he sought to maintain with traditional measurement was in the development of a common scale with a recognized reference unit. Fechner had proposed psychological units in terms of just noticeable differences; Galton had popularized relative measurement in standard deviation units. In terms of practical influence, Binet outdid them both with his temporal units.


5.5 Criticisms

When Binet had said he was measuring intelligence, he had done so with a wink and a host of qualifications, the nature of which would have been apparent to anyone reading his work in its entirety. Much of this was literally lost in translation, because when the 1908 revision was initially translated13 into English by Henry Goddard in 1910, it had been condensed from 90 pages into 16. As use of the scale became increasingly widespread in the years that followed Binet’s death, reviews by Ayres (1911), Burt (1914a, 1914b), Rogers and McIntyre (1914), Thorndike (1914, 1916), Otis (1916), and Yerkes (1917) raised two interacting classes of criticisms. The first was specific to the content of the tests to be included in the scale; the second was specific to the interpretation of the age scale.

Binet had never established specific criteria for the design and selection of the tests that defined the Binet–Simon scale. Yes, he was purposely designing the tests so that each would require a heterogeneous combination of cognitive abilities, but the approach rested on Binet’s experience and intuition as well as trial and error, making the design a mix of science and art that would have been difficult for others to replicate. The most straightforward rationale Binet provided for the inclusion of a test was the empirical finding that, for a given norming sample, success on the tests was associated with age. Given this, it is no surprise that when Terman and his students began revising and expanding the Binet scale, they also relied upon a mix of intuition and empiricism to justify the contents of the test battery. There was therefore an inherent ambiguity in the content that was acceptable for inclusion on the tests.
Those looking for a theoretical basis for intelligence in the avalanche of tests that came to define the work of quantitative psychologists between 1910 and 1930 were barking up the wrong tree since, for practical purposes, “intelligence is what the tests test” (Boring, 1923).14 One comprehensive attempt to address this problem of ambiguous content can be found within Thorndike’s conceptualization of intelligence as a combination of intellectual level (“altitude”), intellectual range (“width”), and speed (Thorndike, Woodyard, Cobb, & Bregman, 1927). To measure one or more of these dimensions, Thorndike proposed that tasks could be effectively sampled from four categories that provided an operational definition of intelligence, with tasks within each category designed according to a hierarchical order of difficulty. In each category for a given intellectual level, a subject would be expected to solve multiple tasks in which they would be asked to (1) supply the missing words to make a statement true and sensible, (2) solve mathematical problems, (3) understand the meaning of single words, and (4) comprehend oral or written discourse.

It was also an open question whether the Binet–Simon scale could be expected to produce reliable results.15 Would one come to the same inference about a child’s intellectual level if the tests were to be administered by the same interviewer on a different occasion? Would the results be largely the same if the interview were to be conducted by a different interviewer? To what extent did the number of tests and their order of administration matter (since Binet had largely left this to the discretion of the tester)? These were questions about sources that might contribute error to the procedure. Binet was just beginning to investigate some of these issues as part of his 1911 revision, and he was under no illusion that the choice of occasion, the setting, the tester, the number of tests, and the test order could be considered negligible factors.

Putting to the side the ambiguity related to the content of the Binet–Simon scale and possible concerns about the reliability of its numeric results, what could be said about the properties of the scale itself? Three questions stood out. First, what methods and criteria should be used to assign a test to a specific age? Second, just how sensitive was the test–age link to the selection of children in the norming sample? Third, under what conditions could the scale be used to compare the magnitude of intelligence found for different children and for groups of children (e.g., boys vs. girls, poor vs. rich)? With respect to the first two of these questions, Binet and Simon had never established any single preferred method to link tests to an age location on the scale,16 and by the time of his 1911 revision, Binet was aware that the locations of tests could be sensitive to the cultural background and experience of the underlying norming sample. Given this, with respect to the third question, there would have been no good reason to presume that the scale could be used to make inferences about magnitudes of intelligence. Binet, as we have seen, had never claimed that the units of the scale marked out equal intervals of intelligence.
Regarding the conceptual and technical questions about measurement that emerged from Binet’s work, a good case can be made that the proper historical contemporaries to consider as a contrast are not Henry Goddard, Lewis Terman, and Robert Yerkes but Charles Spearman, Edward Thorndike, and Louis Thurstone. It was Spearman who proposed a general theory of intelligence and a statistical framework for studying it that he believed could explain the apparent success of Binet’s hodgepodge approach to test design. We shall explore Spearman’s ideas and contributions along these lines in the following three chapters. Thorndike, for his part, was unwilling to grant what was at best a classificatory procedure the status of measurement. If it were to be a measuring scale, that scale would need to have intervals marked off in equal units, and an effort should also be made to establish the origin of the scale. The exploration and evaluation of different methods that might be used toward this end form the crux of chapters 2 through 12 of the 1927 book The Measurement of Intelligence (Thorndike et al., 1927). The method Thorndike considered the most promising was essentially an application of Galton’s relative measurement approach. Tasks could be ranked by the proportion of test takers who answered them correctly, and through the application of the inverse cumulative normal distribution, the ranks could be converted into standard unit deviates. It would then be a separate job to decide on the proper location of the scale origin.

Because the procedure depended on the Galtonian assumption that intelligence was both quantitative and normally distributed, a major contribution of Thorndike’s work was to build a better empirical case that intelligence test scores tend to be normally distributed. He did this by collecting data on 11 preexisting intelligence tests that had all been administered to sixth- and ninth-grade students in different cities across the United States, with sample sizes of test takers ranging from a few hundred to nearly 6,000. The plots of the frequency distributions of each of these tests could be relatively well approximated by a normal curve, and when each was expressed in standard units and combined into a single frequency distribution, the fit to the normal curve was that much stronger. Thorndike was also able to show that for a subsample of sixth-grade students who had taken six different intelligence tests, as scores were successively cumulated, the symmetry and spread of the resulting distribution stayed about the same. In doing this, Thorndike argued that he could rule out normality as an artifact of measurement error in any single test administration by the logic that if this were the cause, then the cumulated test scores should appear less symmetric, with less variability. An alternative rationale for establishing the unit for a scale of intellectual level was now falling into place. The most complete and influential realization of this rationale for educational and psychological “scaling” would come from a series of seminal contributions by Louis Thurstone introducing and applying a method of “absolute scaling” (Thurstone, 1925, 1926, 1927c, 1928a, 1929).
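The conversion at the heart of this Galton–Thorndike approach, from proportion correct to a normal-deviate location, can be sketched with the Python standard library. The function name and the sign convention (harder tasks receive larger positive deviates) are our illustrative assumptions; locating the origin of the scale remains, as noted, a separate job.

```python
from statistics import NormalDist

def difficulty_in_sigma_units(p_correct):
    """Place a task on a standard-deviation scale from the proportion of
    test takers who answered it correctly, assuming the underlying
    attribute is normally distributed. A task that half the sample passes
    sits at 0; rarely passed (harder) tasks get larger positive deviates."""
    return NormalDist().inv_cdf(1 - p_correct)

# Three tasks ranked by proportion correct, easiest to hardest:
for p in (0.84, 0.50, 0.16):
    print(round(difficulty_in_sigma_units(p), 2))
```

A task passed by 84% of the sample thus lands about one standard deviation below the task passed by only 16%, which is exactly the sense in which the ranks become “standard unit deviates.”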

5.6 Binet’s Legacy

How is it that some new ideas take root and others die on the vine? It surely helps when the idea in question appears to offer an expedient solution to a widely perceived problem, and the Binet–Simon scale fit the bill. But Binet also benefited from some serendipity. Had he been left to his own devices, it seems unlikely that the Binet–Simon scale would have attracted a great deal of attention. Binet was prolific when it came to publishing his work within French outlets, but as an academic outsider, he lacked a network of colleagues and students to help him disseminate it, and he never traveled to present his work at international conferences. As Siegler (1992, 186) puts it, “Binet’s product was strong, but his marketing was weak.” The serendipity came in the form of the American psychologist Henry Goddard, a former student of G. Stanley Hall, one of America’s most prominent educational psychologists. Goddard, newly appointed in 1906 as the director of the Vineland Research Laboratory for the study of “feeble-mindedness,” had been searching the extant literature for approaches that could be implemented to diagnose a mental disability. Somehow L’année and Binet’s 1905 articles did not find their way into this search. Binet’s work was first brought to Goddard’s attention two years later, in the spring of 1908, while he was visiting Europe to meet with Ovide Decroly, a Belgian psychologist who was directing a neurological clinic for the study of children with physical and mental disabilities. Decroly’s student Julia Degand had just applied the Binet–Simon scale to a sample of Belgian children attending private schools, and the two were in the process of offering up a critique17 based on their findings (Decroly & Degand, 1910). After becoming acquainted with Binet’s approach and giving it a try with the inmates at the Vineland Research Laboratory in 1909, Goddard (1916) overcame some initial skepticism to become, along with Terman, one of the scale’s most influential proponents in the United States:

It seemed impossible to grade intelligence in that way. It was too easy, too simple. . . . Our use of the scale was a surprise and a gratification. It met our needs. A classification of our children based on the Scale agreed with the Institution experience. (5)

Goddard immediately had an abbreviated version of Binet and Simon’s article presenting the 1908 revision translated into English in 1910, and his first large-scale application of the scale was published the same year (Goddard, 1910). The application, discussion, and critique of the Binet–Simon scale accelerated so rapidly from there that already in 1912 a review of literature inspired by the scale began with the statement that “[p]erhaps no device pertaining to education has ever risen to such sudden prominence in public interest throughout the world as the Binet-Simon measuring scale of intelligence” (Bell, 1912, 102). By 1914, a bibliography of literature related to the Binet–Simon scale included 254 citations, and it was being used in Canada, England, Australia, New Zealand, South Africa, Germany, Switzerland, Italy, Russia, China, Japan, and Turkey.
Prominent studies were published by Goddard (1911), Terman (1911), Terman and Childs (1912), and Johnston (1911) in the United States; Jeronutti (1912) in Italy; Decroly and Degand (1910) in Belgium; Bobertag (1911) in Germany; and Rogers and McIntyre (1914) in Scotland. As noted at the outset of this chapter, although it had been Goddard who introduced Binet’s approach in the United States, it was Terman who did the most to first adapt and then transform it so that it could become the basis for a national testing movement heralded by the Army Alpha and Beta exams. Terman had adapted Binet’s age scale for the American context by using new and larger norming samples and by adding new tests, especially ones targeted to the ages between 11 and 16. Terman had also adopted the recommendation made by the German psychologist William Stern (1913) that instead of examining the difference between a child’s intellectual level and chronological age, it was more sensible to examine the ratio, since a deficit of two years for a child at the age of 6 should not be given the same interpretation as the same deficit for a child at the age of 12. Taken as a ratio of chronological age, the deviation observed for the 6-year-old was twice that of the deviation for the 12-year-old.

Binet is most famous for three features that characterized the last two iterations of the Binet–Simon measuring scale of intelligence: the use of tests designed to sample a heterogeneous mix of cognitive abilities, the use of a normative sample to establish a hierarchical ordering of tests in terms of difficulty, and the use of age as the unit in which results could be conveyed. These were, of course, notable accomplishments that broke with the conventions of the time. However, they miss the larger gestalt of Binet’s ambition to build a unified account of human psychology. To do this required a program of research in which it was necessary to understand the mechanisms and complexities underlying psychological phenomena, and this is what led him to take an approach to measurement that emphasized diagnosis. Binet can be considered the first psychologist to treat a procedure that resulted in classification as comprising, for practical purposes, an act of measurement. What is less well appreciated about Binet is the extent to which age-scale classifications were intended to be complemented by qualitative insights gleaned from analysis of the kinds of errors a child would make on tests that were failed. If the purpose of administering the tests was to decide on an educational course of action, this required a comprehensive approach of triangulating information gathered from the longer-term observations of parents and teachers, the results from both psychological and pedagogical tests, and even physical measurements taken by medical doctors. The diagnosis was intended to reveal for each child a profile of strengths and weaknesses, and these could be used to tailor an appropriate instructional intervention.
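The contrast between Binet’s difference score and Stern’s ratio, described earlier in this section, can be put in a few lines. The function names are our illustration; the scaling of the ratio by 100 is Terman’s later IQ convention rather than anything in Stern (1913).

```python
def difference_score(mental_age, chronological_age):
    """Binet's comparison: deficit or advance, in years."""
    return mental_age - chronological_age

def mental_quotient(mental_age, chronological_age):
    """Stern's ratio; multiplied by 100 it became Terman's IQ."""
    return mental_age / chronological_age

# A two-year deficit at age 6 versus the same deficit at age 12:
print(difference_score(4, 6), difference_score(10, 12))    # -2 -2
print(round(mental_quotient(4, 6), 2),
      round(mental_quotient(10, 12), 2))                   # 0.67 0.83
```

The difference score treats the two children identically, while the quotient registers that, relative to chronological age, the 6-year-old’s deviation is twice the 12-year-old’s.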
Although the Binet–Simon scale did solve one problem as a replacement or complement to some of the more idiosyncratic methods that had been used to diagnose children (and adults) with mild to moderate intellectual disabilities, it created new ones with consequences that Binet might not have anticipated. For all his warnings as to the danger of quantifcation without qualitative interpretation, this was a bit like handing a 2-year-old child a lit match and then expressing shock when the child grabs at the fame. What attracted the attention of those who took up the Binet–Simon scale was the apparent ease of quantifcation. It was called a measuring scale of intelligence, and it produced a measure in units of time that sure looked a lot like the kinds of measures Galton, Pearson, and Yule were using for correlational research. Why restrict its use to diagnostic classifcation? Why not also use it to compare, sort, and evaluate the magnitude of diferences between groups of individuals? This was not, in fact, outside the bounds of what Binet himself had described as an eventual use for the scale (Binet & Simon, 1908/1916, 262–263). For all the ways that Binet’s work represented a break from Galton and Fechner, he nonetheless shared the same conviction that the concept of

Mental Tests and Measuring Scales

173

measurement could be applied to the context of psychological attributes. And much like Thorndike (1912), Binet held the conviction that education could and should be improved through the application of measurement and experimentation. In this sense, it places Binet on far too lofty a pedestal to regard him as a hapless bystander to the misguided applications and extensions of his approach by Henry Goddard and Lewis Terman. The moment that Binet decided to express intelligence on a scale marked off in age units, it was inevitable that the numbers would take on a life of their own and that, if not given careful attention, questions about the theoretical status of the underlying attribute and the properties of the scale could get lost in the shuffle. Binet was already on a somewhat slippery slope at the time of his death, having created a procedure for a partial classification but referring to it as if it were measurement. Had he lived another 20 years, would Binet have denounced the mental testing movement of the 1920s or helped shape it for the better? Would he have focused greater attention on the properties of his age units, or moved on to other pursuits? And if Binet had been transported into the 21st century, what would he have made of the tendency of modern educational researchers to express and compare growth in test score performance over time in terms of "weeks of learning" (e.g., Chiang et al., 2015)? In his A History of Experimental Psychology, Boring describes the 1880s as Galton's decade of greatest influence, the 1890s as Cattell's, and the 1900s as Binet's. Galton introduced correlational methods as the way to study individual differences, Cattell introduced mental tests as the measures of psychological attributes to which correlational methods were to be applied, and Binet succeeded in shifting attention to the measurement of the more complex mental processes on a scale of temporal units.
Tragically, with the world of psychology discussing and debating the uses and interpretations his instrument could afford, Alfred Binet fell ill and died at the height of his powers and influence. According to Wolf (1973), during the last week of his life, Binet is said to have remarked to Simon, "If I could only have had five more years!" In this sense, we can only echo the laconic sentiments of Boring (1950), who reflected, "It might have been useful to have had him live a little longer" (573).

5.7 Sources and Further Reading

The starting point to appreciate Alfred Binet's measuring scale of intelligence is the 1916 book The Development of Intelligence in Children. This book represents a compilation and translation of the five publications that originally appeared in L'année Psychologique: the three that introduced the Binet–Simon scale in 1905 and the two presenting its revisions in 1908 and 1911. The book, which includes an introduction by Henry Goddard, who arranged to have it assembled, was translated by Elizabeth S. Kite and can be downloaded for free.18 Binet was a prolific author; unfortunately for the monolingual English speaker, only a


relatively small fraction of his works have been translated from French.19 An important book to read that has been translated into English but that is difficult to find in print is Binet's book Modern Ideas About Children. Another resource for reading Binet in his own words is the book Experimental Psychology of Alfred Binet: Selected Papers by Robert Pollack and Margaret Brenner. The 14 publications contained in this book are all studies published between 1880 and 1903 that predate Binet's development of his measuring scale of intelligence, and they provide insights into his broader program of research and contributions to experimental psychology. For a review and translation of Binet and Henri's work on individual psychology into English, see Nicolas, Coubart, and Lubart (2014). For a deeper look at Binet's program of experimental research as director of the psychology laboratory at the Sorbonne, see Nicolas and Sanitioso (2012). The indispensable biography of Alfred Binet is Theta Wolf's book Alfred Binet. Two shorter but enlightening biographical and career retrospectives can be found in Siegler (1992), the introduction to Pollack and Brenner's book, and chapter 2 of Fancher (1985b). The rise of intelligence testing in the United States between 1910 and 1930 has been covered by others in great detail, which is one reason I do not go into it in more detail here. Two books that I would recommend are Stephen Jay Gould's The Mismeasure of Man and Michael Sokal's edited volume Psychological Testing and American Society: 1890–1930. It would be a mistake to only read the former and not the latter. Gould's book is engaging and thought-provoking, but it frequently fails to provide important historical context. The contributions found in Sokal's book provide some great insights into the roles of James Cattell, Robert Yerkes, Henry Goddard, and Lewis Terman in the rise of mental testing.
Of the figures who had the greatest influence during this period, Lewis Terman stands out for his lack of humility (a case in point was his performance during his published debates with the journalist Walter Lippmann). But even Terman is not easily rendered in black-and-white, and with this in mind, I could have written a chapter with him as the focal point. One reason I did not is that unlike the other figures covered in this book, Terman was more of an administrator than a methodological innovator. A good analysis of Terman and his influence on the use of testing in school settings can be found in Paul Davis Chapman's book Schools as Sorters: Lewis M. Terman, Applied Psychology, and the Intelligence Testing Movement, 1890–1930. For a more standard biography of Terman, see Henry Minton's book Lewis M. Terman: Pioneer in Psychological Testing.

Notes

1 The formative nature of Binet's time as a "library psychologist" was promulgated years later by Binet himself in a letter written to the professor, Gaston Paris, who had arranged for his admission to the National Library (as it was not open to the public). According to Wolf (1973, 3–4), Binet wrote to Paris that "it was my studies [there] that decided my vocation."


2 The diagnosis of hysteria that had been invented for the women who were inmates at the hospital was essentially a grab bag of symptoms that ranged from things that had at least the veneer of objectivity (such as paralysis, fainting, insomnia, and loss of memory) to those that were almost purely in the eye of the beholder (such as anxiety and sexual desire). Hysteria is not a neurological condition, and it is no longer recognized as a disorder by medical professionals. The term was also dropped from usage as a psychiatric description by the American Psychiatric Association in 1952. 3 Wolf (1973, 7) writes, "He summarized Balbiani's lectures on heredity for publication. The bibliographical references provided by Balbiani were current and must have furnished Binet with a healthy antidote to Mill's flagrant environmentalist position." 4 Under his leadership (he became vice president in 1901 and president in 1902), the influence of La Société grew dramatically (its members increased from 200 in 1900 to 750 in 1911). 5 The term debile could have been given the literal translation of "weak, stupid, or feeble." At the time, "feebleminded" was already being used to refer to anyone with an intellectual disability (including "idiots" and "imbeciles"), and the terms weak and stupid already had established pejorative connotations. This likely explains Goddard's decision to adopt the term moron, which derives from the Greek word moros, or dull. Goddard is portrayed as a two-dimensional nefarious character by Gould (1981), but as always, the truth seems more complicated. For a more three-dimensional look at Henry Goddard and his evolving perspective on the diagnosis and treatment of individuals with intellectual disabilities, see Zenderland (1987) and Cravens (1987).
6 Binet: "They seem to reason in the following way: 'Here is an excellent opportunity for getting rid of all the children who trouble us,' and without the true critical spirit, they designate all who are unruly, or disinterested in the school" (Binet & Simon, 1916, 169). 7 It is well established that although Théodore Simon played a critical role in helping Binet get access to subjects to try out their mental tests and in the work involved in designing and revising the tests, the publications on the measuring scale were all written solely by Binet. Hence, when referring to the writing from these sources, I only reference Binet, who was the intellectual force behind the method. But when referencing the name of the scale itself and the development of the 1905 and 1908 tests, I include Simon. 8 It would also be reasonable to refer to these different tests as "tasks" or "items," and the collected items administered as the full test. However, in many cases, what Binet referred to as a test did itself consist of multiple tasks, so to avoid confusion, I use the term test throughout even though some tests consisted of a single task. 9 These were taken from the work of the German psychologist Hermann Ebbinghaus (1850–1909). 10 "We did not know a single child; they appeared to us for the first time when they came to the examination. We know, however, that all were normal. The [school]masters were asked to designate only children of average intelligence, who were neither in advance of nor behind children of their own age, and who attended the grade correct for their years . . . we required that the subjects chosen should have an exact number of years in order that the development should be typical of that age" (Binet & Simon, 1905c/1916, 92). 11 Throughout all three iterations of the scale, Binet remained vague about the exact number of students underlying the age norms or the sense in which they could be considered representative of some target population.
12 Four of the five sentences contained similarly dark humor, which was apparently considered too "gruesome" or "frightful" to administer to American children when


the tests were translated to English, so different sentences were substituted. Binet seems to have found this amusing, noting in Binet and Simon (1908/1916) that "our Parisian youths laugh at them," but he also cautioned that finding substitutes of similar difficulty might not be so straightforward if they did not involve a culturally equivalent sense of the absurd in the developing mind of a child. 13 Goddard rectified this six years later in 1916 by publishing, in collaboration with Elizabeth Kite, the complete translation of Binet and Simon's work between 1905 and 1911 that I reference throughout this chapter as Binet and Simon (1916). 14 This quote from Boring is used to imply Boring's tacit approval of this state of affairs. The full quote in context shows this was not the case; Boring was only offering a pragmatic assessment of current practice: This is a narrow definition, but it is the only point of departure for a rigorous discussion of the tests. It would be better if the psychologists could have used some other and more technical term, since the ordinary connotation of intelligence is much broader. The damage is done, however, and no harm need result if we but remember that measurable intelligence is simply what the tests of intelligence test, until further scientific observation allows us to extend the definition. (Boring, 1923, 35) 15 The concept of reliability and the estimation of a reliability coefficient in the context of mental testing, first discussed by Edgeworth (1888) in the context of school examinations, would figure prominently in the work of Charles Spearman at the same time that Binet was developing the three iterations of his scale. See Chapter 6. 16 In an interview with Theta Wolf, Simon explained: "We abandoned the tests that did not demonstrate patent differences. But we never applied the rule of three-quarters [75% success], which, after us, was demanded to place a test at a determined age. This rule was formulated by the German author, O. Bobertag.
It is convenient, but for my part I do not believe it very good. . . . There are some tests whose results improve year by year; some that give only mediocre results for many years, and then abruptly the number of successes increases. These are much the best . . . and as much as possible we kept them. . . ." (Simon, 1954 as cited by Wolf, 1973, 176). 17 Indeed, it seems much of the impetus for Binet's 1911 revision and his publication describing it were to respond to and rebut the Decroly and Degand critique. 18 Visit http://tinyurl.com/leopold-developmentofnt00bineuoft to download Kite's book. 19 For a compilation of French and English translations, see https://sites.google.com/site/alfredbinet18571911/home/oeuvre.

6 MEASUREMENT ERROR AND THE CONCEPT OF RELIABILITY

6.1 Overview

Although measurement procedures are designed to follow a standard protocol, when the same procedure is replicated on different occasions, the measure we observe will typically vary. What can we conclude about this state of affairs? One possibility is that the differences in the measures can be attributed to some unintended deviation from the procedure and that this deviation can be conceptualized as an "error" that behaves as if it were a random event. On some occasions, it will be positive; on others, it will be negative; but across repeated occasions, it will average out to zero. The simplest model for what I am describing is that any measure we observe, X, can be decomposed into two elements, a value that is real or "true" (T), and another value that is error (E); hence, X = T + E. Now, what is meant by a true value and an error value is not at all obvious, and it is something that we will be contemplating in this chapter. Nonetheless, many readers are likely to recognize the preceding equation as the linear model that also happens to serve as the fundamental equation of classical test theory (Novick, 1966; Lord & Novick, 1968). The idea that measurement always involves some degree of uncertainty due to error suggests that one way to improve the quality of a measurement procedure is to make purposeful design decisions that will reduce this error. In the physical sciences, this is often accomplished by minimizing the role of human judgment. This does not remove the notion of errors in measurement from the story, but it tends to reduce the size of the error to an amount that is negligible for the purpose at hand.

DOI: 10.1201/9780429275326-6
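As a minimal sketch of my own (the true score of 50 and error standard deviation of 3 are arbitrary illustrative values, not from the text), the behavior of X = T + E under random error can be simulated:

```python
import random

random.seed(7)

# One examinee's true score, fixed across replications of the procedure.
T = 50.0

# Replicate the measurement many times: X = T + E, where the error E
# is positive on some occasions and negative on others.
errors = [random.gauss(0, 3) for _ in range(100_000)]
observed = [T + e for e in errors]

mean_error = sum(errors) / len(errors)
mean_observed = sum(observed) / len(observed)

# Across repeated occasions the error averages out to roughly zero,
# so the mean observed score recovers the true score.
print(round(mean_error, 2))     # close to 0
print(round(mean_observed, 1))  # close to 50.0
```

The point of the sketch is only that a random error, replicated often enough, contributes nothing to the average; any single observed score, of course, still deviates from T.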

178 Measurement Error and Reliability

By contrast, in the human sciences, while sources of error can also be anticipated and reduced, human judgment is not just some undesirable source of confounding but typically a central aspect of the procedure itself. In this chapter, we focus particular attention on the use of a test as a measurement procedure. Tests are written to elicit responses from individuals, and some method needs to be employed to translate the responses into scores. In some cases, individuals essentially score themselves in the choices they make to questions with fixed response options; in others, written responses are scored by an external observer. For the moment, let us grant the idea, without yet interrogating it, that upon replications of a measurement procedure, the scores we would observe will vary due to one or more sources of error. Now, if this is true, and if all individual test scores are subject to error, one question we might ask is what proportion of the observed variability in individual differences, σ²(X), is attributable to real differences, σ²(T). That is, what is the value of ρX = σ²(T)/σ²(X)? The term ρX is a reliability coefficient, and to a great extent, the entire purpose of classical test theory is to, on one hand, make the assumptions and definitions that motivate this coefficient explicit, and to, on the other hand, propose a variety of statistics that can be used to estimate it from the data gathered in one or more test administrations. In 1950, Harold Gulliksen published what was at the time the most comprehensive (and accessible) treatment of test theory in his book Theory of Mental Tests. In the introductory chapter of this book, Gulliksen (1950) writes: Scientific confidence in the possibilities of measuring individual differences revived in this country [United States] with the introduction of the Binet scale and the quantitative techniques developed by Karl Pearson and Charles Spearman at the beginning of the twentieth century.
Nearly all the basic formulas that are particularly useful in test theory are found in Spearman's early papers; see Spearman (1904a, 1904b), (1907), (1910), and (1913). (1) The phrasing of this passage may seem curious, as Gulliksen was in essence characterizing the test theory at the heart of his book as stemming from a set of "basic formulas" that were originally "found" in the early papers of Charles Spearman. Yet this is exactly right. The formulas that Gulliksen references were ones that Spearman had first proposed and applied at the turn of the 20th century to the correlational study of individual differences. In this chapter, we explore the context that led Spearman to propose these formulas, the conceptual issues the formulas raise about the nature of measurement error, and the controversy that ensued from his application of the formulas to support a theory of general intelligence. In the following chapter, we see how Spearman's ideas and observations about individual differences led him to the complete development of his


two-factor theory of mental ability and invent the method that would evolve into factor analysis. It is also in the next chapter that we explore the broader question of how Spearman seems to have conceived of measurement in the human sciences. For the time being, we put this question on hold and accept the premises that (1) the act of ranking or testing students with respect to some mental attribute of interest can be conceptualized as a measurement procedure, (2) the procedure is influenced by errors in human judgment, and, finally, (3) that, when certain assumptions hold, the influence of these errors can be quantified.
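The reliability coefficient described in the overview, the ratio of true-score variance to observed-score variance, can be made concrete with a small simulation (the variances below are hypothetical values of my own choosing, not from the text):

```python
import random

random.seed(11)

n = 50_000  # simulated examinees


def variance(xs):
    """Sample variance with an n - 1 denominator."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)


# True scores T vary across individuals; each observed score X adds
# an independent random error E, so that X = T + E.
true_scores = [random.gauss(100, 15) for _ in range(n)]
observed = [t + random.gauss(0, 10) for t in true_scores]

# Reliability: the proportion of observed-score variance
# attributable to true-score variance.
reliability = variance(true_scores) / variance(observed)

# With var(T) = 225 and var(E) = 100, the population value is 225/325, about 0.69.
print(round(reliability, 2))
```

In practice true scores are never observed directly, which is why classical test theory must propose statistics that estimate this ratio from observable data alone; the simulation simply shows what the ratio means when the decomposition is known.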

6.2 Spearman's Background

Charles Spearman was born in London in 1863, but beyond this unremarkable fact, very little is known about his education, interests, and upbringing as he approached his 20th year in 1883. His upbringing was upper middle class, but unlike his predecessor in the study of individual differences, Francis Galton, he had neither a distinguished family lineage nor accumulated wealth to rely on. An only child, Spearman lost his father when he was just 2

FIGURE 6.1 Charles Spearman (1863–1945).

Source: © National Portrait Gallery, London.


years old, leaving him with a widowed mother and a half brother from his father's first marriage. According to Lovie and Lovie (1995), between 1876 and 1882, Spearman attended a public school in Warwickshire (about 90 miles north of London), a location near the new residence of his mother, who had remarried in 1870. Unsure about a direction for a professional career, Spearman enlisted in the military and was given a commission that placed him in India. By 1894, Spearman had attained the rank of captain and returned to England for a two-year period to attend Staff College in Camberley. He emerged with a PSC ("passed staff college") designation and returned to his regiment in 1896. Just two years later, Spearman resigned his commission and decided to pursue a PhD in psychology at the University of Leipzig as a part of Wilhelm Wundt's already famous laboratory. But before he could fully immerse himself in his studies, Spearman was called back to British military service from 1899 to 1902 in support of the Boer War.1 At the culmination of this service, Spearman, now married, spent a few months with his wife and newborn daughter living in the Berkshire village of Appleton before returning to Leipzig to continue his doctoral studies in December 1902.2 It was during these months prior to his return to Germany that Spearman, inspired by reading Galton's book Inquiries Into Human Faculties, decided to explore some of Galton's hypotheses about the nature of human intelligence by collecting data on the children attending his neighborhood school. Spearman ultimately turned these data, and the analyses they precipitated, into two published articles in the American Journal of Psychology (Spearman, 1904b, 1904c). These two publications would prove to be the most influential of his career and collectively may well be the most influential papers in the history of quantitative psychology.
After acquiring his PhD in 1906 at the age of 43, Spearman returned to England in 1907, succeeding William McDougall as a Reader3 in Experimental Psychology and the head of a small psychological laboratory at University College London. In 1911, just as Karl Pearson was appointed to the newly established Galton Chair at University College London, Spearman had already been promoted to Grote Professor of Philosophy of Mind and Logic and was the head of a psychology concentration within a larger department of philosophy. This was a position he would maintain through 1928, when his program separated from philosophy and became its own department of psychology. Upon his retirement in 1931, he was University College London's first and most famous professor of psychology and a fellow of the Royal Society (elected in 1924). Spearman remained active until his death in 1945, particularly during the 1930s, when he made multiple extended trips to the United States to attend professional conferences, lecture at universities, serve on invited committees, and visit with former students. When it comes to understanding Spearman's career choices and motivations, mysteries abound. In his autobiography, Spearman has no more than two


paragraphs to spare for the period in his life that preceded his decision to enlist in the military. All that we know is that in Spearman's reflections on his childhood (at the age of 67), he recalls two driving forces, a devotion to "games and sport" and an abiding interest in philosophy. Similarly, we know very little about Spearman's initial 12 years of military service or the impetus for his decision to leave the military and seek out a graduate degree in psychology within two years of completing a two-year undergraduate degree at Staff College.4 In his reflections, Spearman (1930) described his long stint in the military as the mistake of his life: I had decided to turn to a short spell of military service. This diversion of activity was, for one reason or another, allowed to spin out far longer than originally anticipated. And for these wasted years I have since mourned as bitterly as ever Tiberius did for his lost legions. (300) Was Spearman born with skills in organization, leadership, and self-discipline that made him successful in both the military and academia, or were Spearman's experiences in the military environment critical to his success in the academic one? We can never know, and as in so many things, it was likely a mixture of both nature and nurture. Nonetheless, all evidence suggests that Spearman quickly came to appreciate that strategy and combat were just as central to academic success as they were to military success. Another side benefit of Spearman's stint in the military had been the significant periods of downtime that he filled by reading works of philosophy.
It was during this period that Spearman, like Galton before him, came to form a decidedly negative view of the empiricist tradition characterized by the writing of David Hume, David Hartley, and John Stuart Mill, which held that all mental experience was (in Spearman's words) "at bottom nothing more than an aggregate of sensations variously associated with one another" (Spearman, 1930, 300). Throughout his career, Spearman maintained a strong objection to associationism (also known as sensationalism, or sometimes sensualism), and it was one of the reasons that he and Edward Thorndike would come to such different theories about the workings of human cognition and the nature of intelligence. Spearman's biography is a reminder of how often great discoveries are serendipitous. In Spearman's telling, it was on something of a lark that after reading Galton's Human Faculties, he decided to conduct his own study to explore the theory that tests of sensory judgment and reaction (what Cattell had first called "mental tests") could be used in place of direct judgments of observers (i.e., teachers, examiners) as measures of intelligence. The result of Spearman's study offered what appeared to be strong support for Galton's theory: all the measures he had collected were in fact positively correlated. It was only after Spearman had completed this first small study that he became aware of a preexisting study that had just recently been conducted on a much larger scale by Clark Wissler,

182 Measurement Error and Reliability

an American graduate student and advisee of Cattell's at Columbia University. In this study, published in 1901, Wissler had performed a comparison between mental tests and academic performance similar to that of Spearman but with a much larger sample of college-age students and using a wider variety of mental tests and measures of academic achievement. In contrast to Spearman's results, Wissler had found that mental tests were essentially uncorrelated with grades in academic subjects and not even correlated among themselves. It was the need to reconcile Wissler's contradictory findings with his own that led Spearman to the methodological and theoretical contributions for which he is most famous.

6.3 Disattenuating Correlation Coefficients

Galton's work in anthropometrics during the 1880s and his introduction of the method of correlation in 1888 marked the advent of the study of individual differences as a distinct field within psychology. What Galton and Cattell had established was the potential to administer mental tests under standardized conditions with great efficiency. What was more speculative was that mental tests might serve as a "purer" measure of native intelligence because they were unlikely to be influenced by environmental differences in a person's upbringing. Also near the turn of the century, Pearson had given Galton's method of correlation, which Galton had worked out graphically through geometric intuition, a more complete mathematical treatment, and had introduced a simple formula, first derived by the mathematician Auguste Bravais (Stigler, 1986, 353), that could be used to compute a correlation coefficient between any two continuous variables, x and y. The full form of the expression for a sample comprised of n individuals is

r_xy = ∑ᵢ₌₁ⁿ (xᵢ − x̄)(yᵢ − ȳ) / √[∑ᵢ₌₁ⁿ (xᵢ − x̄)² ∑ᵢ₌₁ⁿ (yᵢ − ȳ)²].  (6.1)

Here the numerator takes, for each variable, the product of each individual's deviation from the mean—the covariance—and then these products are summed across all individuals. The result is placed on a common −1 to 1 scale by dividing it by the product of the standard deviations of each variable. A simpler mathematically equivalent expression with arguably greater intuitive appeal is

r_xy = [1/(n − 1)] ∑ᵢ₌₁ⁿ z_xᵢ z_yᵢ,  (6.2)

where the terms z_xᵢ and z_yᵢ are standardized versions of x and y. That is, z_xᵢ = (xᵢ − x̄)/s_x, with s_x = √[∑ᵢ₌₁ⁿ (xᵢ − x̄)²/(n − 1)], and analogously for z_yᵢ. Here we can see that the correlation coefficient is simply the average of the standardized products, which is why it is often described as the "Pearson product-moment correlation."
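The equivalence of the two forms of the correlation coefficient, Equations 6.1 and 6.2, can be checked numerically. The following sketch uses simulated data of my own (not Spearman's or Wissler's) to compute both and confirm they agree:

```python
import math
import random

random.seed(3)

n = 200
x = [random.gauss(0, 1) for _ in range(n)]
y = [0.6 * xi + random.gauss(0, 0.8) for xi in x]

xbar = sum(x) / n
ybar = sum(y) / n

# Equation 6.1: sum of products of deviations, over the square root of
# the product of the sums of squared deviations.
num = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
den = math.sqrt(sum((xi - xbar) ** 2 for xi in x) *
                sum((yi - ybar) ** 2 for yi in y))
r1 = num / den

# Equation 6.2: average (with an n - 1 divisor) of the products of
# standardized scores.
sx = math.sqrt(sum((xi - xbar) ** 2 for xi in x) / (n - 1))
sy = math.sqrt(sum((yi - ybar) ** 2 for yi in y) / (n - 1))
r2 = sum(((xi - xbar) / sx) * ((yi - ybar) / sy)
         for xi, yi in zip(x, y)) / (n - 1)

# The two expressions are algebraically equivalent.
print(abs(r1 - r2) < 1e-12)  # True
```

The equivalence follows because the denominator of the first form equals (n − 1)·s_x·s_y, which is exactly the divisor introduced by standardizing in the second form.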


This was the new tool that Clark Wissler (1901) had used to compare the performance of roughly 200 college students on the mental tests from Cattell's laboratory. Wissler examined the correlations not only among the scores on mental tests but also with students' grades in a variety of college courses. What he found seemed to indicate that whatever it was that mental tests were measuring, it was something quite different from the attributes that would explain individual differences in academic rankings. While college grades across different subjects had moderate intercorrelations, mental test scores and college grades were essentially uncorrelated. The correlations Wissler found were, on the whole, so low as to suggest little support for the thesis that mental tests could provide insights about human intelligence. A study conducted the following year on younger students yielded similar findings (Aikens, Thorndike, & Hubbell, 1902). Spearman, as I have noted, was apparently unaware of these findings when he launched his own investigation into the correlation between mental tests of sensory discrimination and other competing measures of intelligence. His mental tests consisted of tasks intended to stimulate the senses of hearing, sight, and touch. In these tasks, children were asked to distinguish between differences in the pitch of two sounds, the weight in two containers, and the light in two pictures. The competing measures of intelligence were of two different varieties: the holistic judgments of an observer and the children's performance on school examinations. Spearman regarded these as measures of "intelligence"5 in the sense that they corresponded most closely to common societal conceptions of the term.
In following Galton’s line of thought, he suspected that the newly developed mental tests were equally acceptable candidates for the measurement of intelligence—whatever this meant—but to maintain an a priori distinction, he characterized his mental tests as measures of “discrimination.” Spearman’s data came from two British schools. The frst was a “village school” with a total of 60 children ranging in age from 5 to 13 years; the second school was “high class” preparatory school composed of 37 boys with ages ranging from 9 to 13. The correlations he would compute between measures of discrimination and intelligence came from subsamples of the students attending these two schools, and as summarized in Table 6.1, the ones to which Spearman devoted the most attention were the 24 oldest children (males and females between the ages of

TABLE 6.1 The Primary Analytic Samples Used in Spearman's 1904 Publications

Sample          N    Ages    Sex               Measures
Village School  24   10–13   Male and Female   Mental tests of pitch, weight, and light discrimination; holistic rankings of school cleverness and common sense
Prep School     22   9–13    Male              Mental test of pitch discrimination; school examinations in the subjects of the classics, French, English, mathematics, and music


10 and 13) from the village school and 22 children from the preparatory school (boys between the ages of 9 and 13).6 (In what follows for the rest of this chapter, I refer to these as the village school and prep school samples, respectively.) Measures of discrimination were based on all three of Spearman's mental tests in the village school sample but only pitch discrimination in the prep school sample (as only tests of pitch discrimination could be easily administered to a group). His competing measures of intelligence were based on holistic rankings in the village school sample. Spearman had asked the children's teacher to rank the children with respect to their "brightness of schoolwork," and then asked the two oldest children in the sample to independently rank their classmates with respect to their "sharpness and common sense out of school."7 In the prep school sample, competing measures of intelligence were based on the rankings of children provided to Spearman from the head of the school in the subjects of the classics, French, English, mathematics, and music on three different occasions (Christmas 1902, Easter 1903, and July 1903). Consider the results from Spearman's village school sample, summarized in Table 6.2. With the single exception of the correlation between pitch and light discrimination, all measures of discrimination and intelligence were positively correlated, and Spearman also found that the correlations were statistically significant.
Furthermore, while the average intercorrelation of his three discrimination measures (r = .23) was lower than the intercorrelation of either of his two intelligence measures (r = .65 or r = .54), the average intercorrelation between measures of discrimination and intelligence fell between these extremes, at r = .38.Why had Spearman found, with small samples, that measures of sensory discrimination and intelligence were signifcantly correlated, while Wissler, with much larger samples, had only found signifcant intercorrelations between measures based on academic grades? In puzzling over this, Spearman had a self-described “happy thought” (Spearman, 1930, 322), and that happy thought led to a formula for the disattenuation of a correlation coeffcient that is still used today:

TABLE 6.2 Observed Correlations From the Village Sample (N = 24)

                                          Mental Tests of        Holistic Judgments of
                                          Discrimination         Intelligence
                                          Pitch   Light  Weight  School      Common   Common
                                                                 Cleverness  Sense A  Sense B
  Mental Tests of    Pitch                1.0
  Discrimination     Light                –0.02   1.0
                     Weight               0.41    0.30   1.0
  Holistic           School Cleverness    0.25    0.47   0.37    1.0
  Judgments of       Common Sense A       0.44    0.42   0.31    0.45        1.0
  Intelligence       Common Sense B       0.42    0.45   0.29    0.54        0.65     1.0

$$r_{xy} = \frac{r_{x'y'}}{\sqrt{r_{x'x'}\,r_{y'y'}}}. \qquad (6.3)$$

In Equation 6.3, the observed correlation between two variables ($r_{x'y'}$) is divided by the geometric mean ($\sqrt{r_{x'x'}\,r_{y'y'}}$) of what Spearman would initially describe as “the average correlation of several independently obtained values” of x and y. The result is an estimate of a correlation coefficient ($r_{xy}$) that has been adjusted for the effect of “observational errors.” While the rationale for the corrective formula itself may not be immediately intuitive, Spearman’s idea of a correlation that has been weakened as a consequence of errors in measurement can be readily visualized through simulation. Consider the sequence of three scatterplots depicted in Figure 6.2. The two variables represented in each plot are hypothetical measures of adult height for a sample of 100 brothers, in which each point plotted represents a pair of brothers (i.e., height of brother 1 on the x-axis, height of brother 2 on the y-axis). In the first plot on the left, we imagine the possibility that height is either measured perfectly or with so little error that it would not be discernible to the human senses. In this idealized case, we find that the correlation between heights is .75. Next, in the center plot, we add a relatively small amount of random error such that the observed height that is measured for each brother will typically deviate from its actual value by 1 inch in either direction. Finally, in the last plot, we increase the error such that the random deviations are twice as large, now up to 2 inches in either direction. In the second and third plots, the range of plausible deviations from each brother’s actual height is represented by dotted intervals around the true values. If we were to draw an ellipse around the cloud of data in each case, it becomes evident that the correlational

FIGURE 6.2 Illustration of the Concept of an Attenuated Correlation.


relationship has been weakened, or attenuated. For the simulated data shown in the center plot, the correlation coefficient is attenuated from .75 to .70. In the right plot, a doubling of the error leads to a much larger attenuation of the correlation coefficient from .75 to .45. The correction for this attenuation due to measurement error through the use of Equation 6.3 hinges on the availability of the two reliability coefficients8 in the denominator. Let’s consider how Spearman went about this in applying the formula to disattenuate the correlations between measures of discrimination and intelligence observed for his village school sample. Now, what he wanted for this purpose was the average correlation of several independently obtained values of the measure of interest, whereby each of these values could be regarded as a replication of the same measurement procedure. Unfortunately for Spearman, his three mental tests of pitch, light, and weight discrimination had been administered just once. Only for his competing measures of intelligence did he have something that bore some faint resemblance to independent replications, and these came from the two sets of ratings for “common sense outside of school” provided by the two oldest children in the school. Spearman computed a correlation between these two ratings of .65, took this as an estimate of reliability, and, since it was the only one available, used it as a proxy for the unknown reliability of his mental tests. Now, as shown in Table 6.2, Spearman had originally found an observed correlation between pitch discrimination and school cleverness of .25. Thus, the adjusted correlation came to

$$r_{xy} = \frac{.25}{\sqrt{.65 \times .65}} = .39.$$

Applying the same method to disattenuate the correlations between cleverness and his other two mental tests (light and weight discrimination), the two observed correlations of .47 and .37 increased by quite a bit more, to .73 and .58. By the same token, the average correlation between mental tests and each holistic ranking of common sense increased from .38 to .58. As audacious as it was to make these adjustments with only a single crude proxy for reliability available, Spearman was just getting started. Because the rankings from each of the three mental tests in the village school sample generally had similar correlations with the intelligence ratings, he concluded that each of the three mental tests could be used interchangeably as a measure of sensory discrimination. So he took the average intercorrelation between these three measures, which was .25, as the best estimate of the reliability of any measure of discrimination. By a similar rationale, he regarded the ratings of cleverness and common sense as interchangeable measures of intelligence and took the average intercorrelation between them, which was .55, as the best estimate for their reliability. He then applied the disattenuation formula once again using these new reliability coefficients, but this time to the average observed


intercorrelation between measures of discrimination and measures of intelligence as follows:

$$r_{xy} = \frac{.38}{\sqrt{.25 \times .55}} = 1.01.$$

He would apply a similar approach to adjust the observed correlations from his prep school sample9 and come to a similar result. This was the initial basis that Spearman provided for the conclusion that mental tests of sensory discrimination and holistic ratings of intelligence measure the same thing.10 Putting aside for the moment the black magic that had turned an observed correlation of .38 into one that was now (impossibly) more than perfect, there was genuine wisdom in Spearman’s thesis that one should be careful about taking observed correlations at face value. These correlations could be distorted not only by measurement error but also by what we would today describe as the influence of confounding variables (Spearman described these as irrelevant variables, e.g., age and practice) and restriction of range (i.e., having a sample that is more homogeneous than the target population). This was how he reconciled his findings with those of Wissler, essentially arguing that Wissler’s correlations between mental tests and intelligence had been dramatically attenuated by a combination of extreme measurement error and restriction of range.11 The idea that “faulty” correlations can be corrected to their true values not by making improvements to the measurement procedure but by statistical adjustment may seem a bit too good to be true. And the fact that Spearman had generated adjusted correlations that were more than perfect raised some prominent eyebrows (in particular, see Pearson, 1904). With this in mind, we now need to look more closely at both the assumptions underlying Spearman’s disattenuation formula, in general, and his reliability coefficients, in particular, and whether these assumptions are plausible.
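Spearman’s arithmetic is easy to reproduce. The sketch below is hypothetical code (not from the book) that applies Equation 6.3 to the village school figures reported earlier; the function and variable names are my own.

```python
import math

def disattenuate(r_xy, r_xx, r_yy):
    # Equation 6.3: observed correlation divided by the geometric mean
    # of the two reliability coefficients
    return r_xy / math.sqrt(r_xx * r_yy)

# Pitch discrimination vs. school cleverness (observed r = .25), with the
# .65 correlation between the two common-sense ratings used as the
# reliability proxy for both measures; Spearman reported .39
r1 = disattenuate(0.25, 0.65, 0.65)

# Average discrimination-intelligence correlation (.38), with the average
# intercorrelations .25 and .55 taken as reliabilities; this is the
# "more than perfect" result Spearman reported as 1.01
r2 = disattenuate(0.38, 0.25, 0.55)

print(round(r1, 2), round(r2, 2))
```

That the adjusted correlation exceeds 1 is itself a warning sign that the reliability coefficients placed in the denominator are too small.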

6.4 Replications, Occasions, and Measurement Error

6.4.1 Yule’s Proof

It is in the assumptions underlying Spearman’s disattenuation formula that the linear error model of classical test theory first comes into view. Although these assumptions may have been implicit to Spearman’s initial application of the formula as described earlier (Spearman, 1904c, 254–255, 271), and again when he published a proof for the formula a few years later (Spearman, 1907), it was really not until a proof of the attenuation formula, written by Udny Yule, that we see the appearance of an equation in which a formal distinction is made between an observed value, a true value, and measurement error. Yule had sent this proof to Spearman in a private letter in October 1908, and Spearman included it as “appendix e” of Spearman (1910).12 I include the entirety of Yule’s


short proof here with some commentary (in his original compact notation, it is easy to overlook some important details). Yule’s proof begins as follows: $x_1$ and $y_1$ are measures of x and y at a certain series of measurements, $x_2$ and $y_2$ are measures of x and y at another series of measurements. Let

$$x_1 = x + \delta_1, \quad x_2 = x + \delta_2, \quad y_1 = y + \varepsilon_1, \quad y_2 = y + \varepsilon_2,$$

all terms denoting deviations from means. Implicit to Yule’s statement that $x_1$ and $y_1$ and $x_2$ and $y_2$ are measures of x and y “at a series of measurements” is the notion that a common measurement procedure is being applied one at a time to a series (or set) of objects for the purpose of characterizing individual differences with respect to attributes x and y. That is, each of the four equations applies to a specific object, such that in place of $x_1 = x + \delta_1$, we could write $x_{i1} = x_i + \delta_{i1}$, $x_{i2} = x_i + \delta_{i2}$, and so on, with the subscript i indexing the object of measurement (where in this context the “object” is a person). Also left unsaid is the premise that $\{x_1, x_2\}$ and $\{y_1, y_2\}$ are the observed values that result from two replications of the same measurement procedure and that $\{x, y\}$ are the true values. They are true values in the mathematical sense that the values remain fixed across these replications. But more important, what Spearman had in mind as a “true value” was an actual value for each attribute that existed independent of the efforts to measure it, and as such, this represents one of the very first distinctions between an observed variable (e.g., $x_1$) and a latent variable (e.g., x) in psychometric research. Now we come to Yule’s brief but explicit statement and assumptions about measurement error: Then, if it is assumed that δ, ε, the errors of measurement, are uncorrelated with one another or with x or y,

$$\sum x\delta \ \text{etc.} = 0, \qquad \sum x_1 y_1 = \sum xy.$$

The two expressions above involve the sums of products taken over the series of objects i = 1, 2, . . . , N being measured. Since all terms represent deviations from means (taken over the objects), the expressions represent assumptions about covariances, previously seen in the numerator for the Pearson product-moment correlation shown in Equation 6.1.
Note that one could just as easily express the covariance between any two terms (i.e., $\sum x_1 y_1$) as the product of a correlation coefficient (i.e., $r_{x_1 y_1}$) and standard deviations (i.e., $\sigma_{x_1}\sigma_{y_1}$), and this is what Yule does next for each of the four covariances of observed measures $\{x_1, y_1\}$, $\{x_2, y_2\}$, $\{x_1, y_2\}$, and $\{x_2, y_1\}$. Given the assumption he was making of


uncorrelated errors of measurement, each of these covariances should be equal to the covariance between the true values of the two measures, $\{x, y\}$. Hence,

$$r_{x_1 y_1}\sigma_{x_1}\sigma_{y_1} = r_{xy}\sigma_x\sigma_y,$$

and similarly,

$$r_{x_2 y_2}\sigma_{x_2}\sigma_{y_2} = r_{xy}\sigma_x\sigma_y, \quad r_{x_1 y_2}\sigma_{x_1}\sigma_{y_2} = r_{xy}\sigma_x\sigma_y, \quad r_{x_2 y_1}\sigma_{x_2}\sigma_{y_1} = r_{xy}\sigma_x\sigma_y,$$

so that, multiplying the four equations together,

$$r_{xy}^4 = r_{x_1 y_1} r_{x_2 y_2} r_{x_1 y_2} r_{x_2 y_1} \frac{\sigma_{x_1}^2 \sigma_{x_2}^2 \sigma_{y_1}^2 \sigma_{y_2}^2}{\sigma_x^4 \sigma_y^4}.$$

The culminating steps of Yule’s proof invoke the assumptions of uncorrelated errors of measurement to simplify this expression. But also, since $\sum x\delta = 0$, $\sum x_1 x_2 = \sum x^2$, and

$$r_{x_1 x_2}\sigma_{x_1}\sigma_{x_2} = \sigma_x^2 \quad \text{or} \quad \sigma_{x_1}\sigma_{x_2} = \frac{\sigma_x^2}{r_{x_1 x_2}}, \quad \text{and} \quad \sigma_{y_1}\sigma_{y_2} = \frac{\sigma_y^2}{r_{y_1 y_2}},$$

so that

$$r_{xy}^4 = \frac{r_{x_1 y_1} r_{x_2 y_2} r_{x_1 y_2} r_{x_2 y_1}}{r_{x_1 x_2}^2\, r_{y_1 y_2}^2} \quad \text{or} \quad r_{xy} = \frac{\sqrt[4]{r_{x_1 y_1} r_{x_2 y_2} r_{x_1 y_2} r_{x_2 y_1}}}{\sqrt{r_{x_1 x_2}\, r_{y_1 y_2}}}.$$

In words, the final expression of Yule’s proof, $r_{xy} = \sqrt[4]{r_{x_1 y_1} r_{x_2 y_2} r_{x_1 y_2} r_{x_2 y_1}} \big/ \sqrt{r_{x_1 x_2}\, r_{y_1 y_2}}$, shows that the true correlation between two variables that have been measured twice can be found by dividing the geometric mean of the four unique intervariable correlation coefficients by the geometric mean of the two intravariable correlation coefficients. Notice that Spearman’s attenuation formula is making the further simplifying assumption that $r_{x_1 y_1} = r_{x_2 y_2} = r_{x_1 y_2} = r_{x_2 y_1}$; hence, if we let $r_{x'y'}$ represent any observed correlation between two measures of the variables x and y on any single occasion, then it would follow that $r_{xy} = \dfrac{r_{x'y'}}{\sqrt{r_{x'x'}\, r_{y'y'}}}$.
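Yule’s final expression can be checked numerically. The sketch below is a hypothetical simulation (not from the book): it generates true scores x and y with a known correlation, produces two error-laden replicates of each, and recovers the true correlation from the observed correlations alone.

```python
import math
import random

def corr(a, b):
    # Pearson product-moment correlation of two equal-length sequences
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    sa = math.sqrt(sum((u - ma) ** 2 for u in a))
    sb = math.sqrt(sum((v - mb) ** 2 for v in b))
    return cov / (sa * sb)

random.seed(42)
N, rho, err_sd = 50000, 0.6, 0.7

# true scores with correlation rho (shared-factor construction)
shared = [random.gauss(0, 1) for _ in range(N)]
x = [math.sqrt(rho) * s + math.sqrt(1 - rho) * random.gauss(0, 1) for s in shared]
y = [math.sqrt(rho) * s + math.sqrt(1 - rho) * random.gauss(0, 1) for s in shared]

# two replicate measurements of each variable, each with independent error
x1 = [v + random.gauss(0, err_sd) for v in x]
x2 = [v + random.gauss(0, err_sd) for v in x]
y1 = [v + random.gauss(0, err_sd) for v in y]
y2 = [v + random.gauss(0, err_sd) for v in y]

# Yule's expression: the fourth root of the product of the four intervariable
# correlations, divided by the geometric mean of the two intravariable ones
numer = (corr(x1, y1) * corr(x2, y2) * corr(x1, y2) * corr(x2, y1)) ** 0.25
denom = math.sqrt(corr(x1, x2) * corr(y1, y2))
r_hat = numer / denom
```

With an error standard deviation of 0.7 against true scores of unit variance, each observed correlation is attenuated by a factor of roughly 1/1.49, yet `r_hat` lands back near the true value of 0.6.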

6.4.2 Thought Experiments and Shots Fired

There is a simplicity and elegance to Spearman’s insight and Yule’s proof that remain appealing over a century later. But the proof raises an important conceptual question; namely, in what sense, and under what conditions, can a measurement procedure be replicated? If a replication requires a different temporal


occasion, in what sense can differences in repeated measurements be attributed to random errors, as opposed to real changes over time? More to the point, for a measurement to be replicated, what about the measurement procedure is being held constant, and what is being allowed to vary? It turns out that these questions are not so easy to answer, and if Spearman gave this much thought, it was not conveyed in his writing. One way to appreciate the problem is to express Yule’s opening equation for a single variable, x, in this more general form:

$$x'_{ij} = x_i + \delta_{ij}, \qquad (6.4)$$

where, as before, the subscript i indexes the object of interest (i = 1, . . . , N) and the new subscript j indexes a unique temporal occasion on which a measurement of $x_i$ is taken. On each occasion, some quantity, $\delta_{ij}$, is added to the true value $x_i$. If we assume that $\delta_{ij}$ is both uncorrelated with $x_i$ and uncorrelated across occasions, then each realization of $\delta_{ij}$ represents a random, occasion-specific “error” of measurement. It seems the reason we think that repeated measurements are prone to error must have something to do with the interaction between the measurement procedure and the particular moment in time when the procedure is carried out. But what is this interaction, exactly? Let’s play this out with the example of measuring the length of an inanimate object. Consider some set of N tables. The first table in the set has a known length of $x_1$, the second one has a length of $x_2$, and so on to $x_N$. Imagine that I have been given instructions and training to enact a procedure in which I am to measure the length of a single table (i = 1) using a 1-meter-long ruler with graduated units demarcated in centimeters. This is to be done by first counting the number of times I need to concatenate the ruler before the remaining length is less than the ruler. At that point, the graduations along the ruler are used to find the remaining length in centimeters. The total length of the table is recorded after adding together the lengths I have concatenated. Now, if I were asked to repeat this identical procedure 100 times (i.e., J = 100)—and if I were actually willing to do it—why would we expect the length I record each time to vary? After all, the length of the table has not changed, and neither has my ruler. The only possible answer, it seems, is that when a new measurement is taken on a different temporal occasion, it triggers a metaphorical roll of the die or flip of the coin inside my head.
Perhaps the activity always triggers the same collection of neurons in my brain into action, but there is some randomness to how this happens that is specific to a given moment in time, and this randomness influences the measurement procedure. When I use the ruler to estimate the length of a table, at the point when I assign a number to represent the table length, a decision needs to be reached about the exact number that is the best match, and there is some randomness to that decision that represents “randomness in my soul.”13
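The coin-flip metaphor can be made concrete with a small hypothetical simulation (mine, not the book’s): the true table lengths are fixed across occasions, and the only thing that changes between occasions is an occasion-specific random error, so the test-retest correlation reflects nothing but the ratio of true-score variance to total variance.

```python
import math
import random

def corr(a, b):
    # Pearson product-moment correlation of two equal-length sequences
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    sa = math.sqrt(sum((u - ma) ** 2 for u in a))
    sb = math.sqrt(sum((v - mb) ** 2 for v in b))
    return cov / (sa * sb)

random.seed(7)
N = 5000
# fixed true lengths in centimeters; these do not change across occasions
true_lengths = [random.uniform(100, 200) for _ in range(N)]

# two measurement occasions; each application of the ruler adds an
# independent "roll of the die" with a standard deviation of 10 cm
occ1 = [t + random.gauss(0, 10) for t in true_lengths]
occ2 = [t + random.gauss(0, 10) for t in true_lengths]

reliability = corr(occ1, occ2)
```

With a true spread of about 29 cm against an error spread of 10 cm, the expected test-retest correlation is var(true) / (var(true) + var(error)) ≈ 833 / 933 ≈ .89, and that is roughly what the simulation returns.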


Note, however, that in this simple example, errors in measurement come not from anything specific to the object of measurement (i.e., the table) but from the person enacting the measurement procedure and making repeated, fallible judgments. That this is how Spearman seems to have been thinking about measurement error and reliability coefficients when he first introduced his attenuation formula can be inferred from the following example14 taken from Spearman (1904c, 271): A target was constructed of a great many horizontal bands, numbered top to bottom. Then a man shot successively at a particular series of numbers in a particular order; clearly, the better the shot, the less numerical difference between any number hit and that aimed at; now just as the measurement of any object is quite appropriately termed a “shot” at its real value, so conversely, we may perfectly well consider the series of numbers actually hit in the light of a series of measurements of the numbers aimed at. When the same man again fired at the same series, he thereby obtained a new and independent series of measurements of the same set of objects (provided, of course, that there be no appreciable constant error). At the risk of being pedantic, it may help to make Spearman’s example even more concrete by specifying that rather than “a great many” horizontal bands, there are 100, each band is 3 inches in height, and the target is positioned 50 meters away from the man taking shots. See Figure 6.3 for an illustration. The measurement procedure in this example involves the man taking hold of a rifle that has been secured to a pole that is perpendicular to the ground. The rifle can be moved, within limits, vertically but not horizontally. When in the

FIGURE 6.3 Spearman’s Band-Shooting Example.


lowest/highest position, the rifle will always hit the lowest/highest number in the band (a 1 or a 100). The unique series of numbers (say, N = 30) that the man is given to shoot are in random order between 20 and 80, so every target involves an active judgment with regard to the height the gun needs to be raised or lowered. The attribute in Spearman’s scenario is always the number on the target, $x_i$, and it takes on possible values between 20 and 80. An observed value the man shoots on the first occasion is $x'_{i1}$; an observed value on the second occasion is $x'_{i2}$. Now, provided that the shots at the same band across the two occasions can be conceptualized as independent instances of the same event, it follows from the assumptions in Yule’s proof that the correlation between these two observed sets of numbers, $r(x'_{i1}, x'_{i2})$, can be interpreted as an indication of the consistency of the measures, and if consistency is high, then a large proportion of the variability we observe in the numbers shot on any one occasion will be reflective of real differences in the numbers that were targeted. The reason this correlation is not perfect follows from the same reasoning as applied in the table measuring example. On each unique occasion that the man takes aim at a given target, there is some distinct amount of internal randomness that influences the shot (i.e., the measurement). It is this interaction between occasion and shot that we mean when we use the term “error” and identify it with the symbol $\delta_{ij}$ in Equation 6.4. Spearman then introduces the results from a second shooter: Next a woman had the same number of shots at some set of numbers in a similar manner. If, then, our above reasoning and formula are correct, it should be possible, by observing the numbers hit and working out their correlations, to ascertain the exact resemblance between the series aimed at by the man and woman respectively.
In actual fact, the sets of numbers hit by the man turned out to correlate with those hit by the woman to the extent of 0.52; but it was noted that the man’s sets were correlated with one another to 0.74, and the woman’s sets with one another to 0.36; hence the true correspondence between the set aimed at by the man and that aimed at by the woman was not the raw 0.52, but

$$\frac{.52}{\sqrt{.74 \times .36}} = 1.00,$$

that is to say, the two persons had fired at exactly the same series of bands, which was really the case. In introducing the second shooter, Spearman was, in effect, introducing a second linear measurement equation,

$$y'_{ij} = y_i + \varepsilon_{ij}, \qquad (6.5)$$


but imposing the restriction that $y_i = x_i$ since the man and woman have been given the same series of numbers to shoot in the same order. That is, in Spearman’s example, the two sets of objects on each occasion j are, by definition, identical, and the true adjusted correlation is perfect, since they are indeed “measuring” the same set of numbers. The assumptions of uncorrelated errors from Yule’s proof seem fairly plausible in Spearman’s band-shooting example. That is, we can imagine that each time either the man or woman takes a shot at each numeric target, he or she may aim a little too high or too low, and this amount can be conceived as if it were a random draw from a distribution with a mean of zero and a constant standard deviation. To the extent that every shot requires the shooter to make a judgment as to where to position the rifle, it seems reasonable to imagine that the specific value on the target has no (or at least negligible) influence on the amount of error. Finally, the measurement error specific to the man when he shoots at a given target on a given occasion should have no influence on the measurement error specific to the woman when shooting at the same band on the same occasion.15 This conceptualization of measurement error as attributable to errors in the repeated temporal observations of an observer, errors distinct from the object being measured, is consistent with the way that mathematicians and statisticians had thought about measurement error during the 19th century (e.g., see ch. 8 in Boring, 1950). Given that all the measures of academic ability available to him in his first village and prep school samples involved judgments and rankings from an external observer, it would have been an obvious parallel for Spearman to make the analogy between a teacher and an astronomer, on one hand, and the perception of a child’s intellect and the location of a celestial object in the sky, on the other.
But what about measures such as his mental tests of pitch, weight, and light discrimination? And what about other, more complex mental tests such as those that had been developed by Binet? Can we find two or more independent measurement replications in these contexts as well?
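Spearman’s band-shooting scenario can itself be simulated. The sketch below is hypothetical code, with the series lengthened far beyond N = 30 so that sampling noise is small, and with error spreads chosen so the expected reliabilities land near the .74 and .36 Spearman reported.

```python
import math
import random

def corr(a, b):
    # Pearson product-moment correlation of two equal-length sequences
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    sa = math.sqrt(sum((u - ma) ** 2 for u in a))
    sb = math.sqrt(sum((v - mb) ** 2 for v in b))
    return cov / (sa * sb)

random.seed(3)
N = 5000
targets = [random.uniform(20, 80) for _ in range(N)]  # the common series aimed at

# each shooter fires at the same series on two occasions;
# the woman's aim has the larger random spread
man1 = [t + random.gauss(0, 10) for t in targets]
man2 = [t + random.gauss(0, 10) for t in targets]
woman1 = [t + random.gauss(0, 23) for t in targets]
woman2 = [t + random.gauss(0, 23) for t in targets]

raw = corr(man1, woman1)                          # attenuated, roughly .5
rel_man = corr(man1, man2)                        # roughly .75
rel_woman = corr(woman1, woman2)                  # roughly .36
corrected = raw / math.sqrt(rel_man * rel_woman)  # recovers roughly 1.0
```

Because the two shooters are aiming at the identical series, the disattenuated correlation comes back near 1, just as in Spearman’s demonstration; the raw correlation between their hits sits near .5 only because of their fallible aim.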

6.4.3 The Problem of Defining a Unique Measurement Occasion

Spearman’s finding of a more than perfect disattenuated correlation between mental tests of sensory discrimination and traditional judgments of intelligence had drawn the ire of Karl Pearson, primarily because Spearman had made the tactical mistake of using a high-profile study that Pearson was just publishing to illustrate the ease with which correlational findings could be misinterpreted when they are not disattenuated. As part of a terse response to Spearman’s criticism, Pearson (1904) noted that if Spearman was finding and using correlations near .20 between independent observations (i.e., reliability coefficients) in his disattenuation formula, then he “must have employed most incompetent observers, or given them most imperfect instructions, or chosen a character suitable for random guessing rather than


observation in the scientific sense” (160). Much of what Pearson had to say in his response was both openly contemptuous and petulant and helped mark the onset of a lifelong feud between the two men.16 But when it came to raising questions about Spearman’s methods of imputing reliability coefficients for use in his disattenuation formula, Pearson seems to have been shooting at the right target. In stark contrast to the band-shooting example, there had been nothing that came close to approximating the ideal of replicated measurements for the mental tests Spearman had used with his village and prep school samples. And the structure of the formula sets up a perverse incentive for a researcher, wanting to find a strong adjusted correlation, to seek out unreliable measures. Spearman had argued that the adjusted correlation between mental tests of sensory discrimination and ratings of intelligence was $r_{xy} = \frac{.38}{\sqrt{.25 \times .55}} = 1.01$. But if the reliability of the mental tests had been .50 instead of .25, the adjusted correlation would be $r_{xy} = \frac{.38}{\sqrt{.50 \times .55}} = .72$. And if the reliability had been .80, the adjusted correlation would be just $r_{xy} = \frac{.38}{\sqrt{.80 \times .55}} = .57$. If the formula were to have a more defensible application, clearly it required a more defensible method of producing reliability estimates. One person who took on this task was William Brown (1881–1952). As a doctoral student at University College London, Brown set out to improve on Spearman’s study by administering a broader range of mental tests and doing so with subjects tested on two different occasions 24 hours apart. In 1911, Brown added some supplementary chapters to his dissertation exploring the correlations between mental tests, and what emerged was a textbook offering a didactic introduction to both the experimental methods of psychophysics and the correlational methods used to study individual differences. He titled the book, one of the first of its kind, The Essentials of Mental Measurement.
In this book, Brown (1911) criticizes the applicability of Spearman’s disattenuation formula, and in the process, Brown raises some questions about the proper interpretation of hypothetical variability across repeated measurements of an individual taking a mental test: In the case of almost all the simpler mental tests the quantities δ and ε are not errors of measurement at all. They are the deviations of the particular performances from the hypothetical average performance of the several individuals under consideration. Thus they represent the variability of performance of function within the individual. When an individual in the course of three minutes succeeds in striking through 100 e’s and r’s in a page of print on one day, and 94 under the same conditions a fortnight later, there is no error of observation involved. The numbers 100 and 94 are the actual true measures of ability on two occasions. The average or mean ability, which is the more interesting measure for the purposes of correlation, is doubtless


different from either, but that does not make the other two measures erroneous. Evidently in these cases δ and ε represent individual variability, and to assume them uncorrelated with one another or with the mean values of the functions is to indulge in somewhat a priori reasoning. (86) Here we see Brown articulating what seems to be a fundamentally different interpretation of the terms in the two expressions generalized from Yule’s proof, $x'_{ij} = x_i + \delta_{ij}$ and $y'_{ij} = y_i + \varepsilon_{ij}$. Instead of interpreting $x_i$ and $y_i$ as true scores in the physical sense implied in Spearman’s band-shooting example, Brown characterizes them as the “hypothetical average performance” of each individual. What Brown describes here as the “actual true measures” are the observed values, $x'_{ij}$ and $y'_{ij}$, which are themselves the sum of the hypothetical average and a real, occasion-specific deviation that represents an individual’s ability at a particular point in time. Brown thus considers the assumptions from Yule’s proof that the x’s are uncorrelated with the δ’s, that the y’s are uncorrelated with the ε’s, and that the δ’s are uncorrelated with the ε’s, an indulgence in “a priori reasoning.”17 There was much in need of elaboration here that Brown failed to provide. What causes $\delta_{ij}$ and $\varepsilon_{ij}$ to be nonzero? If these are not a reflection of “an error of observation,” as Spearman had originally cast it, but a reflection of real differences in individual ability at a single point in time, why would they be expected to vary? And perhaps most important, is there a distinction between the “hypothetical averages” $x_i$ and $y_i$, which have been operationally defined by the choice of task on a mental test, and the underlying mental abilities they were designed to measure? To a great extent, Brown’s interpretation of $\delta_{ij}$ and $\varepsilon_{ij}$ is still consistent with the earlier explanations offered for the differences we would observe with repeated measurements of the length of a table or shots to a target.
Whether we wish to describe the variability as “error” or as an individual difference, it may be plausible to regard it as if it were a random interaction between a human judgment and a unique temporal occasion. If so, then if it were possible to observe the results from two independent replications of the same mental test, surely we could get defensible estimates of reliability and find our way clear to properly apply Spearman’s disattenuation formula. But are independent replications ever attainable for mental tests? Probably not. Consider the example of a mental test of short-term word recall. Say that we give a person a list of 10 vocabulary words, give the person 30 seconds to memorize the list, remove the list, wait another 30 seconds, and then ask the person to recall as many words as possible. It turns out that the person is able to recall 5 words correctly. If we were to repeat the exact same procedure with the same person the next day, the idea is that we expect any change in the words recalled to be caused by some inherent randomness in the cognitive process invoked any time a person is asked to perform the activity of memorizing a list of words, in the same sense that there was some inherent randomness


in the process I would use to estimate the length of a table. It would be natural to think that a person would be expected to improve in the number of words recalled from one day to the next, either due to practice with the procedure, increased familiarity with the words, or both. Yet this indicates a mismatch between the experiment envisioned and the experiment we can actually conduct. The experiment we envision is one in which the person’s base cognitive state is unaffected by repeating the same measurement procedure. So if, in fact, the reason a person’s word recall score would change is at least in part because of practice or familiarity, we have failed to produce an independent replication of the procedure. The only true replication of the procedure we could fully trust would be one in which it is possible to brainwash the person to forget the initial testing experience.18 The further the realized experiment departs from this ideal, the harder it becomes to claim that one can distinguish between true variability and variability due to error—at least as we have conceived of it here.19 The upshot of all this is that it is exceedingly hard to argue that repeated measurements could plausibly constitute true replications for anything but the simplest mental test. If we make the time increment between occasions too short, then it becomes less plausible that a person’s performance from one occasion to the next can be appropriately cast as being influenced by two temporally distinct metaphorical flips of a coin. Furthermore, the shorter the time increment between occasions, the more likely that the person either experiences fatigue or becomes overly familiar with the test, either of which can confound the measurement procedure. If we make the time increment too long, then the mental attribute is more likely to change between occasions, again confounding the measurement procedure. The solution?
Reconceptualize what it means to replicate a measurement procedure.

6.5 Varying Test Items and the Spearman–Brown Prophecy Formula

It was during this period of controversy that followed the publication of Spearman’s 1904 papers that Spearman and Brown independently formalized a new approach to estimating a reliability coefficient that involved a novel conceptualization of a measurement replication (Spearman, 1910; Brown, 1910). The linchpin of this new approach was a formula that became known as the Spearman–Brown “prophecy” formula, and the generalized version of it can be written as

r_n = n r_o / (1 + (n − 1) r_o),  (6.5)

where r_o is the reliability of an existing test (the subscript o stands for original), n is a factor by which the length of a test has been increased, and r_n is the reliability that would be predicted for this lengthened test (where the subscript n stands for new).


[FIGURE 6.4 Using the Spearman–Brown Formula to Show the Impact of Doubling Test Length on Reliability. The plot shows new reliability (y-axis, 0.0–1.0) against old reliability (x-axis, 0.0–1.0) for a doubled test length.]

Figure 6.4 illustrates the use of the Spearman–Brown formula for hypothetical tests with original reliability coefficients ranging between 0 and 1. Notice that the relationship is nonlinear, so that the impact depends on the original reliability of the test. For example, doubling the length of tests with original reliability coefficients of .2, .5, and .7 would be predicted to result in tests with new reliability coefficients of .33, .67, and .82, respectively. One of the more useful features of the formula is that with a little algebra, it can be inverted to answer the question, “How much longer will a test need to be if we wish to attain a desired level of reliability?”:

n = (1 − r_o) r_n / ((1 − r_n) r_o).  (6.6)
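As a quick arithmetic check, Equations 6.5 and 6.6 can be sketched in a few lines of Python; the reliability values below are the hypothetical ones used in the text.

```python
def spearman_brown(r_o, n):
    """Eq. 6.5: predicted reliability of a test lengthened by a factor of n."""
    return n * r_o / (1 + (n - 1) * r_o)


def length_factor(r_o, r_n):
    """Eq. 6.6: length factor needed to move reliability from r_o to r_n."""
    return (1 - r_o) * r_n / ((1 - r_n) * r_o)


# Doubling test length for the original reliabilities discussed in the text:
for r_o in (0.2, 0.5, 0.7):
    print(round(spearman_brown(r_o, 2), 2))  # prints 0.33, then 0.67, then 0.82
```

Inverting the relationship, `length_factor(0.5, 0.8)` gives a factor of 4: a test with reliability .5 would need to be made four times as long to reach a reliability of .8.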

A core premise that led to the derivation of the Spearman–Brown formula is that any “test” can be regarded as a composite of multiple “subtests.” In the limiting case of the sort of mental tests that Galton and Binet had pioneered, each test was typically composed of a single task. But if one goes back further to Fechner’s psychophysical experiments, tests of sensory discrimination were typically composed of multiple tasks in the form of subjects comparing different pairwise combinations of focal and reference stimuli. The same had been true of Spearman’s original mental tests of pitch, weight, and light discrimination—a child’s score was based on the performance across multiple tasks. We can express this basic idea as

x′_{ij} = Σ_{k=1}^{K} x′_{ijk},  (6.7)


where the subscript k indexes a unique subtest. This equation defines what it means to speak of any given test having a length of n as in the Spearman–Brown formula. The length n of any test can be increased by adding more subtests, whereby the smallest possible subtest would be a single task, question, or, in the parlance of psychometrics, a test item. If any test is composed of two or more unique subtests, it might be plausible to imagine that each one can be regarded as an interchangeable miniature version of the larger test. In this sense, each subtest might be regarded as the attempt to replicate the same measurement procedure by holding the temporal occasion constant but by allowing a different aspect of the measurement procedure, the item, to vary. This, naturally, changes what is meant by a measurement replication. The mathematical insight that led to the Spearman–Brown formula was ingenious. So long as the parts that make up a composite test score can be assumed to follow the same fundamental linear equation that defined the composite, namely, that x′_{ijk} = x_{ij} + δ_{ijk}, bringing along all the same assumptions about uncorrelated errors, then it will be the case that any time the length of a test is doubled, the variability in the total test score across test takers that is attributable to real differences should increase by twice as much as that which is attributable to measurement error.20 But what is a measurement error in this context? If a replication is now defined by varying k but holding i and j constant, then why is it that for the same person, we would not expect δ_{ij1} = δ_{ij2} = . . . = δ_{ijK} = 0? The answer, in my view, has to be that “error” now comes from the unique interaction of a person with each subtest as opposed to a different temporal occasion. It is the choice of item or items used to define each subtest that produces the “as if ” random event that will be to the benefit of some people and the detriment of others.
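The variance claim in the last paragraph can be illustrated with a small simulation (Python; the distributions and variances are hypothetical choices, not from the text). Each subtest score follows x′_ijk = x_ij + δ_ijk with independent errors, and summing n parts scales true-score variance by n² but error variance only by n, so doubling the length doubles the true-to-error variance ratio.

```python
import numpy as np

rng = np.random.default_rng(1)
n_people = 100_000
true = rng.normal(0, 1, n_people)  # x_ij: the stable "real" differences


def total_score(n_parts):
    """Sum of n_parts subtests, each following x'_ijk = x_ij + delta_ijk."""
    errors = rng.normal(0, 1, (n_people, n_parts))
    return n_parts * true + errors.sum(axis=1)


# Doubling the number of parts doubles the true-to-error variance ratio.
for n in (1, 2, 4):
    score = total_score(n)
    ratio = np.var(n * true) / np.var(score - n * true)
    print(n, round(ratio, 1))
```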
Brown applied Equation 6.4 by taking an existing test given on two occasions and then predicting how much higher the reliability—the correlation between test scores across occasions—would be if it were possible to combine test scores for each student across the two occasions. Importantly, however, the two shorter tests being combined were not identical, as each was composed of a slightly different underlying task. For example, Brown had administered a mental test in which students were asked to cross out designated letters in a passage of prose, but in doing so, he regarded the passage selected as irrelevant so long as it was approximately the same length with approximately the same opportunity to cross out designated letters. It seems clear that Brown regarded asking a student to complete the same task using two comparable passages as essentially the same measurement procedure. This was also the case for Spearman, who, in contrast to Brown, had worked backward in applying the formula, imagining an existing test that could be decomposed such that the score was expressible as a composite of two or more parts. As his example, Spearman described a test composed of a list of words for students to spell that could be split in half. Given a list of 20 words to spell, surely one could divide the list


into two parts that, at least from the perspective of the tester, would be regarded as interchangeable. But from the perspective of each child taking a test, one list may be easier to spell than the other for reasons that are unpredictable. In summary, Spearman’s preferred new approach for estimating a reliability coefficient was straightforward, and it appeared to circumvent the practical and conceptual challenge of finding and defining commensurate measurement occasions:

1. Divide a single test offered on a single occasion into two or more parts.
2. Compute the average correlation between these parts.
3. Use Equation 6.4 to predict what the correlation would have been if it had been possible to observe a replication of the single test before it was divided into parts.
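The three steps can be sketched directly on simulated data (Python; the sample size, item count, and variances are hypothetical). Stepping up the half-test correlation uses the prophecy formula (Eq. 6.5) with n = 2.

```python
import numpy as np

rng = np.random.default_rng(0)
n_people, n_items = 500, 20

# Simulated item scores: a stable person effect plus an independent
# person-by-item "error," per x'_ijk = x_ij + delta_ijk.
true = rng.normal(0, 1, n_people)
items = true[:, None] + rng.normal(0, 1, (n_people, n_items))

# Step 1: divide a single test from a single occasion into two parts.
half_a = items[:, 0::2].sum(axis=1)  # odd-numbered items
half_b = items[:, 1::2].sum(axis=1)  # even-numbered items

# Step 2: compute the correlation between the parts.
r_half = np.corrcoef(half_a, half_b)[0, 1]

# Step 3: step the half-test correlation up to the full-length test.
r_full = 2 * r_half / (1 + r_half)
print(round(r_half, 2), round(r_full, 2))
```

With these settings the split-half correlation comes out near its theoretical value of 100/110 ≈ .91, and the stepped-up estimate near .95.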

The conceptualization of a test as a composite made up of smaller subtests seems to provide a novel remedy to the replication problem in the context of measuring psychological attributes. That is, Spearman’s approach redefines the replication of a measurement procedure as one in which it is the instrument itself that is varied while the occasion of the procedure can be regarded as having been either completely fixed or defined at such a small grain size (e.g., the minutes it takes to complete a subtest) as to be considered negligible. As I have noted, this move has some interesting ramifications for the interpretation of measurement error. Before, when the occasion was varied and the procedure was fixed, measurement error stemmed from the interaction between a human cognitive process and temporal occasion, the notion being that as time varies, so does some aspect of the process that figures into a human judgment in answering the questions on a mental test. But if occasion is held constant and it is the test procedure that is varied, then measurement error is better understood as a consequence of the interaction between human judgment and the particular subsets of items a person sees on a test.

6.6 The Development of Classical Test Theory

Interestingly, Spearman showed little interest in elaborating this line of research. He viewed the two contributions that still bear his name today, the Spearman rank correlation coefficient21 and the Spearman–Brown formula, as “minor researches” (Spearman, 1930). The same can even be said for his disattenuation formula. After introducing it to the world in Spearman (1904b), applying it in Spearman (1904c), justifying it in Spearman (1907), and defending it in Spearman (1910), he seldom applied it, let alone reconsidered the plausibility of its assumptions, during the balance of his career. As far as Spearman seems to have been concerned, these methods had represented the initial path he had taken in pursuit of some network of laws that could be used to explain individual


differences in psychological attributes. The development and evaluation of such theories seemed to hinge on the computation of correlation coefficients collected from less-than-ideal conditions, and Spearman’s methods represented attempts to make adjustments to these coefficients that could account for small sample sizes, the unavailability of direct measurements, and measurement error. But what Spearman discovered in the process was that he did not actually need to disattenuate pairwise correlation coefficients one at a time to make the case that measures of discrimination and academic intelligence had something fundamental in common. For this, he soon concluded, it was not the magnitude of any pair of correlations that mattered, but the pattern of correlations across many different pairs of measures. The method that stemmed from this insight, factor analysis, and how Spearman used it to support a general theory of intelligence, will be the subject of the next chapter. I would argue that Spearman’s disattenuation formula has had two very lasting consequences for the study of individual differences in the human sciences, one that I think was intentional and one that was not. Spearman was intentionally making the point that there was almost always going to be a difference between numeric quantities that researchers claim to measure or attempt to compute and a true numeric quantity of interest. This difference in true and observed was somewhat loosely attributed to “errors in observation.” What Spearman demonstrated was that, with this in mind, a high-quality study design depended not just on procuring a suitably large sample size but also on the consistency of the underlying measurement procedure(s), and it was the measurement procedure that came first. That is, if the underlying procedure were unreliable, all observed correlations would be attenuated by an unknown amount, and this could not be salvaged by increasing the sample of individuals engaging in the procedure.
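Spearman’s correction itself is one line of arithmetic. Here is a minimal sketch (Python), using the prep-school numbers from note 9, where the adjusted value exceeds 1; direct computation gives roughly 1.05 against the 1.04 reported there, the small gap presumably being rounding in the original.

```python
import math


def disattenuate(r_xy, r_xx, r_yy):
    """Spearman's correction for attenuation: the correlation expected in
    the absence of measurement error, given reliabilities r_xx and r_yy."""
    return r_xy / math.sqrt(r_xx * r_yy)


# Numbers from note 9: observed r = .56, reliability estimates .40 and .71.
print(round(disattenuate(0.56, 0.40, 0.71), 2))
```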
Spearman offered a practical solution to the problem. If a reliability coefficient for the procedure was available, it would be possible to get a good estimate for the counterfactual quantity—the correlation we would have observed in the absence of measurement error.22 Unfortunately, independent replications of mental tests were, and still are, hard to find, leading Spearman to perform some remarkable gymnastics of reasoning to find suitable proxies for the initial application of his disattenuation formula. This challenge, and criticism from Karl Pearson and William Brown, led Spearman to introduce a method for estimating a reliability coefficient that no longer required the administration of an identical measurement procedure on two or more occasions. The basis for a reliability coefficient became the replication of test items in place of the replication of test occasions. This move has produced a disconnect between the way that the concept of reliability is conceptualized in everyday language and the way that it is most commonly estimated in the human sciences. In everyday language, when one speaks of a measurement procedure with low reliability, this is taken to mean that it produces inconsistent values for the same attribute of an object across different temporal


occasions of measurement. When presented with a low reliability coefficient for a test that has been designed to measure a human attribute, one would be tempted to give it the same interpretation as revealing a lack of consistency attributable to the occasion of measurement. To follow this logic, if we gave any person the test on a different occasion, we might see a very different score, and by extension, the ordering of people in a distribution of test scores would be expected to change considerably across occasions. In fact, all that we should infer from a low reliability coefficient is that there is a significant interaction between the choice of test items and the attribute of measurement. What would happen if we could give the exact same test with all conditions constant and only the temporal occasion changing? We really have little idea. It was left to others to pick up the thread of Spearman’s “minor researches” to establish a formal theory around reliability and measurement error and situate it within a theory of test response. These sorts of issues would come into greater focus as taken up by Kelley (1921, 1942), Thurstone (1931a), Guilford (1936), Kuder and Richardson (1937), Gulliksen (1950), Guttman (1945, 1953), Cronbach (1947, 1951), Tryon (1957), and Novick (1966), culminating in the comprehensive treatment of classical test theory in Statistical Theories of Mental Test Scores (Lord & Novick, 1968) and The Dependability of Behavioral Measurements (Cronbach, Gleser, Nanda, & Rajaratnam, 1972). It was in the latter case that Cronbach and colleagues can be credited with the important conceptual breakthrough of recognizing that the “error” of classical test theory depends on the design of a measurement procedure and the extent to which this design makes it possible to disentangle the contributions of different anticipated “facets” of error. In appreciation of this, Cronbach et al.
(1972) introduced the more encompassing terms generalizability and dependability (as well as their mathematical formulations) as the more appropriate considerations for “behavioral measurements” than the classical conception of reliability in the physical sciences. Ironically, however, it is “Cronbach’s alpha” that continues to be the most popular method of quantifying measurement error in the human sciences, even though what it actually quantifies and what most people think that it quantifies can be quite far apart.23
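Since “Cronbach’s alpha” is named here but never defined in this chapter, the sketch below (Python) shows the standard formula as it appears in the later literature (Cronbach, 1951); treat it as an imported assumption rather than something derived in the text, and the simulated data as purely hypothetical.

```python
import numpy as np


def cronbach_alpha(scores):
    """Coefficient alpha for a people-by-items matrix of item scores.
    Standard formula: (k / (k - 1)) * (1 - sum of item variances / variance
    of the total score)."""
    scores = np.asarray(scores, dtype=float)
    k = scores.shape[1]
    item_var_sum = scores.var(axis=0, ddof=1).sum()
    total_var = scores.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_var_sum / total_var)


# Simulated 10-item test: true score plus independent item error, so each
# pair of items correlates about .5 and the expected alpha matches the
# Spearman-Brown value 10 * .5 / (1 + 9 * .5) = .91.
rng = np.random.default_rng(0)
t = rng.normal(0, 1, 2000)
items = t[:, None] + rng.normal(0, 1, (2000, 10))
print(round(cronbach_alpha(items), 2))
```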

Notes

1 Also known as the “South African War,” spurred by an uprising in the British Empire’s colonial territories in southwestern Africa.
2 In Spearman’s autobiography of 1930, he puts the date at “about 1901,” but it was surely 1902. We know that Spearman returned to his graduate program in Leipzig in December 1902 after spending a few months in England. The Boer War officially ended in May 1902, and the dates for when Spearman administered his mental tests at his two principal sites are given as the fall of 1902 (Spearman, 1904c).
3 In the UK academic system, the hierarchy of academic titles proceeds from lecturer, senior lecturer, reader, and professor. This is in contrast to the U.S. progression of assistant professor, associate professor, professor, and distinguished professor. For Spearman to be hired as a reader on completing his PhD signaled the recognition he had already earned from his early publications while in Germany, most notably his seminal 1904 papers.
4 As Lovie and Lovie (1996) point out, attainment of a PSC designation was considered an advantageous route for military promotion and career advancement.
5 Spearman (1904c) writes, “As regards the delicate matter of estimating ‘Intelligence,’ the guiding principle has been not to make any a priori assumptions as to what kind of mental activity may thus be termed with greatest propriety. Provisionally, at any rate, the aim was empirically to examine all the various abilities having prima facie claims to such title, ascertaining their relations to one another and to other functions” (249–250).
6 In Spearman (1904c), he refers to the sample of 24 children from the village school as “Experimental Series I,” and the 22 boys from the preparatory school are themselves a subset of his “Experimental Series IV,” representing those students in the school who were taking a class in music. Spearman also computed correlations between measures of discrimination and intelligence for 21 of the remaining children in the village school (“Experimental Series II”) and for an overlapping sample from the preparatory school that he had tested on an earlier occasion (“Experimental Series III”). Spearman placed less stock in these results. One explicit reason was that he had experienced difficulties with the group administration of his mental tests and with the motivation of his subjects. Another may have been that the correlations from these samples provided the least support to this hypothesis.
7 It appears that Spearman had the two oldest children also include themselves in this task as there were no missing values in either set of 24 rankings that he ultimately published.
8 Spearman first used the term “reliability coefficient” in Spearman (1910). Interestingly, he uses this term not to denote the concept of reliability but to denote its estimation when a test composed of items can be split into two halves: “A very convenient conception is that of the ‘reliability coefficient’ of any system of measurements for any character. By this is meant the coefficient between one half and the other half of several measurements of the same thing” (Spearman, 1910, 281).
9 In his prep school sample, there was just one intended measure of discrimination (pitch), but Spearman argued that music rankings based on school examinations could be used as an alternative and took the correlation between pitch discrimination and music, r = .40, as an estimate for the reliability of discrimination measures. Meanwhile, the average intercorrelation among his school subject measures was .71. Therefore, Spearman reasoned, since the average correlation of pitch discrimination and music with subject rankings in the classics, math, English, and French was .56, the disattenuated correlation could be computed as r_{xy} = .56 / (√.40 √.71) = 1.04.
10 “Thus we arrive at the remarkable result that the common and essential element to the Intelligences wholly coincides with the common and essential element in the Sensory Functions” (Spearman, 1904a, 269).
11 The argument was not very compelling. For example, even if one were to apply the questionable estimates of reliability Spearman had used for his village school sample, the disattenuated correlations from Wissler’s study would have only increased from –.02 to –.05 and from .19 to .51.
12 Unfortunately, the proof as reproduced in Spearman’s Appendix E omitted the culminating lines. This was remedied in Brown and Thomson (1921/1940, 158–159), and I use that resource to fill in the missing blanks.


13 The idea and wording of randomness in the soul comes from Paul Holland, one of his many great sayings. See Holland (1990) for a great discussion and formalization of these issues.
14 According to Spearman, this example corresponded with an actual experiment he conducted on three different occasions, though this strikes me as far-fetched.
15 One of the things that is really interesting about Spearman’s example, although it remains unspoken, is that he was unwittingly identifying more than one source of error in the measurement procedure being described. To be sure, there is a source of error due to the two occasions of measurement (since each occasion triggers a unique random process in the shooter), but there is also a source of error due to the choice of shooter (since the variance in the random process can differ across shooters). The two reliability coefficients he defines in his example are specific to the first source of error and implicitly hold the second one constant. What he is able to do is use these coefficients to disattenuate the correlation between the two shooters; what remains unknown is the average measurement error that one might expect when both the occasion and the shooter are free to vary. This was an issue that had been first anticipated by Edgeworth (1888), but it would take another 50 years to begin to disentangle with the advent of generalizability theory.
16 See Pearson (1904, 1907) and Spearman (1910).
17 Brown had used the results from his own experiments to demonstrate that Spearman’s assumption of uncorrelated errors across mental tests could be falsified empirically. For example, as described in Brown (1911), he computed the correlation between the difference in scores across occasions for each measure. If errors in measurement were uncorrelated, then r_{X1−X2, Y1−Y2} = Σ((x_1 − x_2)(y_1 − y_2)) / √(Σ(x_1 − x_2)² Σ(y_1 − y_2)²) = 0. Brown provided two instances when the correlation was significantly different from 0. But this can hardly have been considered surprising, as an occasion-specific correlation across distinct variables had been baked into the design. Brown’s data had come from two sorts of mental tests he had given to samples of 43 adults and 23 school-aged children across two days. The adults had been scored twice for their ability to accurately bisect a series of lines (test x) and then trisect the same series of lines (test y). The children had been scored for their speed (test x) and accuracy (test y) in solving a series of numeric addition problems. Brown’s sample of adults was bisecting and trisecting the same set of lines on two common occasions, and his sample of children was solving the same set of addition problems on a common occasion (i.e., in each case, x and y took place on the same day).
18 Borsboom (2005) has taken this a step further to argue that a replication under the classical model for reliability also requires the de facto assumption of a trip back in time with a time machine, but I disagree, because it is, in fact, the unique temporal occasion that induces the variability we wish to designate as error. If a person were to be brainwashed and sent back to the same moment in time to repeat the procedure, we would expect the exact same outcome.
19 Indeed, even if it were possible, with the benefit of brainwashing, to observe legitimate measurement replications, it is not clear that we should always expect to observe some amount of measurement error. For example, perhaps the process of storing and then quickly retrieving new information (as in the word recall example) is variable (and in this sense “error-prone”), but other cognitive processes are not. If the same person were asked to solve the multiplication problem of 6 × 7 or to identify the capital city of Colorado, would our replications return constant values across replications, since the person presumably either knows the answer to these questions or does not? Hence, the interaction between an occasion of measurement and human cognitive judgments may be straightforward in some contexts and more complicated and opaque in others.
20 The proof of this insight and how it leads to the Spearman–Brown formula was expressed most elegantly by Brown in a footnote in his 1910 article, and with somewhat more cumbersome algebra in appendix B of Spearman’s 1910 article. A seminal treatment in the language of random variables can be found in Lord and Novick (1968).
21 In his 1904 addendum, Pearson had criticized Spearman’s computation of product-moment correlation coefficients in his 1904 studies using ranks, and this seems to have led Spearman to propose an alternative formulation appropriate for ranks. Spearman (1906) referred to this formulation as his “footrule” (i.e., an approximation when a more precise measuring procedure is unavailable).
22 The gist of Spearman’s larger argument, that correlations from suboptimal designs will often need to be remedied by statistical adjustment, was one that had already been advanced by Karl Pearson and Udney Yule in the context of computing partial correlations, but it was Spearman who recognized “errors of observation” themselves as a nonnegligible source of confounding.
23 Cronbach himself was one of the first to appreciate this and point it out (Cronbach, 1947). In more modern times, see Brennan (2001a, 2001b).

7 MEASUREMENT THROUGH CORRELATION

Spearman’s Theory of Two Factors

7.1 Formalization of the Theory of Two Factors

Spearman’s (1904c) hypothesis that all human cognition has something in common is stated explicitly in the concluding sections of ‘General Intelligence,’ Objectively Determined and Measured (hereafter General Intelligence) when he first writes:

Whenever branches of intellectual activity are at all dissimilar, then their correlations with one another appear wholly due to their being variously saturated with some common fundamental Function (or group of Functions).1 (273)

This was the genesis of Spearman’s theory of two factors. Figure 7.1 presents a visualization of the theory in the context of the data he had collected from his preparatory school sample of 22 boys. As described in the previous chapter (Section 6.3), these boys had been given a mental test of pitch discrimination on a single occasion and had also been ranked on the basis of their performance on examinations in the subjects of the classics, French, English, math, and music on three occasions. These six measures are represented by the boxes in Figure 7.1. Notice that there are always two arrows pointing toward each of these measures, one that comes from a common source and another from a specific source, where each source of an arrow is represented by a circle. The arrows that point to each measure are meant to indicate the relative effect of a common and specific cause, with one significant complication: neither cause is directly observable. Spearman’s expression of this theory as a mathematical model

DOI: 10.1201/9780429275326-7


[FIGURE 7.1 A Visual Representation of Spearman’s Two-Factor Theory.]

did not appear in General Intelligence, but in subsequent publications he would write it as

m_{ai} = r_{ag} g_i + r_{a s_a} s_{ai}.  (7.1)

In this expression, m_{ai} represents a measure of “ability” a for individual i. The choice of terminology, as we will see, is a bit thorny, but we can define ability in this context as some cognitive attribute of a person that has been, often rather loosely, associated with a designated measurement procedure, for example, a mental test. Spearman regarded the labels given to different mental tests as little more than a shorthand summary. One might refer to a test of pitch discrimination as a measure of a person’s ability to discriminate between sounds; similarly, a mathematics test might be described as a measure of a person’s ability to solve arithmetic problems, and so on. But, as Spearman (1904c) wrote,

while the structure of language necessitates the continued use of such terms as Discrimination, Faculty, Intelligence, etc., these words must be understood as implying nothing more than a bare unequivocal indication of the factual conditions of experiment. For the moment we are only inquiring how closely the values gained in the several different series coincide with one another, and all our corrections are intended to introduce greater accuracy, not fuller connotation; the subjective problems are wholly reserved for later investigation. (257–258)


Here, Spearman is distinguishing between observable measures that represent the “factual conditions of an experiment” and some set of underlying variables “whose real nature is concealed from view.” This is formalized in Equation 7.1, where the observed measure of ability, m_{ai}, is expressed as the sum of two unobservable variables: g_i, a continuous quantity that does not depend on the choice of tasks within a mental test, and s_{ai}, a different continuous quantity that does.2 The contribution of each of these variables is not necessarily the same but can be inferred from the magnitudes of the correlation coefficients r_{ag} and r_{a s_a}. Importantly, in Spearman’s theory, we see a formalized distinction between a manifest variable, m_{ai}, and two underlying latent variables, g_i and s_{ai}. These latent variables were referred to as “factors.” Equation 7.1 is more complicated than it may appear at first glance, as it represents a system of equations. For example, for each boy in the prep school sample (i = 1, . . . , 22), we have six equations, one for each measure Spearman had collected:

m_{1i} = r_{1g} g_i + r_{1 s_1} s_{1i}
m_{2i} = r_{2g} g_i + r_{2 s_2} s_{2i}
. . .
m_{6i} = r_{6g} g_i + r_{6 s_6} s_{6i}.

And across all boys, we have a total system of 132 equations. In parallel to the assumptions implicit to Spearman’s approach to disattenuating a correlation coefficient (considered in detail in the last chapter), it is assumed that within each equation, g is uncorrelated with s, and that across each equation, the s’s are also uncorrelated. In other words, an individual with a high value of g is no more likely to have a high value of s for pitch discrimination than an individual with a low value of g, and having a high value of s for pitch discrimination is not predictive of having a high value of s for mathematics.
This second assumption, which implies that specific factors are uncorrelated, led Spearman to include the proviso that his theory would only apply to abilities that were sufficiently “dissimilar” (Spearman, 1904c, 273), which had implications for the design of subsequent studies in that mental tests needed to be purposely written so that they would require a unique specific ability. A last, but critically important, assumption hidden from Equation 7.1 is that the sample of individuals needs to be similar with respect to obvious confounding explanations for differences in performance due to age and training (where training could include both education and practice).3 The use of the model required estimates of r_{ag}, the correlations of each test with g. These could be found through iterative application of the general formula Yule had developed for the computation of a partial correlation:

r_{xy|z} = (r_{xy} − r_{xz} r_{yz}) / (√(1 − r²_{xz}) √(1 − r²_{yz})),  (7.2)


where r_{xy|z} is the correlation between any two variables x and y after the influence of variable z has been removed. Now let x = m_1, y = m_2, and z = g. By definition, if the two-factor theory is true, then after the influence of g has been removed, all that is left are the specific factors, and these are assumed to be uncorrelated, so r_{m1 m2|g} = 0. This, in turn, implies an identity that remains at the heart of factor analysis, namely,

r_{m1 m2} = r_{m1 g} r_{m2 g},  (7.3)

which indicates that the observed correlation between any two measures is the product of the respective saturations (in factor analytic parlance, loadings)4 of each measure with a common factor. With estimates of r_{ag} in hand, Spearman proposed that actual measures of g_i and s_{ai} could be computed as

ĝ_i = r̂_{ag} m_{ai},  (7.4)

ŝ_{ai} = √(1 − r̂²_{ag}) m_{ai},  (7.5)

and that each of these quantities came with a standard error of measurement (a “probable error”) equal to .6745 √(1 − r²_{ag}) and .6745 r_{ag}, respectively.

7.2 Method of Corroborating the Theory

The linear model at the heart of Spearman’s two-factor theory can be regarded as a generalization of the linear model found in Yule’s proof of Spearman’s disattenuation formula, or the model formalized in Yule’s proof can be regarded as a simplifed version of the two-factor theory. In Yule’s proof, there are two parallel equations for measures of two diferent variables, x and y: xi′j = xi + δi j , yi′j = yi + εi j , where i indexes a person and j indexes a unique occasion. But we can also express these two equations in terms of Spearman’s two-factor theory as xi′j = rx g g i + rx sx sxi j , yi′j = ry g g i + ry sy s yi j . Written in this way, we see that a person’s observed measure on test x or y is the sum of a fxed value that depends on the latent variable gi and a random value that depends on the latent variable sxi j or s yi j . It was these test-specifc


factors that gave Spearman the substantive explanation he needed for the observation that different pairs of mental tests could have different correlations, even if performance on both tests could be attributed to a common mental factor. When conceptualized according to the two-factor theory, the "error" that arises from using x′_{ij} or y′_{ij} as a measure of g_i comes from the interaction of an individual with two different features of the procedure: (1) the particular temporal occasion of the test and (2) the ability of an individual that is specific to each test. Say we have tests x and y. Now, if the assumptions of this model hold, and if the appropriate reliability coefficients, r_{xx} and r_{yy}, were known, then in principle, one would expect to find that the disattenuated correlation between scores on tests x and y would be perfect. This was, in effect, the empirical strategy Spearman had taken in his initial presentation of findings in General Intelligence (see Spearman, 1904c, 256–272). A problem with this as an empirical corroboration of the two-factor theory, apart from critical questions that could be raised about Spearman's execution of it, was that it hinged on the availability of defensible estimates of the pertinent reliability coefficients. These estimates could be difficult to procure for the theoretical and practical reasons considered in the previous chapter. What Spearman seems to have realized while looking over the intercorrelations he had tabulated for the measures available from his prep school sample was that if there really was a common intellective function, this would become evident in the pattern of observed intercorrelations, irrespective of the reliability of the measures themselves. That is, if each measure was a function of two independent factors, one of which was common across all the tests, the resulting correlation matrix should have two features.
First, all the correlation coefficients in the matrix should be positive (a feature that became known in research on mental testing as the presence of a "positive manifold"). Second, and more important, it must be possible to order the correlation matrix in a way that demonstrates the presence of a hierarchical relationship. To see how this works, we turn to a correlation matrix that Spearman used on two different occasions to illustrate the concept, first in Spearman (1914b) and then in his book The Abilities of Man: Their Nature and Measurement (Spearman, 1927a).5 Table 7.1 shows the intercorrelations between five measures of cognitive ability that Spearman reproduced from a study by Bonser (1910). The matrix of correlations has been ordered so that measures from top to bottom and left to right are those with the highest average correlations with the other measures (see bottom row). Among the five measures, mathematical judgment has the largest average correlation with the other measures, followed by controlled association, literary interpretation, selective judgment, and spelling. All the correlations are positive and more than twice their standard errors (the latter are not shown). Next, notice that if we pick any column (or row), the correlations tend to decrease in size when going from top to bottom (or from left to right). For

TABLE 7.1 An Example of a Hierarchical Correlation Matrix

                               1      2      3      4      5
 1. Mathematical Judgment      –     .485   .400   .397   .295
 2. Controlled Association    .485    –     .397   .397   .247
 3. Literary Interpretation   .400   .397    –     .335   .275
 4. Selective Judgment        .397   .397   .335    –     .195
 5. Spelling                  .295   .247   .275   .195    –
 Average                      .394   .382   .352   .331   .253

Source: Spearman (1914a, 1927a), Appendix, p. xii.

example, the correlation of mathematical judgment with the other four measures has the pattern .485, .400, .397, .295. Most relevant to the concept of the full matrix showing hierarchical order, if we pick any two columns and then examine the correlation coefficients for the same two rows such that we have a rectangular array, the correlations in each column should decrease by roughly the same proportion. For example, r(2,1)/r(3,1) = .485/.400 = 1.213, and r(2,4)/r(3,4) = .397/.335 = 1.185. Using Spearman's original notation, the generalization of this rule is

\[
\frac{r_{ap}}{r_{aq}} = \frac{r_{bp}}{r_{bq}} \quad \text{or, equivalently,} \quad r_{ap}\, r_{bq} - r_{bp}\, r_{aq} = 0.
\tag{7.6}
\]

In Equation 7.6, the subscripts a, b, p, and q correspond to unique numeric indices to give us the flexibility to express the relationships in terms of any desired group of four correlations so long as there are always two columns and rows in common. The differences being evaluated in Equation 7.6 became known as tetrad ("group of four") differences, and one way to prove that a correlation matrix satisfies a hierarchical structure was to compute all possible tetrad differences6 and show that the resulting distribution was composed of values tightly distributed around 0. In this single example from the Bonser data, the tetrad difference would be .485 × .335 − .397 × .400 = .0037. Why should one expect to find the proportional relationship among correlation coefficients shown in Equation 7.6 if the two-factor theory is true? Mathematically, it follows once again as an implication from Yule's formula for a partial correlation, shown earlier in Equation 7.2. The intuition is that the ordering of the five rows and columns in a correlation matrix is an indication


of the degree to which each of the five measures is saturated with g (where saturation depends on the size of the weight r_{ag}). For example, because correlations with mathematical judgment tend to be larger than those with selective judgment, if the only thing common to the two measures is g, it follows that mathematical judgment must be more saturated with g than selective judgment. However, once we hold the saturation of g in any single measure constant (by examining the pattern in correlations within the same column or row), the most plausible explanation for a change in correlation is the varying saturation of the remaining measures with g. The proportional change in correlation across cells should be the same, irrespective of what column (or row) we examine.

In summary, the crux of Spearman's preferred approach to testing the two-factor theory involved comparing the structure of a correlation matrix that was predicted by his theory to the structure of the matrix that was observed for any particular sample of individuals taking any particular collection of tests. More specifically, the empirical question that needed to be settled to corroborate the theory was whether the observed tetrad differences were small enough to be explained away as sampling error.
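The tetrad computation is easy to reproduce. The sketch below, my own illustration rather than Spearman's procedure, forms every tetrad difference from the Table 7.1 correlations over groups of four distinct tests and confirms that all of them are small, with the worked example .485 × .335 − .397 × .400 ≈ .0037 among them.

```python
import itertools

# Intercorrelations from Table 7.1 (Bonser data), with 1.0 on the diagonal.
R = [
    [1.000, 0.485, 0.400, 0.397, 0.295],  # 1. mathematical judgment
    [0.485, 1.000, 0.397, 0.397, 0.247],  # 2. controlled association
    [0.400, 0.397, 1.000, 0.335, 0.275],  # 3. literary interpretation
    [0.397, 0.397, 0.335, 1.000, 0.195],  # 4. selective judgment
    [0.295, 0.247, 0.275, 0.195, 1.000],  # 5. spelling
]

def tetrad_differences(R):
    """All tetrad differences r_ap*r_bq - r_bp*r_aq (Equation 7.6) over
    four distinct tests, so no diagonal entry is ever used."""
    n = len(R)
    diffs = []
    for a, b in itertools.combinations(range(n), 2):      # two rows
        for p, q in itertools.combinations(range(n), 2):  # two columns
            if {a, b} & {p, q}:
                continue
            diffs.append(R[a][p] * R[b][q] - R[b][p] * R[a][q])
    return diffs

diffs = tetrad_differences(R)
# Positive manifold: every entry is positive.
print(all(R[i][j] > 0 for i in range(5) for j in range(5)))
# Largest tetrad difference in absolute value, about .035.
print(max(abs(d) for d in diffs))
```

Whether values of this size are "tightly distributed around 0" is exactly the sampling-error question the standard-error methods discussed next were meant to settle.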
From 1904 through 1927, Spearman proposed three different methods that could be applied to evaluate the statistical significance of these differences (Hart & Spearman, 1912; Garnett, 1919; Spearman & Holzinger, 1924, 1925, 1930). The best of these came from a method for estimating a standard error of a tetrad difference that was based on a collaboration with his student, Karl Holzinger.7 In the context of the Bonser correlation matrix of Table 7.1, Spearman (1927a) compared the distribution of observed tetrad differences to the standard errors associated with each of these differences, and on these grounds concluded that the differences were statistically insignificant. This was taken as evidence corroborating the two-factor theory. Importantly, this method of testing the two-factor theory did not require any adjustment to observed correlations for attenuation due to measurement error, removing the need to estimate reliability coefficients for each variable.8

7.3 Building a Model of Human Cognition

In the roughly 20-year period following the publication of General Intelligence and his academic appointment to University College London, Spearman coordinated two complementary lines of research with the help of students and other collaborators in his psychological laboratory. The first of these was oriented toward the pursuit of additional data (whether retrospective or prospective) that could be used to establish the universality of the two-factor theory. The fruits of this research culminated in the publication of the 1927 book The Abilities of Man: Their Nature and Measurement. It was in this book that Spearman believed he had made the most compelling case for the validity of the two-factor theory. In the second line of research, Spearman attempted to distinguish and delineate the mental processes that are at the core of all cognitive activities, and hence


fundamental to any definition of human intelligence. The results of this work appeared in two different places: first in the 1923 book The Nature of "Intelligence" and the Principles of Cognition and then, with some modifications and refinements, as part of The Abilities of Man. We will spend some time considering the initial model of human cognition Spearman developed during this period. The model was fairly rough around the edges and largely speculative.9 Nonetheless, the model Spearman presented provides a context for the two-factor theory that is often missing from historical accounts, and it helps underscore two distinguishing features of Spearman's approach to the study of quantitative psychology. First and foremost, Spearman's approach was driven by the desire to discover universal laws about human psychology. This, rather than the practical import of the uses to which intelligence tests could be put, was Spearman's primary career-long motivation. For Spearman, the methods were always in service of the theory (Cattell, 1945; Burt, 1946), and it is surely no coincidence that he never wrote a textbook on the methods of quantitative psychology, even though he is rightfully considered the originating pioneer of many of these methods.10 In contrast, while contemporaries such as Edward Thorndike, Godfrey Thomson, and Louis Thurstone were by no means atheoretical and were also committed to the exploration and discovery of psychological "laws," they were equally interested in the generalization of the correlational statistical methods they were introducing and the many ways that these methods could be employed to solve practical problems in the human sciences. Second, Spearman was what we might today describe as a scientific realist, in that he believed that the g and s factors existed independent of the attempts to measure them through mental testing.
To this end, Spearman's model of human cognition shows his willingness to engage in metaphysical speculation about components of cognition that he thought of as complex interactions of neuronal activities in the cerebrum and cortex of the human brain. In this Spearman could not have been more different from Karl Pearson, who had no tolerance of metaphysics and believed that the "grammar of science" could only be expressed by mathematizing physical, biological, and behavioral relationships that were directly observable (Pearson, 1892). Spearman, in contrast, believed that quantitative psychology should strive to be explanatory, and not just descriptive. Spearman's first complete attempt to envision the components that could collectively explain what is meant by the term intelligence and why individuals differ with respect to it can be found in his 1923 book, The Nature of "Intelligence" and the Principles of Cognition. The individual components were composed of three "qualitative laws" and five "quantitative laws." These "laws" are probably best understood as the hypothetical primitives that would comprise a larger meta-theory of cognition, and in fact, Spearman often referred to them interchangeably not as laws, but as "principles." Still, his use of the term law was a signal of his larger ambitions for psychology to emulate the methods and progress of physics. Figure 7.2

FIGURE 7.2 Spearman's 1923 Model of "Noegenetic" Human Cognition.

represents my visual representation of the qualitative and quantitative laws as first proposed in Spearman's book. Before we consider each in turn, notice that there are relatively few directional connective arrows drawn in between the components. This is meant to reflect the fact that, on one hand, what Spearman was proposing represented a fairly tentative model of human cognition, while on the other hand, he was staking the claim that cognition involved some sort of interaction between these unique components and not others.

The three qualitative laws depicted within the rectangular box at the center of Figure 7.2 were the core processes by which humans come to know about the world around them, as well as the socially constructed rules that define their place in the world. Spearman coined the terms noegenesis and noegenetic to describe the three laws collectively, taking the noe from the Greek word νοῦς (nous), indicating that "processes of this form alone produce knowledge," and appending genesis to indicate that "such processes alone generate new cognitive content." Spearman's three noegenetic laws, in order of implied cognitive complexity, were, respectively, the apprehension of experience, the eduction of relations, and the eduction of correlates. Using more modern terminology, we might prefer to describe these laws respectively as the overlapping processes of encoding, inferring, and applying (Sternberg, 1977).

What Spearman termed "apprehension of experience" is what we might think of as the simplest form of metacognition. It consisted of the ability to take in a "lived experience" (1923a, 48) and then observe and be aware of the way that one's mind is reacting to that experience:

He not only feels, but he also knows what he feels; he not only strives, but knows that he strives; he not only knows, but knows what he knows. (1927a, 164)


Through the apprehension of experience, a person comes to recognize and identify what, at least superficially, appear to be unique features or ideas about the natural and social world. These unique ideas become fundaments, and initial attempts to turn these fundaments into knowledge occur through the second law, the "eduction of relations." An eduction of relations takes place when a person has discerned two (or more) fundaments and attempts to discover essential relations that connect them. This is illustrated in Figure 7.2 by the presence of two observed fundaments (f_1 and f_2, solid rectangles) with the unknown relation between them (r in the dotted circle) needing to be inferred. Spearman defined a relation as any attribute which mediates between two or more fundaments. For example, f_1 and f_2 might be two different glasses filled with beer, and the relation to be inferred is a similarity or difference with respect to an attribute such as weight, temperature, volume, bitterness of taste, and the like. Once one or more relations have been inferred, the relations themselves now become new fundaments, and then a relation between relations can be inferred, and so on, in ever-increasing hierarchies of complexity. The third law, "eduction of correlates," represents something of a twist on the second law in that this represents a scenario whereby a person infers a new fundament after coming to grips with one or more fundaments and a relation. Spearman provided the example of a person with some musical training who hears a note (the given fundament) and is then asked to imagine a new note (new fundament to be inferred) that is a fifth higher (the given relation). Importantly, while the eduction of relations was premised on having previous exposure and experience with the fundaments to be related, the eduction of correlates explained how it would be possible to generate new fundaments.
The three noegenetic laws that were at the center of the model make clear that in Spearman's view, the defining feature of human intelligence was the cognitive process of generating knowledge about the world, and this generative process always involved some combination of internalizing external stimuli and experiences, forming and testing out hypotheses about their relations, and generalizing relations to new situations. Spearman appears to have come to this view through some combination of synthesizing the work of previous philosophers and psychologists and making his own armchair speculations. Did he also tailor his theory to match some of the distinguishing features of the various tasks found on existing intelligence tests or vice versa? It is hard to miss the fact that two of the defining classes of tasks found on intelligence test batteries of the early to mid-20th century were those that attempted to discriminate between a person's ability to educe relations or correlates. For example, a person could be given the analogy "light is to dark as happiness is to sadness" and then be asked to provide (or select from a set of options) the appropriate common relation between each of the two fundaments. Or a person might be given a different variant of the task: dark is the opposite of light, what is the opposite of happiness? The first task variant requires the eduction of a relation; the second requires an eduction of a correlate.


The three qualitative noegenetic laws in Spearman's model were thought to interact to an unknown extent with five quantitative laws: the law of mental energy, the law of retentiveness, the law of fatigue, the law of conation, and the law of primordial potencies. It was, to be blunt, a minefield of psychological jargon. The law of mental energy proposed that the amount of energy available for the mind to engage in noegenesis was always constant: "Every mind tends to keep its total simultaneous cognitive output constant in quantity, however varying in quality" (Spearman, 1923a, 131). It follows from this that one cannot simultaneously focus on the apprehension of experience and the eduction of relations with the same amount of energy; instead, just as in physics, energy can manifest itself in different forms (e.g., each of the three cognitive processes), and energy can be transformed or transferred from one cognitive process to the other. The law of retentivity held that "the occurrence of any cognitive event produces a tendency for it to occur afterwards." Spearman characterized retentivity as potentially playing both positive and negative roles in cognitive activities (and by the time Abilities of Man was published he had split these two roles into distinct laws, "dispositions" and "perseveration"). On the positive side was the possibility that the more one was immersed in varieties of cognitive activities, the more that one was likely to develop a disposition to seek out such activities. So, in a best case, retentivity could produce a proclivity to learn for the sake of learning. In a worst case, Spearman identified the problem of inertia as a cognitive event that "always begins or ends much more gradually than its apparent cause" (1923a, 133).
The idea here was that a past experience could produce such a strong impression that it could lead to perseveration, which would either interfere with a person's ability to treat it as a fundament when educing relations or correlates or be put to the side when apprehending a new experience. The law of fatigue was cast as a counterbalance to the law of retentivity in that the same cognitive activity can only be engaged for a limited time before it ceases. Just as the body will eventually tire when asked to perform the same repeated activity, so will the mind. The last two components in Figure 7.2 are composed of Spearman's law of conation and law of primordial potencies. The terms conation and conative are rarely used anymore in modern psychology, having been mostly replaced by the term behavioral. At the time of Spearman's writing, a prevalent theory of mind was that mental attributes could be characterized as primarily cognitive, affective, or conative. The cognitive and affective attributes produce ideas and feelings, but it is the conative attributes that lead a person to act on these ideas and feelings. In the context of Spearman's model of human cognition, conation can be defined as the purpose, desire, or will to perform an action. In Spearman's words, "the intensity of cognition can be controlled by conation" (1923a, 135), an acknowledgment that personal attributes such as self-control and motivation


can serve to regulate the performance of cognitive activities. Finally, we turn to the ill-named law of "primordial potencies." What Spearman meant by this was that all individuals possess specific abilities that can cause, for example, the same person to struggle with the eduction of relations and correlates in one context and excel in the other. In summary, although Spearman is most closely associated with a theory of general intelligence, this theory was situated within a larger model, and that model contained hypotheses about a wide range of interacting human abilities, hypotheses Spearman was constantly revisiting and refining over the course of his career. Interestingly, it was the qualitative noegenetic laws that Spearman placed at the center of human cognition, since he believed these were the processes invoked whenever a person attempts to make sense of and generalize from the experiences and phenomena they encounter in their daily activities. But Spearman believed these processes were mediated by other laws that he conceptualized in terms of quantitative variables: mental energy, retentivity, fatigue, conative factors, and specific abilities. It seems important to point out what has been omitted from Spearman's model: the ability to recall and retrieve acquired knowledge from memory. Spearman distinguished this as part of an anoegenetic process that represented an important psychological attribute but that Spearman wanted to separate from discussions of intelligence.11

7.4 The Interpretation of g

So where exactly did the two-factor theory Spearman had initially proposed in General Intelligence sit within the larger system of laws he was formalizing in his 1923 and 1927 books? Spearman equated his general and specific factors (g and s) with the quantitative laws of mental energy and primordial potencies, respectively. Spearman had first proposed the interpretation of g as mental energy in 1912 (Hart & Spearman, 1912), and it was an interpretation he maintained throughout his career. He proposed it first as a literal possibility, and he seems to have never really wavered from this view. Spearman believed that it was the activity of neural structures in the brain that caused measurable individual differences in g, that the source of such activities could be found in the cerebral cortex, and that it might, in the future, be possible to detect physiological evidence of such activities. He envisioned g as the fuel of cognitive activities; s as the differentiated engines specific to the generation of knowledge for a given domain, topic, or context; and conation as the engineer. Hence, people with the same level of g could engage in some noegenetic processes more efficiently than others because of differences in their neural engines and conative powers. Figure 7.3 depicts a rather bare-bones model of the workings of g and s in the mind.

FIGURE 7.3 Spearman's Conception of Mental Energy.

Spearman would originally explain it as follows:

The whole area represents the cerebral cortex, whilst the shaded patch is some special group of neurons (for convenience of the figure, taken as collected in one neighborhood). The arrow heads indicate the lines of force coming from the whole cortex. In this manner, successful action would always depend, partly on the potential of energy developed in the whole cortex, and partly on the efficiency of the specific group of neurons involved. The relative influences of these two factors could vary greatly according to the kind of operation; some kinds would depend more on the potential of the energy, others more on the efficiency of the engine. . . . (Spearman, 1923a, 6)

As Spearman's realist interpretation of g became one reason others would use for rejecting the two-factor theory, he was also willing to grant it a more metaphorical interpretation (1927a):

[T]he hypothesis of mental energy and engines would seem to fit the facts as a glove does a hand. Should, however, any one pedantically still reject the energy on the ground of its being hypothetical, he can salve his conscience by only saying that the mental phenomena behave "as if" such an energy existed. (127)


Although g is by now laden with historical baggage and has come to be viewed as synonymous with the measurement of intelligence, Spearman typically sought to distance g from lay conceptions of intelligence. A noteworthy fact about the title of Spearman’s famous 1904 paper introducing the two-factor theory, one that is easy to overlook, is that there are quotation marks around the words General Intelligence. This was not an isolated occurrence and represented a strategy he frequently took when using the term in subsequent publications. To Spearman, the word intelligence, used in isolation, had a dubious meaning, since it had long since become part of the common vernacular, little more than a synonym for the appraisal people make of one another when they come into contact: A person who gives the impression of great intelligence is “clever,” “bright,” or “sharp.” A person who does not is “stupid” or “dull.” But the grounds for such appraisals are usually both unspoken and biased, and Spearman viewed g as, at the very least, a more objective alternative. By the end of his career, Spearman eventually came to the conclusion that what led people to conclude that one person was “sharp” while another was “dull” was primarily the speed and accuracy with which they solved problems and made judgments (Spearman & Jones, 1950) and that this speed and accuracy was primarily a product of mental energy, measurable as g. But in his own writings, Spearman favored the term ability in place of intelligence, perhaps because ability was, at least in the early 20th century, less culturally loaded than intelligence.

7.5 The Utility of the Two-Factor Theory

Another reason why Spearman sought to disassociate g from the concept of intelligence was his annoyance at what he took to be the contradictory and atheoretical practices of intelligence testing.12 The industry of intelligence testing that emerged on the heels of the successful administration of the Army Alpha and Beta group tests in the United States had run well ahead of the theory and experimentation that had marked Alfred Binet's initial development, and among the many psychologists involved in the design, administration, and analysis of these tests, there was little consensus as to the defining attributes that distinguished gradations of intelligence (Thorndike et al., 1927). If two tests of intelligence differed in the tasks devoted to some list of prospective attributes, could they each be declared valid tests of "intelligence"? In his criticisms along these lines, Spearman was already anticipating the problem inherent to a theory of measurement premised on operationalism (Bridgman, 1927; to be discussed in more detail in Chapter 10). If intelligence is no more or less than what intelligence tests claim to measure, then all measures of intelligence are subjectively defined by those who design the tests. The problem with such a state of affairs might seem


obvious, but it was Spearman who expressed it most forcefully. If the vision for such tests was to use them to place children and adults alike onto educational and occupational tracks that would come to define their future livelihood and well-being, a case should be built for their validity that went beyond the practical justification that the tests are "useful" for these purposes because they have a nonzero correlation with some criterion measure of interest. If any collection of tests could be decomposed into one general and multiple specific factors, this would seem to provide the common ground needed to settle debates about the proper design of intelligence tests. In keeping with Binet's original approach, these tests had been designed according to what Spearman described as a "hotchpot" principle: a memorization task here, a verbal reasoning task there, followed by a task of sensory discrimination, and so on. If the two-factor theory were true, then to the extent that such tests allowed for accurate measurements of "intelligence," it was primarily a matter of good luck rather than good design. If a single numeric measure to be interpreted as intelligence were being determined by the proportion of tasks answered correctly and if the number of unique tasks were large, the role of specific ability factors would tend to cancel, leaving the tester with an adequate estimate of the general ability factor. But surely a better design in this context would be one that purposely selected tasks that were known to be most predictive of g (i.e., have the strongest factor loadings), and g could be estimated more accurately as a weighted composite of task performance. In contrast, for other tests, a preferable design might focus on one or more specific abilities that have been the object of instruction and practice. In this case, evidence that the test in question had a strong saturation with g would suggest that the test was invalid.
So, if the two-factor theory could be corroborated, Spearman believed it held great promise for a more principled approach to the design of mental tests. By 1912, much like Galton, Spearman believed that mental testing could form the backbone of an efficient and meritocratic educational system:

So many possibilities suggest themselves, that it is difficult to speak freely without seeming extravagant. The application to psychiatry will be discussed in a separate paper, which we hope to publish shortly. Another obvious one is to relieve the overburdened institution of examinations. These have various natural purposes—the ascertainment [of] whether schools are being conducted adequately, whether pupils are working diligently, whether they are acquiring the instruction needed for particular progressions—which are admittedly incompatible with the further function of measuring individual differences of ability. The backbone of this ability, the General Factor, the intellective energy, can be disentangled from all such irrelevant matters and submitted to precise experimental


determination on its own account. This determination is becoming so easy, that it might well be carried out regularly. It seems even possible to anticipate the day when there will be yearly official registration of the 'intellective index,' as we will call it, for every child throughout the kingdom. (Hart & Spearman, 1912, 78)

Indeed, already convinced that the two-factor theory was true, Spearman had shown an inclination to follow Galton's footsteps in other ways as well:

Still wider—though doubtless, dimmer—are the vistas opened up as to the possible consequences in adult life. It seems not altogether chimaeric to look forward to the time when citizens, instead of choosing their career at almost blind hazard, will undertake just the professions really suited to their capacities. One can even conceive the establishment of a minimum index to qualify for parliamentary vote, and, above all, for the right to have offspring. (Hart & Spearman, 1912, 78–79)

Like most of his British contemporaries in the professional middle classes, Spearman was a eugenicist; both he and his wife joined the Eugenics Education Society prior to the onset of World War I (MacKenzie, 1976; Norton, 1979). But aside from the last sentence of the preceding passage, which was itself coauthored, there are few if any allusions to eugenics in Spearman's writing, and it seems Spearman was a eugenicist more by culture than by conviction. To be sure, Spearman believed that g was an innate individual characteristic that was inherited and insensitive to environmental stimuli; but he also believed that specific abilities, although also heritable, could be greatly shaped by experience and instruction, as could the other abilities implicit in his model of human cognition (e.g., self-discipline, effort, endurance, etc.). Spearman's theory was eclectic enough that it could be used to argue in favor of either more conservative (hereditarian) or more liberal (environmentalist) educational policies.
From a conservative point of view (e.g., Galton), more educational resources should be devoted to those children who could be identified as having the greatest potential by virtue of their high amount of g. From a more liberal point of view (e.g., Binet), those children with lower amounts of g were the ones who required the most attention and resources. Although Spearman surely leaned toward the conservative perspective associated with positive eugenics, his writings in The Abilities of Man show him to be open-minded about the nature–nurture debate to an extent that I found surprising. Notably, however, Spearman made no attempt to conduct his own investigations into the heritability or malleability of g. What Spearman cared most about was situating g within a larger system of laws that could explain individual differences in human abilities. The utility and consequences of invoking g as a basis for educational and social


policies were not issues that Spearman considered with any care in his writing, and he left this for others to debate.

7.6 Spearman’s Conceptualization of Measurement

Some version of the term measurement had featured prominently in the titles of the two major works that could be cast as the bookends to the most active and productive periods of Spearman’s scholarly career: “General Intelligence,” Objectively Determined and Measured (1904) and The Abilities of Man (1927). But when it comes to Spearman’s actual views as to the necessary and sufficient conditions for measurement, whether in physics or psychology, we have even less to work with than the bits and pieces to be found in the writing of Francis Galton. Over the course of his career, Spearman published four books, a short autobiography, and more than 100 journal articles. As far as I can find, there is really just one instance in this collection where he considers the boundaries of measurement as a concept or a method, and this is contained in just two-and-a-half pages in Volume 1 of his 1937 book Psychology Down the Ages (Spearman, 1937, 89–91). In these pages, Spearman seems to equate measurement with mathematical analysis, full stop. Having discussed the critical role of experiments and experimentation for advancing psychology, Spearman (1937) writes:

    But great as may be the potency of this [the experimental method], or of preceding methods, there is yet another one so vital that, if lacking it, any study is thought by authorities not to be scientific in the full sense of the word. This further and crucial method is that of measurement, or rather of mathematics; for this latter is what science really needs. (89)

To Spearman, the pursuit of psychology should be the discovery of laws that help explain the workings of the mind. The laws must be expressible as mathematical equations, and the quantitative variables in these equations need to be expressible as continuous magnitudes. All this is evident in the core equation of Spearman’s two-factor theory,

$$m_{ai} = r_{ag}\, g_i + r_{as_a}\, s_{ai}.$$
The theory was premised on the hypothesis that $g_i$ and $s_{ai}$ exist, that they are quantitative, and that they combine linearly to produce $m_{ai}$. Let’s consider this more carefully. Now, if the model holds, the term on the left side of the equation, $m_{ai}$, can be given a straightforward interpretation as a derived measurement that results from the linear combination of the two variables $g_i$ and $s_{ai}$, weighted by the fixed coefficients $r_{ag}$ and $r_{as_a}$. In other words, if $g_i$ and $s_{ai}$ both exist and represent quantitative variables, $m_{ai}$ exists and represents a quantitative variable. However, the psychological state of affairs is rather complicated, since $g_i$ and $s_{ai}$ are latent and thus only hypothetical. The only observable variable is $m_{ai}$, and while syntactically it is represented as the


output of the model on the left side of the equation (i.e., the measure), it is from the intercorrelation of these “outputs” that we need to derive the estimates of all the terms on the right side of the equation. So, from a theoretical perspective, $m_{ai}$ is an output, but from a design perspective, it is an input. Spearman was quite clear in assuming that both $g_i$ and $s_{ai}$ could be treated as though they were continuous quantities, and without loss of generality, he imagined each to be standardized to have a mean of 0 and a unit of 1. As a result, since $g_i$ and $s_{ai}$ were assumed to be orthogonal, there is an equivalent way to express the basic model we first encountered in Equation 7.1:

$$m_{ai} = r_{ag}\, g_i + \sqrt{1 - r_{ag}^2}\; s_{ai}. \tag{7.7}$$

A consequence of this move, one that continues to define most modern factor analytic models, is that $m_{ai}$ is also expressed on a scale in which the unit is a standard deviation, and it is easy to lose track of major differences that may exist in the structure of the original data as collected, before it was standardized. At one extreme, $m_{ai}$ could represent a subjective ranking by an observer (as in Spearman’s village school sample); at the other, it could consist of a measure expressed in the units of a physical attribute. In most cases, $m_{ai}$ would be the score from a mental test, but this score might derive from a single task or problem, or it might be a composite formed across multiple tasks or problems. The mathematics of the model Spearman originally proposed placed no restrictions on the “measures” that were eligible to be defined as a linear combination of $g_i$ and $s_{ai}$. Measurement through correlation provides standard reference units throughout, but to a large extent, this is only an illusion.

What if $g_i$ and $s_{ai}$ exist and are the primary causal agents behind test performance, as Spearman hypothesized, but exist as qualitative variables as opposed to quantitative ones? What insight could a factor analysis provide to this question? The answer is that it provides no insight whatsoever (Michell, 1999, 2020). The fundamental theorem of modern factor analysis is the mathematical result that any correlation matrix of k variables (so long as it is positive definite) can be reexpressed in terms of a k × p matrix (the pattern matrix of general factor loadings) and another vector of length k (specific factor loadings and measurement error). In a best-case scenario, Spearman’s method of evaluating tetrad equations would only provide evidence for whether p = 1. It is insensitive to the possibility (for example) that g and s are both qualitative variables or that g is qualitative and s quantitative (or vice versa).
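To see what the tetrad criterion does and does not test, consider a correlation matrix built from a single common factor with arbitrary loadings (the five loading values below are hypothetical, chosen only for illustration): every tetrad difference then vanishes identically, no matter what the loadings are or what the latent variable behind them is.

```python
import itertools

loadings = [0.9, 0.8, 0.7, 0.6, 0.5]  # hypothetical g loadings for five tests
k = len(loadings)

# Off-diagonal correlations implied by a one-factor model: r_ij = l_i * l_j
R = [[1.0 if i == j else loadings[i] * loadings[j] for j in range(k)]
     for i in range(k)]

# One tetrad difference per set of four variables (each set yields three; one suffices here)
tetrads = [R[a][b] * R[c][d] - R[a][c] * R[b][d]
           for a, b, c, d in itertools.combinations(range(k), 4)]

print(max(abs(t) for t in tetrads))  # zero, up to floating-point rounding
```

The criterion is purely structural: the products $l_i l_j$ satisfy it regardless of whether the latent variable generating them is a quantity at all, which is Michell’s point.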
The entire edifice of factor analysis is premised on observed and latent variables that are quantitative by fiat, and hence measurable by fiat as well. This was true in Spearman’s time, and it continues to be true today. While the measurement by fiat label applies to Spearman just as it applies to Galton, Spearman needs to be given credit for distinguishing between the instrumental task of measurement (whether through holistic rankings or the


administration of a mental test) and the underlying ability that was the target of measurement. In this, he had seen a clear parallel to the physical sciences:

    It is no new thing thus elaborately to deal with and precisely measure things whose real nature is concealed from view; of this nature, for instance, is obviously the study of electricity, of biology, and indeed of all physical science whatever. (Spearman, 1904c, 258)

What he believed he had already established in 1904 was a correlational method that could be used to objectively determine the contribution any given test could make to general and specific measures of ability, irrespective of the labels that had been affixed to the tests in their creation. The approach Spearman was introducing was intended to be objective in the sense that the corroboration or falsification of the two-factor theory could be done independently of the subjective interpretation of any single test designer. A collection of tests either produced a correlation matrix with a hierarchical structure or did not, and if they did produce a hierarchical structure, then some tests would be more strongly related to the general factor than others. We can see in hindsight how this perspective was, at best, overly optimistic and, at worst, somewhat naïve, but the one thing that we can say about Spearman is that he was no operationalist. In short, Spearman’s overarching interest was in establishing a method that could be used to corroborate and defend a theory of mental abilities, not in establishing the ideal instrumental process by which his general intellective factor could or should be measured and compared. Spearman viewed his life’s work as bearing upon the scientific task of measuring intelligence, not the instrumental task.13 Where Spearman showed the least interest was with the culminating stage of most measurement processes: the formation of a measuring scale with a well-defined reference unit.
Ironically enough, it can be argued (and Spearman himself promoted this argument) that it was those who expressed the most reservations (if not outright hostility) toward the two-factor theory who did the most to promote an instrumental approach to the measurement of intelligence that was consistent with it. Specifically, in the United States, educational psychologists such as Edward Thorndike, Truman Kelley, and Louis Thurstone, who each regarded intelligence as a complex, multidimensional attribute, and who questioned the evidentiary basis of Spearman’s two-factor theory, nonetheless played an active role in expressing the results of an intelligence test battery on a single scale. The end result was that a person’s intelligence could be placed on, most famously, Terman’s Stanford–Binet IQ scale, and this became the de facto unitary measure of intelligence that drove practical applications. To the eyes of Thorndike, Kelley, and Thomson, what IQ represented was a formative composite of numerous cognitive attributes that collectively defined what it meant for a person to display intelligence. To Spearman, what was being reported was the reflective attribute of g.


In 1927, both Spearman and Thorndike published major books produced in collaboration with students and staff over many years at their respective institutions: The Abilities of Man and The Measurement of Intelligence, respectively. There was a stark difference in the tone of the two books. Thorndike’s book was a comprehensive statement of all the challenges that impeded the measurement of intelligence; Spearman’s book was a confident narrative of how such problems could be solved. Beyond this, an interesting difference between the two books is that while Spearman’s book represented a comprehensive effort to corroborate his two-factor theory and situate it within a larger model of the mental architecture of the human mind, Thorndike’s book focused on the instrumental task of measurement and the need to define and ascribe a meaningful reference unit to intelligence test scores:

    Existing instruments represent enormous improvements over what was available twenty years ago, but three fundamental defects remain. Just what they measure is not known; how far it is proper to add, subtract, multiply, divide, and compute ratios with the measures obtained is not known; just what the measures obtained signify concerning intellect is not known. We may refer to these defects in order as ambiguity in content, arbitrariness in units, and ambiguity in significance. (Thorndike et al., 1927, 1)

In contrast to Spearman, Thorndike’s starting point was the instruments that were already being used to ostensibly measure human intelligence. He asks of these instruments: What abilities do they measure? What is the nature of the scale on which they are being measured? And what, collectively, do the tests tell us about the nature of intelligence? The first and third of these questions overlapped with Spearman’s line of investigation, but the second did not.
Thorndike’s ideas for solving the problem of arbitrariness in units largely followed in the footsteps of Galton’s notion of relative measurement by invoking the assumption that whatever was being measured followed the normal distribution. In fact, just two years earlier, Thurstone had introduced a method, also premised on an elaboration of a normality assumption, that made it possible to express tests given to children at different ages on a common absolute scale with a particular test-specific standard deviation as the unit of measurement (Thurstone, 1925). Among the handful of giants of quantitative psychology in the first half of the 20th century, Spearman had been well positioned to attempt to form connecting lines between the scientific and instrumental tasks of measuring a psychological attribute. Spearman had been trained in the German school of experimental psychology before immersing himself in the study of individual differences. He was surely familiar with both Fechner’s method of psychophysical measurement and Galton’s method of relative measurement. Yet while


application of the inverse normal distribution had been central to Fechner’s and Galton’s methods of measurement, in his own work Spearman seems to have purposefully avoided invoking normality assumptions (Spearman, 1908). In fact, on at least one occasion, Spearman questioned the normality assumption as the underpinning for Thorndike’s methods of creating an educational scale (Spearman, 1927b). But nowhere can we find Spearman elaborating or enacting his own preferred instrumental solution. Nor was this an issue taken up by any of his students or intellectual disciples. Ever since, questions regarding the nature and structure of human intelligence have been addressed almost entirely separately from questions regarding the properties of the scale or scales on which measures of intelligence are expressed.

Notes

1 The parenthetical “or group of Functions” is easy to overlook, but it was meant to signal the possibility that on further investigation, g might be comprised of multiple factors. In this sense, Spearman’s original statement of the theory was not incommensurate with Godfrey Thomson’s (1916) sampling theory of ability or even Thurstone’s (1947) multiple-factor model, both of which we will encounter in the next chapter.

2 The first instance I can find in which Spearman refers to the general factor with the shorthand “g” is in the first footnote of Spearman (1914a), when he writes, “Let a, b, p, and q denote any four abilities, each assumed to depend partly on a specific independent factor, and partly on a general factor; call the latter G.”

3 While this assumption was generally plausible for Spearman’s original prep school sample because it involved boys of the same age and advantaged educational background, differences in “training” would be equivocal in most other settings. This led Spearman to promote the need for mental tests designed to be insensitive to training.

4 The basic strategy behind the estimation of factor loadings goes something like this. Say we have three measures, $m_1$, $m_2$, and $m_3$, and the corresponding three pairwise correlations: $r_{m_1m_2}$, $r_{m_1m_3}$, and $r_{m_2m_3}$. Now, according to the generalization of Equation 7.3,

$$r_{m_1m_2} = r_{m_1g}\, r_{m_2g} \quad \text{(a)}$$
$$r_{m_1m_3} = r_{m_1g}\, r_{m_3g} \quad \text{(b)}$$
$$r_{m_2m_3} = r_{m_2g}\, r_{m_3g} \quad \text{(c)}$$

To solve for $r_{m_1g}$, multiply both sides of (a) and (b) and divide by (c):

$$\frac{r_{m_1m_2}\, r_{m_1m_3}}{r_{m_2m_3}} = \frac{r_{m_1g}\, r_{m_2g}\, r_{m_1g}\, r_{m_3g}}{r_{m_2g}\, r_{m_3g}}.$$

After canceling the common terms, rearranging, and taking the square root of both sides,

$$r_{m_1g} = \sqrt{\frac{r_{m_1m_2}\, r_{m_1m_3}}{r_{m_2m_3}}}.$$

This explains why a minimum of three measures is needed to estimate loadings for a single factor. For details, see Spearman (1927a, xvi–xvii). For an even clearer and more complete treatment, see Thurstone (1947, Chapter 12).
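The algebra in note 4 can be checked numerically. With hypothetical loadings of 0.8, 0.7, and 0.6 (illustrative values, not figures from the book), the pairwise correlations are products of loadings, and the triad formula recovers the first loading exactly:

```python
import math

# Hypothetical g loadings for three measures
l1, l2, l3 = 0.8, 0.7, 0.6

# Pairwise correlations implied by the one-factor model (equations a, b, c in note 4)
r12 = l1 * l2  # 0.56
r13 = l1 * l3  # 0.48
r23 = l2 * l3  # 0.42

# Triad formula for the loading of measure 1 on g
r_m1g = math.sqrt(r12 * r13 / r23)
print(round(r_m1g, 10))  # recovers l1 = 0.8
```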


5 I use this particular correlation matrix instead of the more famous one Spearman referenced in General Intelligence from his prep school sample because, as Fancher (1985a) has demonstrated, there is good reason to be suspicious about the accuracy of that matrix. Spearman was frustratingly vague about the details of many of his calculations in General Intelligence, but especially those that pertained to the hierarchy of correlations, which he only illustrated using his prep school sample.

6 Every set of four variables in a correlation matrix results in three unique tetrad equations. The number of distinct tetrad equations has the following functional relationship with the number of variables, n, in a correlation matrix:

$$3\binom{n}{4} = \frac{n!}{8\,(n-4)!} = \frac{n(n-1)(n-2)(n-3)}{8}.$$

So, for the five tests in the Bonser correlation matrix, there would be a manageable total of 15 tetrad equations. But for a collection of 20 mental tests, the total would balloon to 14,535 unique tetrad equations! The fact that these would have to be solved by hand led Spearman to initially look for a simpler statistical criterion involving the correlations of the columns of a correlation matrix (Hart & Spearman, 1912). The flaws in this approach (pointed out by Brown & Thomson, 1921) led him back to direct evaluation of the tetrad equations, and one of Spearman’s students, William Stephenson, actually computed 14,535 tetrad differences in support of a study in collaboration with Brown (Brown & Stephenson, 1933).

7 A simplified version of this formula is as follows. Let $T_k$ represent the theoretical difference from a tetrad equation (where, in the Bonser example, k = 1, . . . , 15), which has an expected value of 0. The standard error of the tetrad difference is $\sigma_{T_k} = 2\bar{r}(1-\bar{r})/\sqrt{N}$, where $\bar{r}$ is the average of the four correlations in the tetrad and N is the number of individuals in the sample. There will be a unique standard error associated with each tetrad difference.
8 Spearman (1927a, v) explicitly points this out in the appendix of The Abilities of Man.

9 See Hearnshaw (1964) and Norton (1979). Critical contemporary reviews were by Collingwood (1923), Wheeler (1924), and commentary in Thorndike et al. (1927). Also see Michell (2020).

10 Even in The Abilities of Man, the focus of the chapters is on laws of ability and cognition and how they appear to interact. Details about the underlying methods were placed in the appendix of the book.

11 Raymond Cattell would later make the distinction between fluid and crystallized intelligence. In this sense, Spearman was interested in understanding the nature of fluid intelligence and its measurement.

12 See Spearman (1931) for a prime example of this viewpoint.

13 In his autobiography, Spearman (1930) notes that he intentionally avoided getting involved in the practical application of intelligence testing, giving the curious justification that “the practical application of tests necessitates their standardization; and standardization spells scientific stagnation” (326). This is not to say that Spearman was completely removed from the design and use of intelligence tests; to the contrary, he was clearly involved in this both in the context of studies conducted by and with his students and as an advisor to others who developed such tests for commercial and public purposes. But test design and use were never Spearman’s focus. They were a necessary means to the end of establishing a theoretical edifice for his physics of the mind.
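The tetrad counts given in note 6 follow directly from the formula $3\binom{n}{4}$ and can be confirmed in a few lines (a quick check, not part of the original text):

```python
from math import comb

def n_tetrad_equations(n: int) -> int:
    """Number of distinct tetrad equations among n variables: 3 * C(n, 4)."""
    return 3 * comb(n, 4)

print(n_tetrad_equations(5))   # 15, as for the five Bonser tests
print(n_tetrad_equations(20))  # 14535, the total Stephenson computed by hand
```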

8 THEORY VS. METHOD IN THE MEASUREMENT OF INTELLIGENCE

8.1 Challenges to the Theory of Two Factors

Spearman’s two-factor theory was met with controversy from the moment it was first published. Many of the early reactions to Spearman’s thesis between 1904 and 1912 focused primarily on the method of corroboration Spearman had demonstrated, which relied on the application of his newly introduced disattenuation formula (see Chapter 6). Pearson had been the first to immediately question the legitimacy of the formula and its application. In separate studies, Thorndike and Brown replicated Spearman’s approach with newly collected data and showed that the disattenuated correlations of various mental tests were far below unity. On the other hand, a follow-up study by Krueger and Spearman (1906) and a study by Burt (1909) offered strong support.1 A major turning point came with the publication of Hart and Spearman (1912). At this stage, Spearman seems to have fully abandoned the use of disattenuated correlations to corroborate the two-factor theory, and here we can see the elaboration of a method based solely on the evaluation of a matrix of correlation coefficients for hierarchical order. From this point forward, debates over the validity of the two-factor theory were generally framed within the context of the proper analysis and interpretation of the patterns of intercorrelations found after administering some collection of mental tests to a sample of individuals. I summarize what I view as four distinct stages of Spearman’s promotion, elaboration, modification, and defense of the two-factor theory in Figure 8.1. In the three sections that follow, we focus attention on three central objections that were raised about the two-factor theory between 1912 and 1950. The first objection was with the nature of Spearman’s empirical warrant for the two-factor theory, and this was conveyed most prominently and most persuasively

FIGURE 8.1 Spearman’s Program of Research Related to the Two-Factor Theory.

Origin of Two-Factor Theory: Spearman (1904), “General Intelligence,” Objectively Determined and Measured.

Stage 1 (1905–1912)
- Evolution of theory: application and refinement of methods used to test the theory [Spearman (1906, 1907, 1910, 1913); Krueger & Spearman (1906); Burt (1909)].
- Major challenges: critiques of methods used to test the theory; skepticism about the theory [Pearson (1909); Thorndike et al. (1909); Brown (1910)].

Stage 2 (1912–1923)
- Evolution of theory: claim that the two-factor theory has been corroborated; implications for practice [Hart & Spearman (1912); Spearman (1914a, 1914b)].
- Major challenges: Thomson’s “sampling theory” as an alternative explanation; conflicting evidence from Thorndike’s analysis of the Army Alpha [Thomson (1916, 1919, 1920); Brown & Thomson (1921); Thorndike (1921)].

Stage 3 (1923–1933)
- Evolution of theory: elaborated defense and modification of the two-factor theory and methods for testing the theory [Spearman (1923, 1927); Spearman & Holzinger (1924); Garnett (1920); Piaggio (1933)].
- Major challenges: Pearson’s attack; Wilson raises concerns about factor indeterminacy; Thorndike’s competing approach to measurement [Pearson (1927a, 1927b); Pearson & Moul (1927); Wilson (1928a, 1928b, 1933a, 1933b, 1933c); Thorndike et al. (1927); Kelley (1928)].

Stage 4 (1933–1950)
- Evolution of theory: further modifications and consolidation of evidence and defense of the two-factor theory [Spearman (1932, 1933a, 1933b, 1934a, 1934b, 1934c, 1946); Spearman & Jones (1950)].
- Major challenges: generalization and reframing of Spearman’s methods to the context of multiple factor analysis [Thurstone (1934, 1939, 1947); Burt (1940); Thomson (1950)].

by Godfrey Thomson as part of an active debate with Spearman that lasted some 20 years. The second objection is especially relevant to the topic of this book: namely, a largely unresolved debate between Spearman and the mathematician Edwin Wilson over the claim that g was measurable. The third objection concerned the existence and interpretation of “group factors” thought to fall somewhere between Spearman’s general and specific factors. It was this issue that ultimately led Thurstone to develop a competing paradigm for the method of factor analysis that would come to supplant the one Spearman had originated.

8.2 Godfrey Thomson’s Sampling Theory of Ability

Godfrey Thomson was a contemporary of Spearman’s and one of the most important quantitative psychologists of the first half of the 20th century. On multiple levels, Thomson served as an ideal foil for Spearman. Thomson had received his training in physics and mathematics before becoming interested in the new field of psychophysics, and much like his eventual collaborator, William Brown, he became interested in the topic of individual differences and the use of mental tests and correlational methods to analyze them. Like Spearman, Thomson was both creative and single-minded. Once he latched onto a promising idea, he would develop it for years, and he enjoyed academic debate. But Thomson was even more adept and fluent with mathematics and probability theory than Spearman, and he was on better professional terms with many of Spearman’s other preeminent rivals, most notably Pearson and Thorndike.

Just prior to accepting a position as a professor of education at the University of Edinburgh in 1925, Thomson had been given the task of designing the first national program of standardized testing in Scotland, a program that became known as the Moray House Tests. The tests were to be used to identify 11-year-old children who would be eligible for a scholarship that would afford them free secondary education. As Thomson himself had grown up in a rural setting in northern England and had only gotten access to a secondary education through his successful performance on a competitive examination, this was a responsibility he took quite seriously.2 It was left to Thomson to design tests for which a child’s performance would reveal an aptitude for further education, something distinct from the scholastic opportunities afforded to them by their circumstances.
What Thomson created for this purpose drew on the advances in mental testing that had been initiated by Binet and followed from the same pragmatic ethos: the desire to use a standardized test to provide an objective basis for awarding finite resources and opportunities. The means to that end were intelligence tests by any other name, and Thomson’s use of them for these selective purposes would have been greatly bolstered if Spearman’s two-factor theory could be corroborated. It is therefore somewhat ironic that


Thomson emerged as one of the foremost thorns in Spearman’s side with his repeated challenges to the two-factor theory between 1914 and 1935. As of 1912, Spearman, in collaboration with Bernard Hart, had successfully demonstrated that if the two-factor theory was true, then a correlation matrix that satisfied the tetrad equations would follow. He had then provided empirical evidence of correlation matrices that apparently satisfied the tetrad conditions within the limits of sampling error. Did this evidence corroborate the two-factor theory? Thomson argued that it did not. The two-factor theory could explain the observation of a hierarchical correlation matrix, but there were other possible (and, in Thomson’s view, better) explanations. Because there was more than one possible cause of the same effect, one should avoid the mistake of confusing correlation for causation. To make this point, all that was required was the shuffling and drawing of playing cards and the rolling of some dice.

The linchpin of Thomson’s argument was the concept of an overlapping group factor. For now, we can formally define a group factor as a factor that will always be common to certain tests of mental ability but not others. A competing theory to Spearman’s two-factor theory was that intelligence should be regarded as “multifocal” and that mental tests should be written with this in mind, so that each test would draw on a small number of mutually exclusive abilities. For example, tests a and b would be written to elicit differences in short-term memory, tests c and d for verbal reasoning, and tests e and f for spatial visualization. In Hart and Spearman (1912), it had been argued that this theory could be ruled out because, if it were true, examining these kinds of specialized mental tests would be expected to yield correlation matrices with a predictable mixture of high (e.g., tests a and b) and moderate to low (e.g., tests c and e) correlation coefficients.
Hart and Spearman had noted that no one was finding these sorts of empirical patterns. Thomson argued that this was an unrealistic way to think about group factors. What if, he reasoned, performance on any mental test depended on taking a sample from a large number of much more narrowly conceived atomic factors present in a person’s brain? And what if these factors were not mutually exclusive across tests, as Spearman was assuming, but overlapping, such that they were present in some test combinations but not others? Could the presence of overlapping factors produce a pattern of positive test correlations? Would these patterns follow the hierarchical structure Spearman had taken as evidence of the two-factor theory? Thomson surmised that the answer to these last two questions was yes and set about constructing a simulation to prove it. It is to the details of this simulation that we now turn.

Thomson simulated data to support his argument in a number of different ways, first just by rolling dice (Thomson, 1916) and then, later, as he developed the idea further, by drawing cards and rolling dice in combination (Thomson, 1919a, 1920b).3 I describe the latter version that Thomson used to illustrate the


approach in Brown and Thomson (1940, 174–179). The setup begins by imagining that we have a collection of 10 tests, whereby each test is composed of some collection of tasks or items. These tests generate scores $x_1, x_2, \ldots, x_{10}$ that vary across the individuals who take the tests. Next, imagine that performance on each test depends on the presence of some combination of group factors and specific factors. To decide how many of these factors are associated with each test, we draw cards with replacement from a well-shuffled deck. We ignore the suit of the card and focus only on its numeric value, assigning the ace a value of 1 and the jack, queen, and king the values of 11, 12, and 13. As a consequence of this design, each test can be associated with a minimum of 1 and a maximum of 13 group factors, and the same holds true for the specific factors. Table 8.1 illustrates this process, where the rows represent each test, and the columns represent the number of associated group and specific factors. For the first test, we draw a five for the group factors and another five for the specific factors, for a total of 10 mental factors that govern performance on this test. For the second test, we happen to draw a five again for the group factors, but this time a three for the specific factors, and so on through the 10th test. Obviously, and importantly, the number of group and specific factors assigned to each hypothetical test so far has been selected at random. Let $n_{x_i}$ represent the number of group factors for a given test i. Now, to establish the degree of overlap among group factors across tests, for each test i = 1, . . . , 10, we draw $n_{x_i}$ cards at random, but this time without replacement and from only the 13 cards of a single suit. The results are shown in Table 8.2. The rows of the table again represent each of the 10 tests, but this time the columns show each of the 13 possible group factors.
For the first test, five cards were drawn from the deck: an ace, a two, a five, a seven, and a king. For the second test, another five cards were drawn from the reshuffled deck of 13: a five, a seven, a nine, a ten, and

TABLE 8.1 Random Assignment of Number of Group and Specific Factors to Tests

Test   Group Factors   Specific Factors   Total
1      5               5                  10
2      5               3                  8
3      12              12                 24
4      1               3                  4
5      7               6                  13
6      9               5                  14
7      13              13                 26
8      1               2                  3
9      9               3                  12
10     11              5                  16

TABLE 8.2 Group Factors (columns) Assigned at Random to Each Hypothetical Test (rows)

[The body of Table 8.2 is a 10 × 13 grid of X marks showing which of the 13 group factors (ace through king) were drawn for each test x1 through x10; the layout of the marks could not be recovered from the extracted text. From the surrounding prose: test 1 drew the ace, two, five, seven, and king; test 2 drew the five, seven, nine, ten, and queen.]

TABLE 8.3 Patterns of Overlap in Group Factors by Test Pairing

       x1    x2    x3    x4    x5    x6    x7    x8    x9    x10
x1     –     2     5     0     5     2     5     1     5     4
x2     2     –     5     1     3     5     5     0     2     4
x3     5     5     –     1     7     8     12    1     8     10
x4     0     1     1     –     1     1     1     0     0     1
x5     5     3     7     1     –     4     7     1     5     6
x6     2     5     8     1     4     –     9     0     5     8
x7     5     5     12    1     7     9     –     1     9     11
x8     1     0     1     0     1     0     1     –     1     1
x9     5     2     8     0     5     5     9     1     –     8
x10    4     4     10    1     6     8     11    1     8     –

a queen, and so on. Note that for each of the 45 possible pairwise combinations of tests we see a different pattern of randomly generated overlap in group factors. Table 8.3 summarizes this pattern of overlap.4 Thomson's last step was to simulate observed test scores for some hypothetical sample of students, whereby the generation of the test scores was in accordance with the factorial structure of the tests captured in Tables 8.1 through 8.3. This is where the dice came in. Thomson imagined a scenario with just 30 individuals to mimic the small samples in Spearman's (1904c) original General Intelligence studies. To generate a series of 10 test scores for the first person, we roll 70 dice: 13 that correspond to the group factors and another 57 that correspond to the specific factors. For the score of this person on test 1, take the sum of the five dice that correspond to this test's group factors (see the first row of Table 8.2) and the five that correspond to this test's specific factors (see the first row of Table 8.1). The next part is crucial.


TABLE 8.4 Thomson's Simulated Correlation Hierarchy

       x10    x5     x9     x6     x7     x1     x3     x2     x4     x8
x10    –      .66    .61    .71    .69    .45    .52    .37    .24    −.07
x5     .66    –      .67    .57    .52    .45    .36    .33    .25    .19
x9     .61    .67    –      .49    .58    .58    .40    .28    .10    .03
x6     .71    .57    .49    –      .57    .28    .42    .58    −.01   .01
x7     .69    .52    .58    .57    –      .33    .58    .43    −.11   .02
x1     .45    .45    .58    .28    .33    –      .59    .23    .27    −.06
x3     .52    .36    .40    .42    .58    .59    –      .23    −.14   .05
x2     .37    .33    .28    .58    .43    .23    .23    –      −.14   .05
x4     .24    .25    .10    −.01   −.11   .27    −.14   .04    –      −.10
x8     −.07   .19    .03    .01    .02    −.06   .05    −.14   −.10   –

To generate a score for the same person on test 2, retain the values for the factors that overlap with tests 1 and 2 (i.e., the five and seven cards) and then add in six new values: three that correspond to group factors on test 2 that do not overlap with test 1, and three that correspond to a new set of factors specific only to test 2. The same process is followed to generate all 10 test scores for this hypothetical person; then the process is repeated 29 more times, each time rolling a new set of 70 dice, so that we will now have a data matrix of 30 rows (one for each hypothetical person) and 10 columns (one for each test). The simulated data are now complete. With the simulated data in hand, Thomson could compute an observed correlation matrix, and this is depicted in Table 8.4. As we can see, almost all the intercorrelations are positive. In addition, once the rows and columns were appropriately ordered, Thomson noted that the results take on a close approximation to a hierarchical structure, and as he would demonstrate, it would satisfy the statistical criterion5 that had previously been introduced and applied in Hart and Spearman (1912). In summary, Thomson had demonstrated that there was no way to distinguish, through an empirical analysis of correlation coefficients, between a mental theory built on a single general factor with no group factors and one built on no single general factor but numerous overlapping group factors. Thomson (1916) found himself walking a fairly tightrope. On one hand, he had been careful not to claim that his simulation should be taken as an outright falsification of the two-factor theory:

It must not be hastily and illogically concluded by anyone that therefore General Ability is a fiction. Its existence or non-existence is, as far as the mathematical argument goes, an entirely open question, which will not be answered mathematically until someone successfully carries out a very much more extensive set of experiments than has yet been attempted. (280)


On the other hand, the more he thought about it, the more convinced he became that what might have begun as a contrived demonstration was actually a theoretically superior explanation for the facts on the ground, an explanation he named (somewhat confusingly) the sampling theory of ability. To further distinguish his theory from the two-factor theory, he eventually replaced his usage of "group" and "specific" ability factors (as in the simulation example earlier) with the more generic term "neural bonds,"6 some of which overlapped tests and some of which did not:

What the "bonds" of the mind are, we do not know. But they are fairly certainly associated with the neurons or nerve cells of our brains, of which there are probably round about ten thousand million in each normal brain. Thinking is accompanied by the excitation of these neurons in patterns. The simplest patterns are instinctive, more complex ones acquired. Intelligence is possibly associated with the number and complexity of the patterns which the brain can (or could) make . . . Intelligence tests do not call upon brain patterns of a high degree of complexity, for these are always associated with acquired material and with the educational environment, and intelligence tests wish to avoid testing acquirement. (Thomson, 1951, 313)

According to Thomson, then, the presence or absence of a given mental bond might be associated with a particular pattern of neuronal activity in the brain. While these patterns would surely be almost infinite and varying in complexity, it might be reasonable to assume that the patterns associated with intelligence test tasks might be finite and simpler. In what sense, then, were these bonds being sampled? This occurs in two different ways. First, because every test elicits mental activity in the form of bonds, and tests vary in the mental activity they are intended to elicit, every test can be viewed as requiring a sample of bonds.
Now, since tests are usually purposefully constructed to discriminate between different abilities, it may seem odd to regard each test as a random sample from the same population of bonds. Thomson's sampling theory would break down and turn into a multifocal theory of mutually exclusive group factors if the full population of bonds could be stratified and tests were then designed to purposefully require certain bonds in each stratum. However, because test designers are ignorant of the fine-grained match between items and bonds, for all intents and purposes, it might be possible to regard each test as if it were a random sample. Thomson also made the complexity of each test random by using draws from a card deck to determine the number of overlapping and unique bonds in each test. In Thomson's simulation, tests 2, 4, and 8 in Table 8.1 would be examples of simpler tests; tests 3 and 7 would be examples of more complex tests. So a first source of mental bond overlap comes from the ways that tests are designed, and this was represented in Thomson's simulation, in a highly simplified form, by imagining

Theory vs. Method

235

samples drawn from a small population of bonds specific to the mental activities required in 10 tests. The second source of sampling occurs at the level of the individual, because even though a test might require some collection of bonds to complete all the tasks correctly, there is no guarantee that each person taking a test would have all the necessary bonds, or the necessary strength in each bond, at their disposal:

The sampling theory would consider men also to be samples, each man possessing some, but not all, both of the inherited and the acquired neural bonds which are the physical side of thought. Like the tests, some men are rich, others poor, in these bonds. Some are richly endowed by heredity, some by opportunity and education; some by both, some by neither. (Thomson, 1951, 316)

This sense of sampling was represented in Thomson's simulation by the rolling of dice to generate individual-specific values for the overlapping and specific bonds on each test.7 It is surely not the case that the levels of bonds (the realized die values) among any group of individuals taking a common collection of tests will be a random sample, at least not an independent random sample. The more that the individuals come from a similar class and culture and have been given similar educational opportunities, the more we might expect them to have a similar reservoir of bonds needed to complete each test successfully. This would add some correlation to the successive dice rolls. The scenario in Thomson's simulation requires the thought experiment of an independent random sample of 30 individuals from some well-defined population.8 In summary, Thomson was able to show that when test scores can be conceptualized as the sum of independent random variables arising from the interaction of two forms of sampling (tests and people), the resulting correlation matrix will tend to approximate the hierarchy that Spearman had taken as evidence in support of his two-factor theory.
Thomson went one step further to argue that the hierarchy that had so captured Spearman's attention should be regarded as the rule rather than the exception.9 I present the results of a modern instantiation of Thomson's simulation using computer code (in place of cards and dice) in the Appendix to this chapter.
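The whole procedure is easy to re-create with a pseudorandom generator standing in for the cards and dice. The sketch below is my own minimal re-implementation (not the author's Appendix code); it generates scores containing no general factor whatsoever and then checks that the intercorrelations nevertheless come out overwhelmingly positive:

```python
import random

def thomson_simulation(n_tests=10, n_people=30, seed=1916):
    """Scores as sums of independent 'bonds': some shared (group factors),
    some unique (specific factors), with no general factor anywhere."""
    rng = random.Random(seed)
    # Card stage: how many group and specific factors each test gets (1-13),
    # and which of the 13 possible group factors it samples.
    n_group = [rng.randint(1, 13) for _ in range(n_tests)]
    n_spec = [rng.randint(1, 13) for _ in range(n_tests)]
    which = [rng.sample(range(13), k) for k in n_group]

    scores = []
    for _ in range(n_people):
        # Dice stage: one die per possible group factor, shared by every
        # test that samples it; fresh dice for each test's specific factors.
        group_dice = [rng.randint(1, 6) for _ in range(13)]
        row = []
        for t in range(n_tests):
            shared = sum(group_dice[f] for f in which[t])
            unique = sum(rng.randint(1, 6) for _ in range(n_spec[t]))
            row.append(shared + unique)
        scores.append(row)
    return scores

def corr(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = (sum((x - mx) ** 2 for x in xs) *
           sum((y - my) ** 2 for y in ys)) ** 0.5
    return num / den

scores = thomson_simulation()
cols = list(zip(*scores))
pairs = [corr(cols[i], cols[j]) for i in range(10) for j in range(i + 1, 10)]
share_positive = sum(r > 0 for r in pairs) / len(pairs)
```

The positive manifold emerges purely from overlapping samples of bonds: every pair of tests that happens to share group-factor dice shares variance, and none of it comes from a single common cause.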

8.3 Edwin Wilson and the Indeterminacy of g

There was another problem that somehow had escaped attention until it was brought to light in a review of The Abilities of Man written by the mathematician Edwin Wilson (1928a). Namely, even if one grants that g exists and that it is the cause of an observed hierarchical correlation matrix among some collection of tests, it cannot be uniquely measured. And to turn the phrase from Thorndike’s Credo, if g can’t be measured, is there value in asserting that it exists?

TABLE 8.5 Wilson's Hypothetical Example

          Test a          Test b          Test c
Person    ma      ga      mb      gb      mc      gc
1         10      1.23    8       0.65    7*      0.47
2         8       0.74    5       0.00    9       0.94
3         6       0.24    9       0.86    4       −0.23
4         4       −0.25   7       0.43    8       0.70
5         2       −0.74   0       −1.08   1       −0.94
6         0       −1.23   1       −0.86   1       −0.94
Mean      5       0       5       0       5       0
SD        3.74    .92     3.74    .81     3.52    .83
rg        .92             .81             .83

Note: The observed correlations between the three tests are as follows: r_ab = .74, r_ac = .73, r_bc = .66. *In Wilson's original review, this number is shown as "1," but this is clearly a typo in the print setting, as only a value of 7 satisfies the reported correlations.

At the time of his review of Spearman's book, Wilson was Professor of Vital Statistics at Harvard University and before that had held senior appointments in pure mathematics and mathematical physics at Yale and the Massachusetts Institute of Technology (Lovie & Lovie, 1995). Wilson, another polymath, had become interested in statistics, and by all accounts, he was nothing if not a formidable character. What he had been looking for in Spearman's book, given that it represented the culmination of two decades of empirical research, was

a single worked example composed of one set of nk scores for n individuals on k tests worked through to the determination of the g_x, g_y, . . . of the general intelligences of these individuals and of the nk values of their special abilities on each of the tests. Theorems which prove the existence of some possibility do not satisfy the practical applied mathematician—we do not so much want to know that there is a solution to the problem as to know what the solution is! (Wilson, 1928a, 245)

To remedy this, Wilson constructed a simple scenario (represented here by the data and summary statistics shown in Table 8.5), in which six students have been scored for their performance on three different tests (a, b, and c). Now, it can be shown that for these data, r_ag = .919, r_bg = .809, and r_cg = .826. In other words, test a is the most strongly saturated with g, followed by tests b and c, which have very similar correlations. Table 8.5 also shows the estimates of g we get by applying the formulas Spearman had presented in the appendix of The Abilities of Man (see


Equations 7.4 and 7.5 from the last chapter) to the standardized values from tests a, b, and c for each of our six students. Each test produces a different estimate of g, and in most cases, the differences are quite large. Spearman (1927a, xviii–xix) had been aware that estimates of g could vary by test and, in recognition of the problem, had suggested an alternative approach for computing g on the basis of a single composite test score weighted to maximize the total correlation with g. The point remains that Spearman had conceived of the measurement of g as if it were the prediction from a regression model, and as such, Wilson was pointing out that he had, in effect, omitted the error term. Instead of g_i = r_ag m_ai, the full model is g_i = r_ag m_ai + e_ai. Only under the assumption that e_ai is a random variable with an expected value of 0 could we write E(g_i | m_ai) = ĝ_i = r_ag m_ai. Hence, the approach Spearman had put forward for "measuring" each person's g was premised on a conceptualization in which g is composed of a determinate portion (r_ag m_ai) and an indeterminate portion (e_ai). And the indeterminate portion could be substantial. Even after pooling the tests using the approach Spearman had suggested, Wilson found there would be a standard error of measurement of .45 for both general and specific abilities, extremely large for a scale defined to have a standard deviation of 1. And the situation would be much worse for the specific abilities, especially with the recognition that the more precisely a given test measures g, the worse (by definition) it will measure s. Wilson (1928a) concluded that Spearman was not measuring g or s "any more than he would weigh a person by computing his weight from his height through a regression equation of weight on height" (245). But this was, in fact, exactly the aspiration Spearman had in mind for his technique of measurement through correlation.
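The per-test g estimates in Table 8.5 are easy to recompute from the raw scores. A minimal sketch (the printed table's rounding differs in the third decimal for one entry):

```python
# Raw scores from Table 8.5 for Wilson's six hypothetical students.
test_a = [10, 8, 6, 4, 2, 0]
test_b = [8, 5, 9, 7, 0, 1]
test_c = [7, 9, 4, 8, 1, 1]

def standardize(xs):
    """z-scores using the n-1 standard deviation, as in Table 8.5."""
    n = len(xs)
    mean = sum(xs) / n
    sd = (sum((x - mean) ** 2 for x in xs) / (n - 1)) ** 0.5
    return [(x - mean) / sd for x in xs]

# g loadings reported in the text for tests a, b, and c.
r_ag, r_bg, r_cg = 0.919, 0.809, 0.826

# Spearman's regression-style "measurement" of g from one test at a time:
# g_hat = r * z, silently dropping the error term e.
g_from_a = [r_ag * z for z in standardize(test_a)]
g_from_b = [r_bg * z for z in standardize(test_b)]
g_from_c = [r_cg * z for z in standardize(test_c)]

# The three tests "measure" the same person's g quite differently; e.g.,
# student 3 looks above average on test b but below average on test c.
disagreement = max(abs(x - y) for x, y in zip(g_from_b, g_from_c))
```

Running this reproduces the ga, gb, and gc columns of Table 8.5 and makes Wilson's point concrete: the same student receives estimates of g more than a full standard deviation apart depending on which test is used.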
The reason for the indeterminacy of g is apparent as soon as we recall that the two-factor model posits a multivariate system of equations, one per test, for each person. Because performance on each test is a function of two factors, g, which stays constant, and s, which is specific to each test, for every k test scores observed, there will always be k + 1 variables that are latent and for which we desire a measurement. Again, using the data shown in Table 8.5, Wilson demonstrated the range of possible values for g that would satisfy a system of k constraints based on the assumptions of the two-factor theory. The ranges of possible values for each student's g that Wilson found using this approach were even larger than what had been found using Spearman's regression approach.10 Wilson noticed a second obstacle to the measurement of g, this one specific to a concern about the arbitrary nature of test scores. A precondition that


Spearman had imposed for any battery of tests that could be expected to satisfy the two-factor theory was that each test needed to be "sufficiently dissimilar" from the others.11 For any two test scores that were not sufficiently dissimilar, m_ai and m_bi, Spearman recommended one of two moves: either dropping one of the redundant tests or pooling them into a new test score. In the latter case, we can express the pooled test score as m′_ai = w1 m_ai + w2 m_bi. Notice that the values of the newly defined m′_ai depend on the weights placed on m_ai and m_bi in the form of the two constants, w1 and w2, and these are, at least implicitly, at the discretion of the researcher. Wilson expressed this idea as a principle that applied not just to a particular subset of tests in a battery but to any subset of tests. Recall Table 8.5, which presented the hypothetical scores of six students on three tests. For any student, i, we observe m_ai, m_bi, and m_ci. But we can easily generate three alternative test scores m′_ai, m′_bi, and m′_ci by imposing the following transformations:

m′_ai = w1 m_ai + w2 m_bi + w3 m_ci
m′_bi = w4 m_ai + w5 m_bi + w6 m_ci
m′_ci = w7 m_ai + w8 m_bi + w9 m_ci

Now, if the original set of measures m_ai, m_bi, and m_ci satisfied Spearman's conditions for hierarchical order, under what conditions can we expect the same order to be found when the original measures are linearly combined to form m′_ai, m′_bi, and m′_ci? The answer is that there is an extremely large (but finite) number of values for the w's that would maintain the hierarchical order, but there is an even larger (and infinite) number that would not. And even among the values that would maintain the hierarchy, it would be necessary to include a combination of weights with both positive and negative values. For example, if we apply the transformations below to the scores of the first student in Table 8.5, such that

28 = 2 * 10 + 1 * 8 + 0 * 7
28 = .5 * 10 + 2 * 8 + 1 * 7
21 = 0 * 10 + 0 * 8 + 3 * 7,

we find that the original scores of 10, 8, and 7 are transformed to 28, 28, and 21. Wilson's argument was that from a mathematical perspective, no information about the student has been lost; it has just been differently assembled.12 However, these transformations, applied to all six hypothetical students, change the observed intercorrelations, as shown in the correlation matrix of Table 8.6. The correlations below the main diagonal are based on the original test scores m_ax, m_bx, and m_cx for


TABLE 8.6 Correlation Matrix With Tests Before (lower triangle) and After Transformation (upper triangle)

          Test a    Test b    Test c
Test a    1.00      .948      .774
Test b    .743      1.00      .846
Test c    .756      .668      1.00

the six students; the correlations above the main diagonal are based on the transformed test scores m′_ax, m′_bx, and m′_cx using the weights w1, . . . , w9 shown earlier. Notice not only that the correlations are uniformly higher but also that they now take on a different hierarchical order. Depending on the choice of transformation coefficients, it would be possible to generate correlation matrices that meet either extreme of perfectly failing or perfectly satisfying the two-factor theory, and there is little way to predict which outcome is more likely:

What does this leave of the concept of the intelligence of an individual x as measured by gx? Apparently only that it is relative to the set-up, which is the obvious proposition that I set out to prove. (Wilson, 1928a, 247)
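Wilson's demonstration is easy to replicate. A minimal sketch using the Table 8.5 scores and his weights (2, 1, 0; .5, 2, 1; 0, 0, 3):

```python
# Raw scores from Table 8.5 for the six hypothetical students.
a = [10, 8, 6, 4, 2, 0]
b = [8, 5, 9, 7, 0, 1]
c = [7, 9, 4, 8, 1, 1]

def corr(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = (sum((x - mx) ** 2 for x in xs) *
           sum((y - my) ** 2 for y in ys)) ** 0.5
    return num / den

# Wilson's linear reassembly: a' = 2a + b, b' = .5a + 2b + c, c' = 3c.
# The weight matrix is nonsingular, so no information is lost.
a2 = [2 * x + y for x, y in zip(a, b)]
b2 = [0.5 * x + 2 * y + z for x, y, z in zip(a, b, c)]
c2 = [3 * z for z in c]

before = (corr(a, b), corr(a, c), corr(b, c))
after = (corr(a2, b2), corr(a2, c2), corr(b2, c2))
```

Running this reproduces Table 8.6: every correlation rises, and the pair that was most strongly correlated before the transformation is no longer the most strongly correlated after it, so the hierarchical order itself has changed.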

8.4 Louis Thurstone's Multiple-Factor Method

In 1936, the American psychologist J. P. Guilford published the first edition of the textbook Psychometric Methods, a comprehensive treatment of the methods of quantitative psychology. His book surveyed the methods that had begun with psychophysics, taken on a new scope with the advent of the correlational study of individual differences, and then, seemingly, culminated in the newly developed method of factor analysis. Figure 8.2 is reproduced from Guilford's final chapter, and in it, he depicts three prominent and competing theories of ability that had been proposed to explain individual differences in mental test performance. The first two theories, marked with an A and B, represent Spearman's theory of two factors and Thomson's sampling theory of ability, respectively. The third theory, marked with a C, corresponded to a multiple-factor theory of ability. This theory proposes that performance on cognitive tasks depends on a relatively small number of distinct abilities that operate together in different weighted combinations, depending on the nature of the task. In the graphic, the distinct abilities are represented by ellipses with roman numerals I through VII. Additional ellipses have been superimposed to represent three hypothetical tests, marked with the lowercase letters a, b, and c. According to this theory, the

FIGURE 8.2 Competing Theories of Mental Ability as Represented by Guilford (1936).

Source: Guilford (1936).

[Figure not reproduced: panel A shows Spearman's two-factor theory, a general factor G surrounded by tests a through h with their specific factors s_a through s_h; panel B shows Thomson's sampling theory; panel C shows the multiple-factor theory, with ability ellipses labeled I Space, II Attention, III Memory, IV Number, V Verbality, VI Imagination, and VII Relations, overlapped by ellipses for three hypothetical tests a, b, and c.]

common explanation of individual differences on all three tests is not some general factor, but varying combinations of verbal ability, numeric ability, and the like. In this illustration, test a variability is equally well explained by underlying differences in a person's attention and memory. In contrast, for test b, it is verbality and relations that take on the greatest weight, with attention, space, and number playing a much smaller role; and for test c, verbality, imagination, and relations are the presumed causal agents that take on equal roles. This multiple-factor theory of ability was not exactly new, as it had enjoyed early support in the United States from Thorndike (1921) and Kelley (1928). But it was Thurstone who was most responsible for introducing and popularizing a new approach for factor analysis that was consistent with this theory. Superficially, what Thurstone proposed was a generalization of Spearman's model (introduced in Equation 7.1 of the previous chapter). As an equation, it can be written as follows:

m_ai = r_af1 f_1i + r_af2 f_2i + . . . + r_afK f_Ki + r_asa s_ai.    (8.2)


As before, m_ai is the measure produced for a given test and person, but it is now being expressed as a linear combination of multiple factors, some of which may be common across different tests and one (s_ai) that is always unique. Conceptually, however, what Thurstone was proposing was a different paradigm for the study of mental abilities. At the outset of his classic article "The Vectors of Mind" (Thurstone, 1934, which later became a book, Thurstone, 1935), he articulates his sense of the problem that needs to be addressed:

It has been customary to postulate a single common factor (Spearman's 'g') and to make the additional but unnecessary assumption that there must be nothing else in common with any pair of tests. Then the tetrad criterion is applied and it usually happens that a pair of tests in the battery has something else in common besides the most conspicuous single common factor. For example, two of the tests may have in common the ability to write fast, facility with geometrical figures, or a large vocabulary. Then the tetrad criterion is not satisfied, and the conclusion is usually one of two kinds, depending on which side of the fence the investigator is on. If the investigator is out to prove 'g,' then he concludes that the tests are bad because it is supposed to be bad to have tests that measure more than one factor! If the investigator is out to disprove 'g,' then he shows that the tetrads do not vanish and that therefore there is no 'g.' Neither conclusion is correct. The correct conclusion is that more than one general factor must be postulated in order to account for the intercorrelations, and that one of these general factors may still be what we should call intelligence. (Thurstone, 1934, 4)

In short, Thurstone viewed questions about human abilities and the nature of intelligence as questions that required further systematic exploration, rather than questions for which a compelling theory had already been established.
In Thurstone's view, the appearance of group factors should not be taken as an annoyance to be accommodated within the two-factor theory but as an invitation to consider new competing theories. Another distinguishing feature of Thurstone's approach came in its mathematical formalization. Where Spearman had formulated a model in strictly algebraic terms, the added complexity of generalizing the model to incorporate multiple factors led Thurstone to cast the analysis of correlation coefficients as the mathematical problem of establishing the rank of a square matrix. This was a problem for which many solutions could be brought to bear once it was cast in the language of matrix algebra, which Thurstone promptly taught himself, with some tutoring from one of his research assistants. The matrix algebra formalization also lent itself to geometric interpretations and visualizations of the factor analytic challenge, and this is why Thurstone cast the methodological challenge as one in which test "vectors" needed to be located relative to factorial "axes."
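The rank formulation can be illustrated with a toy calculation. Under a one-factor model, the implied off-diagonal correlations are products of the tests' loadings, so every tetrad difference vanishes (equivalently, the reduced correlation matrix has rank 1); give just two tests a second shared factor and the hierarchy breaks. The loadings below are invented for illustration:

```python
# Hypothetical loadings of five tests on a single common factor.
loadings = [0.9, 0.8, 0.7, 0.6, 0.5]

# Off-diagonal correlations implied by a one-factor model: r_ij = l_i * l_j.
def r(i, j):
    return loadings[i] * loadings[j]

# Every tetrad difference r_ik*r_jl - r_il*r_jk vanishes: Spearman's
# hierarchy, i.e., the reduced correlation matrix has rank 1.
tetrads = [r(i, k) * r(j, l) - r(i, l) * r(j, k)
           for i in range(5) for j in range(5)
           for k in range(5) for l in range(5)
           if len({i, j, k, l}) == 4]
max_tetrad = max(abs(t) for t in tetrads)

# Add a second common factor shared only by the first two tests (the
# "something else in common" of Thurstone's example) and a tetrad
# involving that pair no longer vanishes.
extra = [0.3, 0.3, 0.0, 0.0, 0.0]
def r2(i, j):
    return loadings[i] * loadings[j] + extra[i] * extra[j]
broken = r2(0, 1) * r2(2, 3) - r2(0, 3) * r2(2, 1)
```

Here `broken` works out to 0.09 × 0.42 = 0.0378: exactly the kind of nonvanishing tetrad that, on Thurstone's reading, signals not bad tests but the need to postulate more than one common factor.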


As characterized by Thurstone, when analyzing a battery of test scores, each test could be conceptualized by a location in n-dimensional space. A space with n = 1 corresponded to a scenario with a single common factor, in which case a test's location would be a single point characterized by its loading. But for a space with n = 2, a test's location would be characterized by a vector, and with each new dimension, an additional vector would be needed to characterize the test's location. Unfortunately, while coordinates for each test's location could be established, the axes for interpreting these coordinates were indeterminate. It was the job of the factor analyst to rotate the axes toward an interpretation that was substantively interpretable relative to psychological theory. The criterion Thurstone established to guide this rotation was that of simple structure, which held that all factor loadings should be positive (within the limits of sampling error) and that the number of zero loadings between tests and prospective explanatory factors should be maximized. Finally, in rotating the axes of a factor solution, Thurstone identified one additional variable, namely, whether the axes should be constrained to be orthogonal to one another during rotation or whether they should be allowed to rotate at oblique angles.13 There was one more distinct feature of Thurstone's approach that has been easy to lose sight of as factor analysis itself has become subsumed within the larger framework of structural equation modeling. This was Thurstone's requirement of invariance in factorial structure (see Thurstone, 1947, 360–376). Thurstone viewed meaningful description as the central purpose of factor analysis.
In contrast to Harold Hotelling, who had, in 1933, developed the method of principal components analysis, Thurstone did not conceive of factor analysis as primarily a method of data reduction, but as a method for exploring preexisting hypotheses about the factorial structure of psychological attributes. As such, if after conducting a multiple-factor analysis, a particular combination of abilities is being proposed as a valid explanation for performance on a given test in a battery of tests, the criterion of invariance requires that the same combination of abilities will emerge when the test is moved to a different battery. Invariance, of course, can be viewed as a prerequisite for comparability, and this implied another implicit critique of Spearman's g. Under Spearman's approach, a precondition for the measurement of g was to show that the tetrad equations could be satisfied. Putting aside for the moment Wilson's critique about the determinacy of g, imagine that the tetrad equations have been satisfied for two mutually exclusive test batteries, x and y, with the same number of tests in each battery. Now say that g_x and g_y are estimated for the individuals in each battery, and we find in a comparison between individuals across batteries that ĝ_x > ĝ_y. How can we be sure that we have really measured a comparable g in each battery? Spearman had suggested that this problem could be solved by choosing one or more common reference tests in each battery. But the success of such an approach hinged on an invariance assumption that Spearman had neither made explicit nor attempted to validate through the many empirical investigations carried out with his students and other collaborators.


By 1934, working in close collaboration with his wife, Thelma Gwinn, and the students in his Psychometric Laboratory at the University of Chicago, the Thurstones set out to apply their new approach to multiple factor analysis. They had designed a battery of 56 psychological tests to this end, and as Thurstone (1946) would later describe,

[t]hese tests were designed so as to represent a wide variety of tasks which had been represented in previous studies of intelligence. Included in this battery were tests which called for verbal comprehension, verbal reasoning, various types of fluency, speed in simple numerical work, quantitative reasoning, various forms of induction, verbal, visual, and auditory associations, visualizing flat figures and visualizing solid objects, various forms of abstraction with verbal, numerical, and visual material, reasoning about mechanical movements, and memory for different types of content. This battery of 56 tests was given to several hundred student volunteers which required about 15 hours of work for each subject. (104–105)

The results from this study became the basis for Thurstone's 1938 book Primary Mental Abilities, in which he argued that there were seven primary abilities that were distinguishable as explanations for individual differences in test performances. These abilities, presented in the order of the proportion of covariance in the test battery that they explained, were as follows:

1. Verbal Comprehension (V)
2. Spatial Orientation (S)
3. Inductive Reasoning (R or I)
4. Number (N)
5. Word Fluency (W)
6. Associative Memory (M)
7. Perceptual Speed (P)

On the heels of this study, and with some confidence that the primary abilities satisfied the criterion of factorial invariance (see Thurstone, 1938, 1940), by 1949, the Thurstones had developed a series of tests at several difficulty levels that could be given to children in the United States from kindergarten through high school to measure factors V, S, R, N, and W. The battery remained in use through the 1960s.

8.5 Spearman on Defense

Do not go gentle into that good night,
Old age should burn and rave at close of day;
Rage, rage against the dying of the light.
—Dylan Thomas


8.5.1 Responses to Thomson

Spearman did not go gently. However, as time went on, his defense against perceived attacks on the two-factor theory was waged on increasingly mathematical terms. And as this began to happen, attention came to focus increasingly on different methods of analyzing correlation matrices, as opposed to the primacy of the two-factor theory; as Lovie and Lovie (1995) have argued, this led to Spearman losing intellectual ownership of the debate, especially following his retirement from University College London in 1931. Recall that the two-factor theory had begun as a relatively simple deductive argument that followed from the application of Yule's formula for a partial correlation. If the theory were true, a pure and predictable hierarchical relationship of proportional correlation coefficients would follow. The problem was that the correlation coefficients observed in empirical studies never fit the predicted hierarchy exactly, and as always, there were two competing explanations why: fallible data or a fallible theory. Perhaps unsurprisingly, the argument that Spearman consistently favored over his career was that differences between predicted and observed could, in a preponderance of cases, be attributed to sampling error in the choice of participants for the study or to a mistake made in the design of the study. Putting aside the second of these possibilities, establishing the first one clearly depends on the validity of a statistical criterion used to quantify sampling error, and through the early 1920s, Spearman faced consistent criticism from Karl Pearson, William Brown, and Godfrey Thomson in this regard.
So when Godfrey Thomson was able to show that an alternative theory was compatible with the Hart and Spearman (1912) criterion for comparing an observed correlation matrix with a predicted one, Spearman enlisted the help of the mathematical statistician Maxwell Garnett to make a stronger case connecting a hierarchical correlation matrix to the two-factor theory (Garnett, 1920) and the help of a talented student, Karl Holzinger, to tackle the problem of devising a more defensible statistical criterion for the evaluation of a hierarchical correlation matrix (Spearman & Holzinger, 1924, 1925, 1930).14 In this sense, Spearman was successful in convincing Thomson of the existence of a number of studies with correlation matrices that were, at least, consistent with the two-factor theory when evaluated using Spearman and Holzinger's improved statistical criterion. But while Thomson was willing to concede that a collection of mental tests could be used and interpreted descriptively for practical purposes "as if" there were a general factor that described their intercorrelations, he remained steadfast in the conviction that his sampling theory represented a theoretically superior explanation for the intercorrelations:15

This theory is preferred because it makes fewer and less special assumptions, because it is more elastic and wider, and because it is in close accord with theories in use in biology and in the study of heredity. (Brown & Thomson, 1940)

Theory vs. Method

245

An interesting aspect of Spearman’s debate with Thomson on this matter is that it predated the Neyman–Pearson conceptualization of significance testing, which formalized the concept of both a null and an alternative hypothesis and, through this, the concept of statistical power. As most of the empirical studies involving the analysis of mental test correlations had tended to have small samples through the 1920s, the power to detect tetrad differences greater than zero in absolute value was likely rather small. Thomson recognized early on that larger samples would be likely to show that observed correlations could never do better than offer a very close approximation to a true hierarchical arrangement, and as such, his sampling theory would always be a defensible explanation for data that Spearman could offer up as proof of the two-factor theory. Spearman, for his part, was willing to acknowledge the mathematical argument that g could be reexpressed as a function of smaller independent variables, but as this meshed neither with his model of human cognition nor with the purposes for which intelligence and aptitude tests were administered, he saw little point in such reexpression.
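Thomson’s point about small samples can be made concrete with a short simulation. The Python sketch below is illustrative only (the loadings are invented, and nothing like this code existed in the 1920s): it builds the population correlation matrix implied by a single general factor, verifies that the population tetrad difference r13·r24 − r14·r23 is exactly zero, and then shows how much more widely sample tetrads scatter at n = 30 than at n = 3000.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical loadings of four tests on a single general factor g.
lam = np.array([0.9, 0.8, 0.7, 0.6])

# Under the two-factor theory, the population correlation between tests
# a and b is lam[a] * lam[b], so every tetrad difference is exactly zero.
R = np.outer(lam, lam)
np.fill_diagonal(R, 1.0)
tetrad_pop = R[0, 2] * R[1, 3] - R[0, 3] * R[1, 2]   # 0 by construction

def sample_tetrad(n):
    """Tetrad difference computed from the correlations of a sample of size n."""
    X = rng.multivariate_normal(np.zeros(4), R, size=n)
    r = np.corrcoef(X, rowvar=False)
    return r[0, 2] * r[1, 3] - r[0, 3] * r[1, 2]

# With the small samples typical of 1920s studies, observed tetrads
# scatter widely around zero; large samples pin them down.
sd_small = np.std([sample_tetrad(30) for _ in range(500)])
sd_large = np.std([sample_tetrad(3000) for _ in range(500)])
print(tetrad_pop, sd_small, sd_large)
```

At small n the standard deviation of the observed tetrads dwarfs the one at large n, which is precisely why a deviant hierarchy could always be written off as sampling error.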

8.5.2 Responses to Wilson

In a fascinating account of the public and private exchanges between Spearman and Wilson that took place between 1928 and 1933, Lovie and Lovie (1995) make the case that Spearman’s response to Wilson’s critique was as much a negotiation as it was a defense. In his initial published response to Wilson, after agreeing that g could only be determined “within limits” (i.e., within the bounds of a standard error of measurement, a point Spearman had acknowledged in the appendix of The Abilities of Man, although it had not been stressed), Spearman (1929) offers a rather telling confession:

Wilson rather reproaches me with not bringing forward actual instances of g being satisfactorily determined. For, by reason of what both of us have said, such instances have not been furnished by other workers in this field. And as for making the required determinations myself, actual measuring has so far not been my job; I have only tried to show others how to do it. (213–214, emphasis added)

The initial “solution” to the determinacy problem that Spearman (1929) proposed was to suggest that for any battery of tests being used to measure g, one additional test known to correlate perfectly with g could be added, declaring that “unpublished work in our laboratory has more than once obtained for r_ag values of 0.99” (214). Of course, if a single test is a perfect measure of g, then there is no need for a collection of tests and no need to examine correlation hierarchies. All we need is the test score, and if a psychological argument could be given for why this single test was a valid measure of g, then, in Wilson’s
words, “fine!” Wilson (1929) must have been enjoying himself in his rejoinder, when as part of a parenthetical response he noted:

As we approach the ideal condition, r_ag = 1, the specific part of a diminishes; when r_ag = .99, √(1 − r_ag²) = .14 so that only fourteen percent of specific ability remains and that may be considered insignificant as Spearman seems to imply; being no psychologist I can hardly judge. Incidentally I may express my satisfaction and admiration that a method has been found that will make r_ag so large as 0.99. I hope soon to see this work in print and shall look forward with particular interest to some discussion of the annoying question as to how much of the unreliability of this test a should be attributed to variability in the g’s of the individuals at different times of mental testing, as their weights or pulse-rates might and would differ, and how much should be attributed to still outstanding imperfections of the method of measurement. (222)

The unpublished work Spearman had referenced never did make it to print. What Spearman would eventually propose, with some help from Garnett (1932) and following an out-of-the-blue contribution by Piaggio (1933), was that while g could never be exactly determined for any given individual, intervals defined by standard errors of measurement could be made increasingly precise by adding more tests to the battery, an idea quite similar in spirit to the Spearman–Brown formula discussed in Chapter 6. To Spearman, then, the issue of g’s determinacy was no more problematic than the determinacy of any physical measure, as these are also subject to measurement error, meaning we can never be absolutely sure that the measure we make equals the true value of interest. The question is whether the uncertainty is (or can be made) small enough to be considered negligible.
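The logic of shrinking the indeterminacy by lengthening the battery can be sketched numerically. The Python fragment below is a minimal illustration, not anything Spearman computed: it assumes a one-factor model with equal loadings of .6 (an arbitrary choice) and uses the squared multiple correlation of g with the battery, together with Guttman’s later bound of 2ρ² − 1 on the correlation between two equally admissible series of g scores, to show how quickly the indeterminacy recedes as tests are added.

```python
import numpy as np

def g_determinacy(loadings):
    """Squared multiple correlation (rho^2) of g with a battery of tests,
    assuming a one-factor model for unit-variance test scores."""
    lam = np.asarray(loadings, dtype=float)
    psi = 1.0 - lam ** 2                  # unique (specific + error) variances
    t = np.sum(lam ** 2 / psi)
    return t / (1.0 + t)                  # equals lam' Sigma^{-1} lam

# Illustrative batteries of equally saturated tests (loading .6 assumed).
rho2_3 = g_determinacy([0.6] * 3)
rho2_19 = g_determinacy([0.6] * 19)

# Guttman's bound: two equally admissible series of g scores can correlate
# as low as 2*rho^2 - 1, so adding tests shrinks the indeterminacy.
print(rho2_3, 2 * rho2_3 - 1)     # ~0.63 and ~0.26 with three tests
print(rho2_19, 2 * rho2_19 - 1)   # ~0.91 and ~0.83 with nineteen tests
```

With three such tests, two perfectly admissible g-score series could correlate as low as about .26; with nineteen, the floor rises above .8, which is the spirit of Brown and Stephenson’s reply to Wilson.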
Spearman would claim that the actual amount of measurement error in estimates of g from typical test batteries was not so large as to prevent its use for classificatory purposes (Spearman, 1934), and Brown and Stephenson (1933) suggested that with 19 tests in a battery (as opposed to the three that Wilson had used to illustrate the problem in Table 8.5), the indeterminacy of g due to measurement error would be negligible.16 The debate about factor score indeterminacy generated considerable heat throughout the 1930s and then largely disappeared from view for 30 years before it reemerged, briefly, in the 1970s (Steiger, 1979; Steiger & Schönemann, 1978). As for Wilson’s second critique, related to the admissible transformation of test scores, here Spearman found himself in quicksand, and his initial attempt to extricate himself did more harm than good. Wilson had granted the natural objection, one that Spearman was quick to voice, that the scores m_a x, m_b x, and m_c x are the results of an intentional design to measure something related to intellectual ability, while there is an infinite set of scores m_a′ x, m_b′ x, and m_c′ x that can be formed
through blind mathematical constructions. If, for example, the three hypothetical tests in Table 8.5 were tests of pitch discrimination, spelling, and single-digit multiplication, why would it be sensible to create three new test scores as differently weighted composites of these three original test scores? But consider the problem more carefully. A spelling or multiplication test will typically be composed of many different items, and a student’s score on the test will depend on the particular set of items chosen for the test. Even if items have been designed ahead of time to fall into unique, homogeneous, and mutually exclusive classes, then unless all these classes are known and specified in advance, there will always be implicit weights underlying an observed test score, and the nature of these weights will depend on the way that items are being implicitly sampled. By this reasoning, test designers are forming composites all the time, and the implicit weighting scheme will rarely be evident to the secondary analyst looking for correlational patterns. Spearman had also argued, on the grounds that negative w’s could be given no psychological justification, that Wilson should impose the constraint of only positive transformation constants. He hoped that this would reduce the number of admissible transformations that would “conserve” the g that emerged from the original measurements. But in fact, Wilson demonstrated that such a constraint would make it impossible to conserve g. The negative coefficients were not arbitrary; they were necessary conditions. Spearman’s only path out of this box was to constrain the number of new test scores eligible for creation to be smaller than the original number, k. With this requirement in place, it would be possible to restrict all transformation coefficients to be positive, and Spearman viewed this as being in close accord with standard test construction practices.
It is in this sense that Spearman was convinced the matter was resolved, although it was clearly not a conviction shared by Wilson17 (Spearman, 1934; Lovie & Lovie, 1995). In his two posthumous publications, where one might turn to find Spearman’s final word on the two-factor theory, issues related to the determinacy of g are greatly downplayed if they are mentioned at all (Spearman, 1946; Spearman & Jones, 1950), and, even more generally, in the modern literature of factor analysis (and its generalization, structural equation modeling), the moral of Wilson’s transformation critique—it all depends on the setup—is easy to overlook.
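That moral can be illustrated numerically. The Python sketch below is a hedged illustration, not Wilson’s own derivation: the loadings and the positive weight matrix are invented. Four tests that satisfy a one-factor hierarchy exactly are recombined into positively weighted composites, and the tetrad difference, exactly zero for the original tests, no longer vanishes for the composites.

```python
import numpy as np

# Four tests that fit a single common factor exactly (loadings invented).
lam = np.array([0.9, 0.8, 0.7, 0.6])
Sigma = np.outer(lam, lam)
np.fill_diagonal(Sigma, 1.0)

def tetrad(R):
    """One of the tetrad differences; zero under a pure one-factor hierarchy."""
    return R[0, 2] * R[1, 3] - R[0, 3] * R[1, 2]

# Recombine the tests into four composites y = T'x using only positive
# weights (T is an arbitrary, illustrative weight matrix).
T = np.array([[0.7, 0.1, 0.1, 0.1],
              [0.1, 0.7, 0.3, 0.1],
              [0.2, 0.1, 0.6, 0.2],
              [0.1, 0.3, 0.1, 0.7]])
C = T.T @ Sigma @ T                      # covariance matrix of the composites
d = np.sqrt(np.diag(C))
R_new = C / np.outer(d, d)               # rescale to a correlation matrix

print(tetrad(Sigma), tetrad(R_new))      # exactly 0 vs. clearly nonzero
```

The composites carry the uniquenesses of the original tests into their off-diagonal covariances, so the one-factor structure is destroyed: whether a battery “measures g” depends on how the scores were assembled in the first place.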

8.5.3 Responses to Thurstone

Ultimately, it was Thurstone’s generalization of multiple-factor analysis, and his program of research that accompanied it, that did the most to undermine not only the two-factor theory but also the methods Spearman had developed for studying it. Spearman (1939), although in his early 70s when Thurstone began to publish his work on multiple-factor analysis, certainly did not shy away from the challenge. Spearman immediately took issue with Thurstone’s approach to
factor rotation and what he regarded as the vague and subjective criterion of simple structure as a basis for allocating tests to distinct factors. The crux of Spearman’s argument was that by introducing multiple factors, rather than explaining additional variability, all Thurstone was accomplishing was a transformation and reapportionment of g into a larger number of factors. To a great extent, Spearman was right on this point, since Thurstone’s discovery of his seven primary abilities had come from a factor analysis with the constraint of an orthogonal rotation.18 When Thurstone instead allowed an oblique rotation of the axes, the correlations this permitted among his primary ability factors could, in turn, be subjected to a new factor analysis. This secondary factor analysis recovered a single common factor that could, as Thurstone eventually conceded, be interpreted as the reemergence of g. This debate, much like the one with Thomson over his sampling theory of ability, again hinged not on the mathematics but on the theoretical interpretation. Was g a composite formed through a linear transformation of primary abilities (Thurstone’s view)? Or were the “primary abilities” simply an equivocal and less reliable reexpression of g (Spearman’s view)?
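The reemergence of g from correlated primaries can be shown with a toy calculation. In the Python sketch below (the second-order loadings are invented for illustration), three correlated primary abilities are assumed to load on a second-order general factor, so their intercorrelations are products of those loadings; the proportional-hierarchy logic then recovers each loading from the correlations alone.

```python
import numpy as np

# Assumed loadings of three correlated "primary abilities" on a
# second-order general factor (numbers invented for illustration).
gamma = np.array([0.8, 0.7, 0.6])

# After an oblique rotation the primaries remain correlated; under a
# second-order model their intercorrelations are products of g-loadings.
Phi = np.outer(gamma, gamma)
np.fill_diagonal(Phi, 1.0)
r_ab, r_ac, r_bc = Phi[0, 1], Phi[0, 2], Phi[1, 2]

# The proportional-hierarchy condition lets each loading be recovered
# from the correlations alone: gamma_a = sqrt(r_ab * r_ac / r_bc), etc.
g_a = np.sqrt(r_ab * r_ac / r_bc)
g_b = np.sqrt(r_ab * r_bc / r_ac)
g_c = np.sqrt(r_ac * r_bc / r_ab)
print(g_a, g_b, g_c)   # recovers 0.8, 0.7, 0.6
```

The arithmetic is indifferent to the interpretive question: it cannot say whether g generates the primaries or is merely a composite of them, which is exactly where Spearman and Thurstone parted ways.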

8.6 Spearman’s Legacy

In the last chapter, Spearman’s theory of two factors was introduced as a linear mathematical model in which any observed test score in some collection of mental tests could be cast as the combination of two latent factors, one that was general to any test, which became known as g, and another that was specific, s. The correlation between g and each observed test, known in modern factor analysis as a loading, provides an indication of how much each test is “saturated” with g. Estimates of these loadings were derivable from the matrix of intercorrelations between the available tests through an application of Yule’s formula for a partial correlation, and with these in hand, g and s could be estimated as well. Next, we considered the method that Spearman introduced to corroborate his theory through an appraisal of the observed correlation matrix. Spearman had deduced that if the theory was true, it should be possible to arrange the correlation coefficients into a matrix that followed a proportional hierarchy. Rather than evaluate the matrix by inspection, Spearman proposed a formal test through the tetrad equations that could be formed by various combinations of four correlation coefficients in the matrix. If the distribution of observed tetrad differences could be shown to be no larger than would be predicted by sampling error, it could be taken as a corroboration of the two-factor theory. Spearman was intent on situating the two-factor theory within a larger, speculative “meta” model of human cognition, a model that he hoped could be refined over time to establish accepted quantitative and qualitative laws that could explain individual psychological differences. Within this meta model, Spearman interpreted g as a form of mental energy associated with neuronal activity, s as the engines
within the cortex dependent on this energy, and conative variables such as motivation as that which was needed to direct the energy from one engine to another. Spearman thought the two-factor theory could be used to design better batteries of intelligence tests and that these tests could, in turn, be used in support of some combination of meritocratic principles for social improvement. If Spearman devoted much thought to what it meant, exactly, to characterize his correlational analysis as measurement, he gave little indication in his writing. Generally speaking, Spearman equated measurement with the objective procurement of numeric values that could be used to test quantitative theories mathematically. In the present chapter, we have examined three of the biggest challenges to the two-factor theory. In two of these instances, the challenges came by way of competing theories of mental ability. Thomson demonstrated that a sampling theory of ability could produce correlational patterns among test scores that would be indistinguishable from the patterns Spearman had interpreted as corroboration of the two-factor theory. Thurstone, in developing a more general and exploratory approach to factor analysis, found empirical support for multiple common factors that he interpreted as primary cognitive abilities. Both of these challenges point to the inherent difficulty of establishing a causal hypothesis (i.e., the reflective measurement of a latent variable) through correlational analysis. Wilson’s critique of the indeterminacy of g raises some deeper questions about its measurability, questions, I should note, that would apply with equal force to the multiple factors in Thurstone’s model. In particular, Wilson’s concern about the results being sensitive to “the setup” suggests that the inputs to factor analysis, the underlying test scores, cannot just be taken at face value.
The debates that Spearman took on with Thomson, Wilson, and Thurstone (among many others) could be taken as evidence of a working Hegelian dialectic of scientific advance. That is, while it is generally believed that by the late 1930s there were two distinct “schools” of factor analytic traditions, one headed by Spearman and the other by Thurstone, in point of fact, both schools had come to similar findings in their studies of individual differences in cognitive abilities. Spearman had gone about this by following a confirmatory program of research that, over time, led him to modify or amend the two-factor theory when it did not agree with empirical evidence. By the time his posthumously published update to The Abilities of Man, titled Human Ability, appeared in 1950 (Spearman & Jones, 1950), Spearman had fully accepted (if not embraced) the existence of group factors, and indeed, some of the group factors that he named, such as a verbal factor, a word fluency factor, and a spatial factor, bore a fairly striking similarity to three of the seven primary factors Thurstone had identified. It had been in the process of generalizing Spearman’s methodological approach that Thurstone was able to establish a more encompassing and befitting mathematical framework for the study of individual differences, one in which Spearman’s method
of analyzing a correlation matrix could be cast as a special case. But in parallel fashion, even though the multiple-factor method had led him to embrace an exploratory approach and a theory of primary abilities, by 1949, Thurstone had conceded that g could be reconstituted from these abilities by conducting a hierarchical factor analysis. The bifactor model proposed by Holzinger and Swineford (1937) could be viewed as a compromise between the two positions (see also Schmid & Leiman, 1957; Yung, Thissen, & McLeod, 1999; Reise, 2012). It is remarkable to ponder all that these pioneers of quantitative psychology were able to accomplish during a first half of the 20th century marked by ongoing social upheaval and the international crisis of two World Wars. The price that was paid was heaviest on the eastern side of the Atlantic. In 1914, all research in Spearman’s psychological laboratory related to the two-factor theory was suspended, replaced by studies on night vision and auditory discrimination that might contribute to Great Britain’s success in the war. During this time, Spearman returned again to serve in the army (at the age of 51), and William Brown was placed in charge of a hospital for military officers suffering from physical and mental exhaustion.19 Thomson, for his part, had written his first critique of the two-factor theory, introducing the counterhypothesis of what became his sampling theory of ability, in 1914, but its publication and Spearman’s response had to be delayed two years because of the ongoing war. From 1916 through 1938, between the two World Wars, Spearman enjoyed a fulfilling professional career (maintained even after his official retirement from University College London in 1931), a career that he seems to have balanced with time spent enjoying the company of his large family and frequent opportunities to play tennis. All this was disrupted by World War II.
Spearman and his family were evacuated from London to the countryside during the bombings of 1939, and in 1941, Spearman’s son, a naval engineer, was killed during an evacuation following the Battle of Crete. One can only imagine the toll all this must have taken on Spearman. He died at the age of 82 in 1945 after falling from the window of his upper-story hospital room following a period of deteriorating health.20 A testament to Spearman’s impact was already evident in the obituaries published within a year of his death by former students, collaborators, and rivals alike. Most prominent among these were the obituaries written by Raymond Cattell,21 Karl Holzinger, Cyril Burt,22 Edward Thorndike, and Godfrey Thomson. Cattell (1945), at this point an accomplished professor of psychology at the University of Illinois, makes the astute observation that

[o]ne of the most fascinating aspects of Spearman’s mind and work, for the historian and the investigator of investigation, is the great discrepancy between what Spearman thought he was doing and what this generation of psychologists thinks he was doing. (88)


It is an observation that continues to resonate in modern times, and it likely applies not only to new generations of psychologists but also to many outside psychology who have taken up the methods that Spearman inspired. To the extent that they are aware of the historical context that inspired Spearman’s methods at all, they are likely to imagine a man who was single-mindedly dedicated to promoting a simplistic version of a method, factor analysis, that was inextricable from a correspondingly simplistic theory of human intelligence. Spearman, to be sure, did a lot stylistically that contributed to this caricature (Deary, Lawn, & Bartholomew, 2008). He was indeed single-minded when it came to his program of research and often destructively combative with those who raised questions about it. But, as described in the previous chapter, Spearman’s theory of general and specific abilities did not exist in isolation. Rather, it was part of a larger model that Spearman was contemplating to establish laws that would predict differences in human cognition. To Spearman, the goal was to establish laws that would be as “fundamental to psychology as Newton’s laws were to physics” (Cattell, 1945, 88), and with this in mind, he regarded the two-factor theory as one especially interesting piece of this larger puzzle—interesting because it involved quantities, g and s, and the occasional group factor, that were amenable to measurement, and interesting because one of these quantities, g, always managed to turn up in batteries with three or more test scores. What Spearman thought he was doing was scientific inquiry consistent with an idealized hypothetico-deductive approach. In 1904, he had gathered data relevant to a phenomenon that was not well understood and formulated a hypothesis (the two-factor theory) that could explain the patterns he had observed (a hierarchical correlation pattern).
From this he deduced consequences of his hypothesis that could be used to corroborate it with new data (the tetrad equations). He then sought out new experimental data, and if the new data failed to support the theory, he either scrutinized the method of falsification or modified the theory by setting new conditions for its corroboration (dissimilarity of tests, existence of group factors). There is surely some truth to this characterization, as one need only trace the theory of two factors across four key landmarks, from its origin in Spearman’s original 1904 publication to its rearticulation in Hart and Spearman (1912) to its most complete treatment in The Abilities of Man (1927) and finally in the update published after Spearman’s death (Spearman & Jones, 1950). Over this time, we can find numerous instances in which Spearman modified both the theory and the methods for corroborating it. Spearman could point out that he had not only anticipated the need to revise and refine the theory in the face of new evidence23 but had actually done so.24 And yet, there is a sense of unresolved frustration with Spearman that is palpable in the obituaries written by Thorndike (1945) and Thomson (1947). In reading Spearman’s body of work, there is no escaping the sense that he often seemed more interested in winning the argument than in genuinely trying
to understand the argument.25 Spearman typically adopted one of three tactics in the face of perceived criticism of the two-factor theory. The first tactic was to find the weakest point in an argument, magnify it, and compose a written response marked by condescension, sarcasm, and passive aggression.26 The second tactic, sometimes used in tandem with the first, was to accept the validity of the critique but to argue that he had already anticipated it, pointing the offending author to the passage and publication that had been overlooked. The third tactic, surely the most annoying, was to give the impression that the criticism, if read correctly, was actually a vindication. Sure enough, if one were to read only Spearman’s publications, one would come away with the impression that in the end, if Thomson, Thorndike, Kelley, and Thurstone were judged by their actions and not just by their words, they had all come around to a slightly modified version of the theory of two factors. In fact, what Thomson, Thorndike, Kelley, and Thurstone never came to accept was Spearman’s theoretical explanation for positive hierarchical correlations among mental tests, both in terms of his notion of g as mental energy and in terms of his noegenetic laws. They were willing to grant the existence of g as a statistical construct, one that could surely prove useful for a variety of predictive purposes but one that had little to offer in the way of coming to a better understanding of psychological laws. Where there was unambiguous admiration for Spearman was in his ability to devise mathematical treatments of psychological problems.27 But in his need to publicly respond to every perceived criticism of the two-factor theory, did Spearman do as much harm as good?
Instead of promoting a hypothetico-deductive approach to psychological science, had Spearman instead unwittingly played the part of the stubborn scientist blinded by preconceptions, who saw only what he wanted to see?28 The answer is that Spearman had done both. The proclivity to see what one is predisposed to see in the conduct of science is surely the rule, not the exception. What we should ask instead is whether Spearman was transparent in the evidentiary basis he used to test and modify his theories and whether he showed a willingness to submit his claims to the public scrutiny of his peers. In this, notwithstanding a lack of modesty and humility, his conduct was exemplary.29 It was Spearman who most consistently demanded that arguments about methodology be grounded in theory, not the other way around. In this way, he helped foster an atmosphere of productive tension that, for a time at least, kept the measurement of psychological attributes attached to the broader study of human psychology.

8.7 Sources and Further Reading

For those wondering about developments in the factor analytic study of human intelligence following the debates among Spearman, Thomson, and Thurstone, an excellent resource to consult is the 2007 book Factor Analysis at 100: Historical Developments and Future Directions, edited by Robert Cudeck and Robert
MacCallum. In particular, see chapter 11, “Understanding Human Intelligence Since Spearman,” by Horn and McArdle, who argue that subsequent empirical research has largely failed to corroborate the existence of g as a pervasive form of mental energy present to a greater or lesser extent in all acts of intelligence. Instead, the abstract reasoning ability that Spearman had found to be associated with tests loading strongly on g (i.e., eduction of relations and correlates) can now be identified with what Cattell (1963) described as “fluid reasoning” (Gf) in contrast to “crystallized knowledge” (Gc). Extensions of Cattell’s Gf–Gc theory have come to identify as many as nine distinct second-order factors with utility for the purpose of characterizing differences in human intelligence, and Horn and McArdle point out that this continues to be an evolving line of research. See Nisbett et al. (2012) and Kovacs and Conway (2016) for examples of interesting 21st-century developments. Horn and McArdle cover some of the same ground I have covered in these last three chapters. Other biographical accounts of various important episodes in Charles Spearman’s career can be found in Bartholomew (1995), Bartholomew et al. (2009b), Deary et al. (2008), Fancher (1985a, 1985b), Norton (1979), and the three articles by Sandy Lovie and Pat Lovie provided in the references. I found Spearman’s (1930) relatively short autobiography very interesting. Although Spearman’s most influential publications were his two 1904 papers, they are not especially easy to read and therefore not the best place to start in making sense of Spearman’s ideas about human abilities and the use of factor analysis as a tool for measurement. Instead, start with Spearman’s 1927 book The Abilities of Man: Their Nature and Measurement; all the mathematical details relevant to Spearman’s factor analytic approach are contained in the appendix of the book.

Appendix: Simulating Thomson’s Sampling Theory Model

As ingenious as Thomson’s simulation was—and more than 100 years later, it remains an exemplar—we can only wonder how much more quickly he would have been able to strengthen his argument with access to modern computing power. Just simulating a single small data matrix and performing the computations described earlier through the drawing of cards and rolling of dice would likely have taken Thomson and a team of assistants days of work. Today, with a little bit of computer code, a laptop computer can generate the same data and replicate the entire process a thousand times in a matter of minutes. An R script file written for this purpose is available upon request.30 Instead of testing for hierarchical structure in the resulting correlation matrices using Spearman’s statistical criteria, we can use matrix algebra to extract eigenvalues, form a scree plot, and conduct a parallel analysis (Horn, 1965; Cattell, 1966). Figure 8.3 shows the results from running a parallel analysis on the first random iteration of Thomson’s procedure, results that would be taken as strong evidence of a single common factor. Next, Thomson’s procedure is replicated 1,000 times, and I take the ratio of the difference between the first and second eigenvalues to the difference between the second and third to mimic the sort of analysis that would be performed as part of a scree test (Cattell, 1966). A rule of thumb often used to conclude that the correlations can be adequately explained by a single general factor is a ratio of 3 or higher. In my simulation, I find that a full 89% of the time, this ratio exceeds 3. All this bolsters Thomson’s central point, which is that even when data have been generated to correspond to a random sample of individuals taking tests composed of a random assortment of overlapping group factors, we are highly likely to conclude that a single general factor represents an adequate explanation for test score intercorrelations.
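For readers who want to experiment without the R script, the following Python fragment is a minimal sketch of the same procedure under stated assumptions: the pool size, number of tests, and subset sizes are arbitrary choices, and bonds are treated as “on or off” coin flips in the spirit of note 7.

```python
import numpy as np

rng = np.random.default_rng(1)
n_people, n_bonds, n_tests = 500, 100, 10

# Each hypothetical test taps a random subset of a shared pool of "bonds"
# (pool size, test count, and subset sizes are arbitrary choices here).
tests = [rng.choice(n_bonds, size=int(rng.integers(5, 40)), replace=False)
         for _ in range(n_tests)]

# Each person either possesses a bond or not (coin flips); a person's
# score on a test is the number of the test's bonds that they possess.
bonds = rng.integers(0, 2, size=(n_people, n_bonds))
scores = np.column_stack([bonds[:, t].sum(axis=1) for t in tests])

# No general factor was built in, yet the correlation matrix behaves as
# if one were present: the first eigenvalue dominates the rest.
R = np.corrcoef(scores, rowvar=False)
eig = np.sort(np.linalg.eigvalsh(R))[::-1]
ratio = (eig[0] - eig[1]) / (eig[1] - eig[2])   # scree-style gap ratio
print(np.round(eig, 2), round(float(ratio), 2))
```

Re-running with different seeds reproduces Thomson’s point: a dominant first eigenvalue appears run after run even though the generating model contains no general factor at all, only overlapping random samples of bonds.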

FIGURE 8.3 Parallel Analysis of Thomson’s Simulated Data. [Figure: scree plot of the eigenvalues of principal components and factor analysis, for actual versus simulated data, by factor/component number.]

Notes 1 Lovie and Lovie (1993) argue that this was not exactly an independent corroboration. Correspondence between Spearman and Burt leading up to the publication of Burt’s study indicate that Spearman played a very active role in shaping the analysis. 2 See www.ces.ed.ac.uk/old_site/SSER/about/test.html. 3 Thomson refned his new theory over a long series of articles: Thomson (1916, 1919a, 1919b, 1919c, 1920a, 1920b, 1924, 1927, 1934, 1935). The best summaries can be found in Brown and Thomson (1940) and Thomson (1951). Most of them were direct rejoinders to Spearman’s contributions to the debate: Spearman (1916, 1920, 1922a, 1923b, 1940). 4 At this point, Thomson showed that it was already possible to calculate the theoretical  intercorrelations between each pair of tests by using the formula Number of overlapping factors (see Brown & Thomson, 1940, 176 for details). For r= Geometric mean of total factors 2 example, the correlation between tests 2 and 3 would be r = = .22. However, 10 * 8 I just focus on the simulation of observed correlations that was Thomson’s last step. 5 The criterion that Hart and Spearman had developed involved computing the intercorrelation of columns of the correlation matrix and showing that these were statistically indistinguishable from 1. Thomson argued that this was the wrong criterion to be using because it was biased in favor of the hypothesis of hierarchical structure. See Brown and Thomson (1940, 179–183). 6 In his connectionist theory of learning, Thorndike had used the term bonds to characterize that which is formed in the brain through the experience of stimulus–response pairings. Thomson knew Thorndike well after having spent a year as a visiting professor at Columbia University in 1921 at Thorndike’s invitation, so it seems that Thomson applied the term in his honor. However, it also seems clear that Thomson

256 Theory vs. Method

intended a much more general interpretation of the term, as he believed that neural bonds could be both inherited and developed through experience and education.
7 In hindsight, because Thomson thought it might be most realistic to conceive of each bond as either "on" or "off" in any particular individual, he might have been better served to flip coins rather than roll dice in this stage of his simulation, or at least to have dichotomized the dice values.
8 We can see here that there are many variants Thomson could have applied for his simulation, and a strategy that Spearman took in his subsequent published debates with Thomson was to accuse him either of cherry-picking a single outcome from the single simulation variant most likely to approximate a hierarchical correlation matrix or of choosing a variant in which the sampling analogy was logically implausible (Spearman, 1916, 1920, 1922).
9 Thomson, like many statisticians of the past, present, and future, did a great deal of hand-waving to justify the treatment of an available sample of convenience as if it were a random sample from a defined population for the purpose of estimating sampling errors. See Brown and Thomson (1940, 183–188). The gist of Thomson's argument is that a collection of bivariate correlations from the same battery of tests will themselves tend to be correlated, and this can be explained in large part by the joint sampling of people and tests. See also chapter 20 in Thomson (1951).
10 Steiger (1979) describes Spearman's regression approach as the "construction approach," in contrast to Wilson's "range of possible solutions" approach.
11 As examples of tests that would fail this criterion, Spearman (1927a) gave academic measures with common content (Latin translation and Latin grammar; French prose and French dictation) and sensory measures that invoked a common process (counting letters in a passage one at a time vs. three at a time).
12 That is, given an n-by-k matrix A with rows composed of people and columns composed of test scores, one could always form a new matrix A′ by multiplying A with a k-by-k transformation matrix composed of subjectively defined weights.
13 It is outside the scope of this chapter to delve into the details of Thurstone's approach to exploratory factor analysis. However, this is easy to find in other places. Thurstone himself gave a complete exposition in his masterful 1947 book Multiple Factor Analysis. Another terrific presentation can be found in Thomson (1951). For a nontechnical primer on the concept of rotation, see Gould (1981). For a more recent textbook treatment, see Gorsuch (1983).
14 This criterion was the standard error for a tetrad difference.
15 See Bartholomew, Deary, and Lawn (2009a) for a modern application of Thomson's sampling theory.
16 This was a disingenuous argument in the sense that, just as there are practical limits to the length of any single test composed of items that would follow the Spearman–Brown formula, there are surely limits to the number of tests that could be realistically included in a battery. After all, not only are there limits in the time available to administer additional tests within a finite period, but as more tests are added, it becomes increasingly likely that they will have overlap in their specific factors. So the additional tests may, paradoxically, undermine the measurement of the very attribute they are intended to measure with greater precision.
17 Wilson published three additional commentaries in the Proceedings of the National Academy of Sciences in the same year as Spearman's 1933 attempt to present a consensus position (Wilson, 1933a, 1933b, 1933c). In these it is evident that the crux of his concerns about the invariance and uniqueness of g remained, though he does not explicitly respond to Spearman's comments. As Lovie and Lovie (1995) suggest, it appears that Spearman and Wilson had come to a gentleman's agreement of some sort.
18 In Thurstone's autobiography, he attributes this decision to the advice he had received from Thorndike and Kelley. "Although my first text on multiple-factor analysis, The Vectors of Mind, had previously been published (1934), with a development of the concepts of communality, the rotation of axes, and the use of oblique axes, I hesitated to introduce all of these things in the first experimental study. In particular, there was strong advice from Thorndike, Kelley, and other men for whom I had respect, that an oblique reference frame would be completely unacceptable. Instead of proceeding according to my convictions, that first factor study was published with the best fitting orthogonal frame, although we knew about more complete methods. This was an effort to avoid the storm of controversy that we feared in the introduction of so many different procedures in the first experimental study" (Thurstone, 1952, 316).
19 Brown's experiences in this capacity had a major impact on the direction of his career, as he became increasingly interested in clinical problems related to psychotherapy, the study of personality, and, later, essays on war and peace.
20 According to interviews with Spearman's remaining family by Lovie and Lovie, it was thought that the fall was not an accident but a purposeful decision. In the years following his son's death in 1941, Spearman began experiencing frequent blackouts, and his death was precipitated by a case of pneumonia.
21 Cattell had interacted and studied with Spearman as a graduate student at University College London in the late 1920s. Cattell's PhD advisor was Spearman's close friend Francis Aveling.
22 It is a bit tricky to characterize Burt's relationship to Spearman. Was he a rival or a former collaborator? Although his obituary of Spearman was generally laudatory and echoes many of the notes found in the obituaries by Cattell and Holzinger, it is notable that already here we see signs of Burt downplaying Spearman's status as the originator of factor analytic methods. See Lovie and Lovie (1993).
23 Spearman (1928), for example, in defending the theory of two factors from an attack by Pearson, would point to a sentence on page 160 of The Abilities of Man in which he had written, "As originally, so now once more, the plea must be urged that all conclusions drawn in the present work are subject to 'inevitable corrections and limitations'" (95). At the same time, these qualifications were often swamped by surrounding rhetoric that was much more grandiose. Similarly, with group factors, he acknowledged them but tended to play down their importance. Although he devotes a chapter to group factors in The Abilities of Man, he concludes the chapter by asserting that "cases of specific correlations or group factors have been astonishingly rare" (Spearman, 1927a, 241).
24 A great example of this can be found in the comment published by Spearman following the publication of Kelley's 1928 book, Crossroads in the Mind of Man. Kelley, much like Thorndike (1924), had essentially argued that evidence of a verbal factor could be used to reject the two-factor theory. Spearman (1929b, 562) objects, writing, "In a very large number of subsequent publications, it [the two-factor theory] has continually been rendered more precise, more complete, and more secure from possible objections." Regarding the verbal factor in particular, Spearman writes, "I am willing to concede that in the last year or so my own views have received more development on this point than on any other" (Spearman, 1929b, 563).

25 Nowhere was this truer than when it came to his feud with Karl Pearson. See Lovie and Lovie (1995).
26 Although they could be extremely entertaining. See how he sets up one of his recurrent punching bags, Walter Dearborn, in Spearman (1931, 402): "To begin with, I should like to congratulate him on his choice of objective. Certain critics of our theory of Two Factors make a pretence of attacking it fundamentally when in truth they are only dealing with unimportant details; what is really essential in our theory they tacitly appropriate to themselves (mostly under new names). Not so Professor Dearborn. The point of attack chosen by him is both novel and vital. He would appear to charge us with nothing less than plagiarism; the very hub on which our theory rolls, the general factor itself, this he accredits to our earliest antagonist, Binet."
27 Ironically enough, Spearman's least controversial and universally appreciated contribution made little attempt to connect math directly to a psychological problem. This was Spearman (1913), in which he introduces derivations for formulas that can be applied when taking the correlations of sums and differences.
28 Thorndike didn't go this far, but his mixed feelings about Spearman were obvious: "He and psychology suffered from intensity of his drive to protect and promote his theories. It would have been better if he had let the two factor theory fight more of its battles unaided. . . . If I thus regret that he did not do what he might have done, it is because I value so highly for psychology and for my own development what he did do" (Thorndike, 1945, 560).
29 By far the most humanizing portrait of Spearman comes from Karl Holzinger's obituary. Holzinger remembered Spearman as a man of great energy, generous with his students, and loving to his family. In contrast to the persona that came across in Spearman's writing, Holzinger remembered him as a man with charm and a great sense of humor. Among Holzinger's many anecdotes, the one that most countered my own preconceived image of Spearman was his preference to be lodged at the International House during his visits to the University of Chicago because he so enjoyed the opportunity to chat with a diverse mix of students in the lounge and dining room (Holzinger, 1945, 232).
30 The script file has been modified slightly from a publicly available source file written by Cosma Shalizi, an associate professor in the Statistics Department at Carnegie Mellon University.

9 THE SEEDS OF PSYCHOMETRICS
Thurstone's Subjective Units

9.1 Overview

To the extent that there was a field of study emerging in the early 20th century which took the measurement of human attributes as its focus, it was a field that was a bit hard to pin down. The clearest direct descendant of Galton's program of study around individual differences and heritability was Karl Pearson, who described what he was doing as "biometrics" and founded the journal Biometrika as an outlet for the line of work. The influence of Weber and Fechner was most clearly evident in the emergent field of experimental psychology spearheaded by Wilhelm Wundt through the formation of his laboratory at the University of Leipzig. That laboratory became something of a factory for the production of psychology PhDs. A more emergent specialization was educational psychology, in which one could find some attempts to marry the approaches to measurement that Galton and Fechner had helped introduce. It was within this domain that Thorndike, Spearman, and Binet, among others, would emphasize the role that measurement could—or should—play in the study of human cognition, learning, and development. The explosion of interest in intelligence testing between 1910 and 1930 also had the effect of creating a de facto subfield within educational psychology focused on mental testing, and a subfield within this subfield with a primary focus on tests of academic achievement. This niche became known as educational measurement (Lindquist, 1951). What was also about to come into existence in the early 1930s was a field of study and practice that overlapped whatever boundaries existed between the emerging traditions of experimental psychology, educational psychology, and educational measurement. It would go by the name of psychometrics. And the person whose work and influence did the most to breathe life into psychometrics and give it an identity was Louis Leon Thurstone.

DOI: 10.1201/9780429275326-9

260 The Seeds of Psychometrics

As we saw in Chapter 3, it was Galton who had defined "psychometry" as "the art of imposing measurement and number upon operations of the mind" (Galton, 1879b, 149). The active work along these lines came primarily from the psychophysics tradition, where Müller (1879) referred to curve-fitting activities associated with the constant method as fitting a psychometric function. However, I would argue that a field concerned with "psychometrics" really only came into being in 1936, with the publication of J. P. Guilford's book Psychometric Methods and with the founding of the Psychometric Society and its flagship journal Psychometrika. As Guilford would point out, two distinct approaches to psychological measurement had emerged in the early 1900s, one premised on the experimental methods of psychophysics and the other on the correlational study of individual differences through the administration of mental tests. Any astute reader would have noticed that Thurstone was a common denominator between the approaches presented in Guilford's chapters. In Thurstone's cumulative work, we can find a synthesis of the methods and conceptualizations of measurement for human attributes that had been introduced by Fechner, Galton, Binet, and Spearman. At the same time, Thurstone avoided or purposefully rejected the aspects of these methods and conceptualizations that had inspired the most controversy (e.g., Fechner's metaphysics, Binet's age units, and Spearman's dogmatism) or, in the case of Galton's eugenics, would eventually inspire the most disdain.
Thurstone had the enviable ability to take stock of the prevailing methods of inquiry in education and psychology, prune what he regarded as their weak points, place them within a more general mathematical framework, and from this framework not only establish new methods but also apply them to the study of a broad array of timely social issues, such as the structure and development of intelligence, societal attitudes, and even the effect of propaganda on these attitudes. To do this in a single area would be remarkable enough; Thurstone did it in at least three: factor analysis, as inspired by Spearman; the scaling of mental tests, as inspired by Galton, Binet, and Thorndike; and psychophysics, as inspired by Fechner. In this chapter, I focus attention on the last of these, because it represents the clearest articulation of Thurstone's conceptualization of psychological measurement.

9.2 Thurstone's Background

Louis Thurstone was born in 1887, the second of two children of parents who had immigrated to the United States from Sweden as the Thunströms. His father eventually changed the family name from Thunström to Thurstone because no one could pronounce the original correctly. From an early age, Louis Thurstone had demonstrated talents as an engineer that predisposed him to seek out quantitative solutions to social problems. As a teenager, he


FIGURE 9.1 Louis Leon Thurstone (1887–1955) and Thelma Gwinn Thurstone (1897–1993).
Source: © Getty Images.

had already succeeded in having a letter published in Scientific American.1 By the time he was 25, he had come up with a novel design for a movie camera and pitched it to Thomas Edison, who turned down the idea but did give Thurstone an internship. Edison made an impression; Thurstone would later recollect that "for every experimental failure he seemed to produce three more experiments to try" (Thurstone, 1952, 299). Thurstone earned a Master of Engineering degree at Cornell University and along the way sat in on a lecture by E. B. Titchener in the psychology department. Titchener, an émigré from the United Kingdom, was the foremost ambassador of Wundt's experimental psychology and, within this, the methods of psychophysics. Thurstone later recalled as his lasting memory that while the topic of Titchener's lecture was interesting, his delivery had been "extremely formal and pompous" (Thurstone, 1952, 298). In this sense the delivery was in sync with the material in Titchener's textbooks (e.g., Titchener, 1905), which


consisted predominantly of rules and procedures for the conduct of psychological experiments, with all the examples coming in the context of physiology and sensation. It was unlikely to be the sort of thing that captured the imagination, and it would be another 15 years before Thurstone would revisit the topic. In 1914, he enrolled in the PhD program at the University of Chicago, and midway through its completion, Thurstone was recruited to the Department of Psychology at the Carnegie Institute of Technology in Pittsburgh, where he became an instructor and then, after he had finished his dissertation in 1917, a professor. By 1920, he had been promoted to head the department, a position he kept through 1923 before returning to the University of Chicago. During this period, Thurstone had been exempt from enlistment in World War I because he was underweight. He served in other ways, as an assistant to E. L. Thorndike in a statistical unit established by Yerkes to analyze the results of the Army Alpha tests (Carson, 1993). Thurstone is probably best known for his reconceptualization of factor analysis. This began with his groundbreaking article and then book Vectors of Mind (Thurstone, 1934, 1935), continued with a new monograph, Primary Mental Abilities (Thurstone, 1938), that provided what was at the time one of the strongest empirical challenges to Spearman's theory of g, and culminated in the book Multiple Factor Analysis (1947), which was essentially a revision of Vectors of Mind. This last book was the most complete and comprehensive account of Thurstone's approach to factor analysis, and it remains the animating spirit behind exploratory factor analysis to the present day.
There is, however, a strong case to be made that the most impressive stretch of Thurstone's career took place before he started work on factor analysis, during his first seven years as a newly appointed professor in the University of Chicago's Department of Psychology, between about 1924 and 1931. Thurstone began this period with the publication of the book The Nature of Intelligence (1924), in which he presented hypotheses about cognitive functioning that were distinct from all the prevailing theories of the time. Over the next five years, he took on what amounted to three simultaneous lines of work: assembling the materials for a textbook on mental test theory that focused on the concepts of reliability and validity (this was never published but was widely circulated as Thurstone, 1931a), developing new methods for the scaling of educational and psychological tests, and reconceptualizing psychophysics into a method of psychological measurement with potential applicability well beyond the measurement of sensation. To a great extent, the motivation for all three of these contributions came from Thurstone's appraisal of weaknesses in the established body of knowledge around mental testing and psychophysics when he was called on to teach these topics to his students. Although I do not cover it in the sections that follow, a few words are in order about the second of these three contributions, first introduced as a method


of absolute scaling for educational and psychological tests but now known as the method of "Thurstonian scaling" in the literature on educational measurement. By the early 1920s, there were two prominent methods for placing achievement and intelligence test scores on a scale of measurement: the "mental age" method that had been introduced by Binet and Simon (1916 [1908, 1911]) and further popularized by Terman (1916), and a "point scale" method that had been championed by Yerkes (1917) and Otis (1918). Thurstone saw significant problems with the construction and interpretation of a mental age scale, but there were also problems with existing point scale methods as tools for answering emerging questions about the growth of intelligence over time. Although the original purpose of intelligence tests as Binet had conceived them had been to provide fundamentally normative information that could be used for diagnostic purposes, the use of the tests prompted questions about what could be inferred about the chronological development of intelligence. Thurstone's contribution in this context was to make explicit that the goal of any testing enterprise enacted for the purpose of making inferences about growth had to be a scale that could satisfy the requirements for order, distance, and origin typical of physical scales of measurement. Along these lines, Thurstone (1928a) wrote most pointedly that "[t]he whole study of intelligence measurement can hardly have two more fundamental difficulties than the lack of a unit of measurement and the lack of an origin from which to measure!" (176). The approach to "absolute scaling" that Thurstone introduced and developed between 1925 and 1931, beginning with Thurstone (1925), was essentially an elaboration of Galton's method of relative measurement (see Chapter 3) that had been applied by Thorndike (1910, 1911, 1913) in the context of creating scales for the legibility of handwriting, the quality of writing, and achievement in drawing.
Thurstone’s method was premised on the assumption that a series of tests targeted to children by age measured a common attribute that was normally distributed. Given a design in which there were subsets of items in common for children of adjacent ages, one could use the proportion of these items answered correctly in conjunction with the inverse of the normal cumulative distribution function to link the scale for the lower age test to the scale for the upper age test (or vice versa). In demonstrating this approach, Thurstone noticed an important hidden assumption of the method as it had been implemented by Thorndike: that the variability of the attribute would remain constant over time. Thurstone’s scaling approach relaxed this assumption, and as a consequence, he was able to demonstrate that—contrary to conventional wisdom—there was evidence that variability in intelligence increased with age and showed no evidence of signifcant deceleration by the age of 14. Thurstone also came up with an innovative empirical method to establish a location for an “absolute zero” of intelligence on his scale, one that corresponded with a predicted chronological age of 2 months after conception. Descriptions and applications of Thurstone’s method of absolute scaling in the context of educational tests can be found in Bock (1997); Williams,


Pommerich, and Thissen (1998); and Kolen and Brennan (2004). A limitation of the treatment in Kolen and Brennan is that it focuses on the procedural aspects of the method without providing much context for the problem Thurstone was trying to solve and the emphasis he placed on establishing a meaningful unit of measurement. At the same time, one can feel some sympathy for this disconnect, because Thurstone himself may have realized, in hindsight, that the scaling method he had applied in the context of intelligence tests was somewhat inconsistent with his own theoretical perspective, bolstered by his empirical research using factor analysis, that intelligence is a multidimensional attribute. Nonetheless, two aspects of Thurstone's absolute scaling approach are features that would also appear in the approach he would take to reconceptualize the analysis of psychophysical experiments. First, he had not only developed a method that leveraged the assumption of normality but also provided a means for an empirical check of the plausibility of the assumption. Second, he had formalized the scaling problem in terms of a system of equations and, in doing so, had allowed for differences in variability when parameterizing the system of equations. But before we turn to the details of Thurstone's approach to psychological measurement, which he later described in his autobiography as "psychological measurement proper" (Thurstone, 1952, 306), there is one more critical aspect of his background that needs to be appreciated, and that is the role played by his wife, Thelma Gwinn Thurstone. Thelma Gwinn was very much a pioneer in her own right, an accomplished academic in a field dominated by men. She had graduated from high school at the age of 15 and earned undergraduate degrees in both German and education from the University of Missouri.
She later completed a master's degree in psychology at the Carnegie Institute of Technology, which is also where she met Louis, who had already become chair of the department by that time. They were married shortly after a one-year stint in 1923 working together on the improvement of civil service examinations at the behest of the Institute for Governmental Research in Washington, D.C. When Louis was recruited to the faculty of the psychology department at the University of Chicago in 1924, Thelma enrolled in the graduate program there and earned her PhD two years later, in 1926. Her dissertation study, which focused on the empirical relationship between item difficulty and discrimination in mental testing, was eventually published in the Journal of Educational Psychology (T. G. Thurstone, 1932). Thelma was the operational arm of the Thurstone partnership, and between 1923 and 1948, she played a lead role in the creation and development of high school- and college-level aptitude tests for the American Council on Education. It was largely through this work that Louis Thurstone was able to fund his program of research. As Thurstone would reflect in his autobiography,

Thelma has the outstanding achievement in our family in managing an active household at the same time that she was professionally active. She


has been a partner in every research project in the Psychometric Laboratory. For many years she was in the laboratory daily, helping to plan the projects, supervising most of the test construction, and participating especially in the psychological interpretation of results. In 1948 she left this work to become director of the Division of Child Study in the Chicago Public Schools. This report should really have been written as a biography for both of us. (Thurstone, 1952, 321)

It should take nothing away from the brilliance of Thurstone's accomplishments to point out that the meteoric portion of his career began the same year that he married Thelma Gwinn. It seems unlikely that this was purely coincidental. Thelma, it should be emphasized, was more than just a partner to Louis in his research (e.g., Louis credited her for the critical reading she provided of the first draft of Vectors of Mind). She also developed her own independent professional identity as an instructor at Chicago Teachers College between 1942 and 1952, a visiting professor at the University of Frankfurt in Germany in 1948, and director of the Division of Child Study for Chicago Public Schools. Much of her energy in the latter position was devoted to the improvement of school services for children with special needs. Louis Thurstone died at the relatively early age of 68 in 1955, just a few years after the Thurstones had left the University of Chicago to join the faculty at the University of North Carolina at Chapel Hill (Louis in psychology, Thelma in education), where they had established a Psychometric Laboratory that still exists to the present day and has been named in his honor. Thelma Thurstone stayed on as a professor of education through 1968 and then continued to work on projects at the Psychometric Laboratory through 1984. She passed away in 1993 at the age of 95.

9.3 Toward Psychological Measurement

9.3.1 Discriminal Processes

The broadened conceptualization of psychophysics that Thurstone would introduce between 1927 and 1930 was apparently spurred by his irritation with having to teach the topic to his students. He regarded classical psychophysics as the "dead subject of lifted weights" and the "dullest part of psychology" (Thurstone, 1959, 15):

When I started to teach psychological measurement in 1924, it was natural that I should encourage the students to learn something about psychophysical methods. The standard reference was, of course, the two big volumes on quantitative psychology by Titchener. The determination of a limen [a just noticeable difference] was the basic problem in old-fashioned psychophysics. In order to be scholarly in this field, one was supposed to


know about the old debates on how to compute the limen for lifted weights to two decimal places with a standard stimulus of one hundred grams. One could hardly worry about anything more trivial. Who cares for the exact determination of anybody's limen for lifted weights? In teaching this subject I felt that we must do something about this absurdity by introducing more interesting stimuli. (Thurstone, 1952, 306–307)

In Chapter 2, we took a close look at psychophysics in the context of Fechner's lifting of weights experiment using the method of right and wrong cases, which became known more generally as the constant method. To recap, a constant method experiment always included a set of n stimuli in the form of weights that could be ordered by magnitude from least to greatest, with each weight in the set differing by a fixed increment. The premise of classical psychophysics was that each stimulus value in the series, xj, can be associated with a corresponding sensory intensity, Yj. Fechner had attempted to use the association between stimulus differences and the responses to these differences to build a sequence of measurement units for sensation intensity, Y, in terms of just noticeable differences (jnds). In this way, unknown units of psychological magnitudes could be understood with respect to known units of physical stimuli. The break from classical psychophysics that Thurstone was intent on making was to loosen the restriction that these stimuli, xj, needed to be physical magnitudes. When a person was exposed to pairs of physical stimuli in an experiment such as the constant method, the comparisons being made for each pairing were based on a subject's ability to discriminate the order of the attribute as opposed to its magnitude—is weight 1 heavier than weight 2? Is tone 1 louder than tone 2? In this sense, the only requirement in such experiments was that the stimuli to which a person is exposed can be ordered.
But in that case, Thurstone must have wondered, why restrict attention to weights and sounds? In particular, if psychological qualities are at least theoretically orderable, then if some instantiation of these qualities can be presented as stimuli, they can also be compared. And if they can be compared, then a proportion of cases in which stimulus 1 is greater than stimulus 2 can be found, and this proportion can become the basis for locating each stimulus on a psychological continuum. At the same time, if the stimuli are no longer physical quantities with known units, then it no longer makes much sense to speak of jnds, let alone use a jnd to define the unit of measurement for a psychological attribute of interest. What would take the place of the jnd? The idea that Thurstone would turn to as a rationale for the construction of a psychological continuum with an estimable unit of measurement was the concept of a discriminal process. The concept may seem hopelessly nebulous when presented in the abstract as I am about to do, but more concrete examples will follow. Bear with me. Figures 9.2 and 9.3 present the four original graphical figures that Thurstone used in sequence to introduce the concept of a discriminal process in Thurstone

FIGURE 9.2 Thurstone's First and Second Graphics Depicting Discriminal Processes.
Source: Thurstone, 1927a. From American Journal of Psychology. © 1927 by the Board of Trustees of the University of Illinois. Used with permission of the University of Illinois Press.

FIGURE 9.3 Thurstone's Third and Fourth Graphics Depicting Discriminal Processes.
Source: Thurstone, 1927a. © 1927 by the Board of Trustees of the University of Illinois. Used with permission of the University of Illinois Press.


(1927a). The graphic on the left of Figure 9.2 depicts a sequence of connected circles with the labels "R" and "S" arranged in increasing order from R1 to R7 and S1 to S7. Thurstone connected these two series to an individual person in the following way: the R series represents discrete stimuli that can be manipulated and intentionally presented to a person; the S series represents parallel psychological or physiological responses associated with each stimulus in the R series. Now, in classical psychophysics, all stimuli would consist of different magnitudes of physical quantities, so the distances between any two values of R could be measured as magnitudes. By contrast, Thurstone was envisioning a scenario in which stimuli could only be theoretically ordered from lowest to highest according to some attribute of interest. Hence, there was no assumption that the distance from R1 to R2 was the same as R2 to R3 or R3 to R4 and so on, which is why there are no vertical lines connecting the Rs or the Ss. Examples of the sorts of qualitative stimuli Thurstone had in mind for the R series included distinct handwriting specimens, children's drawings, or opinions on social issues. So in place of physical magnitudes, which had both a known order and magnitude, it was only necessary to assume that handwriting could be ordered from least to most legible, children's drawings from least to most creative, and opinions from least to most controversial. Thurstone (1927a) described each response value of the S series as "the process by which an organism identifies, distinguishes, discriminates or reacts to stimuli" (369). In Thurstone's model, for every uniquely ordered external stimulus, there is a corresponding internal discriminal process. The twist comes in the graphic on the right-hand side of Figure 9.2, which indicates that the relationship between R and S is not deterministic but probabilistic.
Here we see that for a specific stimulus value, R5, there are different possible values for S. This implies that when a person encounters stimulus R5, we should anticipate that this will trigger discriminal process S5 (notice the thickness of the line connecting the two) but that if the same person were again exposed to R5, there is some chance that this might instead trigger discriminal process S4 or S6, S3 or S7, or even S2 or S8. It was this hypothetical variability, which Thurstone called “discriminal dispersion,” that he would ultimately use to define a unit of measurement for a psychological continuum underlying the S series. Before we go on, a modern-day context might help clarify what Thurstone was describing. At the end of each season of play in the National Basketball Association (NBA), a single player is selected as the league’s most valuable player (MVP). The choice is made by 100 members of the national media that cover the sport. Each person votes by providing a rank of their top five players for the award, and points are awarded to players according to the cumulative number of first- through fifth-place votes a player receives (first-, second-, third-, fourth-, and fifth-place ranks receive 10, 7, 5, 3, and 1 points, respectively). At the culmination of the 2019–2020 season, a total of 12 players received at least


one vote. The top five players included the winner, Giannis Antetokounmpo, who earned 962 points, followed by LeBron James with 753, James Harden with 367, Luka Dončić with 200, and Kawhi Leonard with 168. The choice of MVP is subjective. Some voters focus on a player’s counting statistics, such as the average number of points, assists, and rebounds per game, or on more advanced statistics that have been invented to capture the varied contributions that players can make toward their team’s success. Other voters may be swayed by more qualitative “narratives” that may bolster the cases of certain players more than others. For the 2019–2020 season, the voters seem to have reached a fairly clear consensus about an overall ordering among top players in contention for the award. Placing all this in the context of the kind of challenge of psychological measurement that Thurstone was seeking to address, imagine that we are focusing attention on the attribute of “value to one’s team” and we ask whether it would be possible to locate this attribute on a psychological continuum with a designated unit of measurement. Where does the concept of a discriminal process arise in this scenario? We would begin by thinking of the profiles of the 12 NBA players who received some consideration for the award as 12 “stimuli” to which the media are exposed in the process of casting votes. Now, let R9 represent the highest-end stimulus a voter could observe for an MVP candidate and R1 a lower-end stimulus. Imagine further that when it comes to the contributions of James Harden, the truth is that he is at the higher end, say, at R7. Nonetheless, when a voter is asked to think about James Harden on a particular occasion (perhaps looking over his statistical profile, recalling highlights from his season, etc.), this triggers a discriminal process that might be akin to S9 in the left-hand side of Figure 9.3. 
This might be very close to the ideal conception the voter has for an MVP archetype. But on a different occasion, when the same voter thinks of Harden, it may instead trigger a discriminal process that is more akin to S3—much lower on the voter’s internal, qualitatively ordered series of NBA value archetypes. Now, if it were possible to repeat this process with the same voter on an infinite number of occasions, the idea is that the most commonly observed discriminal process, S7, would be in accord with Harden’s true location in the R series, R7. But on any given occasion, this will vary. We can imagine a similar thought experiment when the play of Kawhi Leonard is the stimulus. Imagine that Leonard’s true location is R5, as shown in Figure 9.3. Again, on different occasions, this might induce different discriminal processes for the same voter. One thing to notice in the contrast between hypothetical discriminal processes induced by Harden and Leonard for the same voter is that the variability in the different S values for Harden is shown to be greater than the variability in the different S values for Leonard. In other words, the discriminal dispersion associated with R7 is much greater than that of R5. Now, if it is really true that Harden is qualitatively higher than Leonard in terms of the value he contributes to his team, then on average the discriminal process


Harden will induce in the voter will be higher than the process induced by Leonard. But on any one occasion it need not be, which means there will be some distribution of different possible S_Harden − S_Leonard differences. As a model, the scenario we are considering may begin to seem familiar. Let j index one of N unique and orderable stimulus values. It follows that

$$S_j = r_j + \varepsilon_j. \tag{9.1}$$

What we have is essentially the linear error model we encountered in Chapter 6 in the context of Spearman’s formula for the disattenuation of a correlation coefficient, with the distinction that so far nothing about the model involves something that can be empirically observed. The discriminal process within a person is being cast as a variable with a fixed component, a specific value in the R series, rj, and a random component, εj. The last important step in Thurstone’s model was to posit that the random error component of the model was normally distributed, and this is depicted visually on the right-hand side of Figure 9.3. Therefore, we can say that S is a normally distributed random variable with an expected value of rj and a variance of σj². The expected value is the mean discriminal process across infinite replications, and the standard deviation, which can differ depending on the value of j, is the discriminal dispersion. It was an estimate of the latter that was needed to establish a unit of measurement for a psychological continuum, and the choice can have some consequence for scale interpretations (as illustrated by the differences in the units of measurement implied by the two outer vertical axes in the left-hand graphic of Figure 9.3). If the progression depicted in Figures 9.2 to 9.3 makes some sense to you as a model for how people make judgments about the attributes of objects and events in the social world that they experience qualitatively—whether it is the value of NBA players to their teams or the attitude of a person toward the importance of social justice—then the details of Thurstone’s law of comparative judgment (presented in the following section), and its applications for psychological measurement (presented in Section 9.4), are likely to resonate. If not, the approach may well seem a bit preposterous. We will return to this issue in Section 9.5. 
Looking ahead, however, it is not entirely clear that Thurstone took the metaphysics of a discriminal process combined with a discriminal dispersion all that seriously. Rather, what he needed was a rationale that would allow him to conceptualize each individual stimulus as a trigger for the realization of a random variable with a normal distribution. He seems to have invented the discriminal process and discriminal dispersion as metaphors to this end.
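As a concrete sketch of this setup, the discriminal process of Equation 9.1 can be simulated as a normal random variable. The player locations and dispersions below are invented for illustration (they are not values Thurstone reported); the point is only that a larger average location can coexist with occasional reversals in a single comparison:

```python
import random

# A minimal sketch of the discriminal process model: each encounter with a
# stimulus located at r_j triggers S_j = r_j + e_j, with e_j normal.
# All numeric values here are hypothetical.

def discriminal_process(r_j, sigma_j, rng):
    """One realization of the discriminal process for a stimulus at r_j."""
    return rng.gauss(r_j, sigma_j)

rng = random.Random(1927)
r_harden, sigma_harden = 7.0, 2.0    # hypothetical location and dispersion
r_leonard, sigma_leonard = 5.0, 1.0  # Harden's dispersion assumed larger

# Across many replications, the mean difference recovers r_harden - r_leonard,
# but on a nontrivial share of single occasions the order reverses.
diffs = [discriminal_process(r_harden, sigma_harden, rng)
         - discriminal_process(r_leonard, sigma_leonard, rng)
         for _ in range(10_000)]
mean_diff = sum(diffs) / len(diffs)
share_reversed = sum(d < 0 for d in diffs) / len(diffs)
print(round(mean_diff, 1), round(share_reversed, 2))
```

On this account, the larger discriminal dispersion attributed to Harden is exactly what allows a voter to rank Leonard higher on a given occasion even though Harden's average process is higher.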

9.3.2 The Law of Comparative Judgment

In presenting Thurstone’s law of comparative judgment, I now return to the notation from Bock and Jones (1968) previously established in Chapter 2 when presenting Fechner’s approach to measurement. I use X in place of Thurstone’s


R, and Y in place of Thurstone’s S, but I will maintain his terminology by now thinking of Yj as the random variable that characterizes the different possible discriminal processes (where j indexes a unique process) that may be associated with any specific stimulus value xj. Each Yj ~ N(μj, σj²) has μj and σj² as the mean and variance (i.e., dispersion) of the discriminal process induced by xj. The path toward locating a stimulus value on a psychological continuum was in the idea that while the discriminal process for any single stimulus cannot be observed, the difference in the processes induced by a pairing of stimuli can be estimated. To do so, a person must be asked to make a judgment about order when presented with a pair of stimuli. Recalling the example from the previous section, we can present a person with the names of two MVP candidates, James Harden and Kawhi Leonard, and then ask the person which of the two was the more valuable player during the 2019–2020 season. If the discriminatory process can be replicated, we can generate an observed proportion of judgments in which Harden is viewed as the more valuable player. More generally, for any discriminatory comparison between two paired stimuli, {xt, xc}, Thurstone’s law of comparative judgment2 is

$$\mu_t - \mu_c = z_{tc}\sqrt{\sigma_t^2 + \sigma_c^2 - 2 r_{tc}\,\sigma_t \sigma_c}. \tag{9.2}$$

But this can also be written as

$$z_{tc} = \Phi^{-1}(P_{tc}) = \frac{\mu_t - \mu_c}{\sqrt{\sigma_t^2 + \sigma_c^2 - 2 r_{tc}\,\sigma_t \sigma_c}}, \tag{9.3}$$

or equivalently,

$$P_{tc} = \Phi\!\left(\frac{\mu_t - \mu_c}{\sqrt{\sigma_t^2 + \sigma_c^2 - 2 r_{tc}\,\sigma_t \sigma_c}}\right). \tag{9.4}$$

In the last two equations above, Ptc represents the probability of observing the judgment that xt > xc, Φ is the standard normal cumulative distribution function, μt − μc is the mean difference in the discriminal processes associated with the stimuli {xt, xc}, and the square-root term is what Gulliksen (1950) later described as the comparatal dispersion. The comparatal dispersion is the square root of the sum of the variances of the two discriminal processes associated with the stimulus pairing {xt, xc}, minus twice the covariance between the two processes. Unlike Fechner’s approach when applying the constant method, in which variable stimuli (the “treatment”) were compared to a constant stimulus (the “control”), Thurstone’s law of comparative judgment specifies a system of equations for all possible unique pairs of judgments. That is, given a series of n stimuli, since each stimulus could serve as both a treatment and a control for the contrast {xt, xc}, and if the same stimulus was not used as both standard


and variable stimulus in the same comparison, then there will be n(n − 1)/2 unique versions of Equation 9.4. As such, Fechner’s analysis of results using the constant method represents a special case of Thurstone’s expression in which (a) there are only n comparisons being made, (b) the variability in the discriminal processes evoked is independent of the pairs of stimuli to which a subject (or subjects) are exposed, and (c) most importantly, the discriminal dispersions are assumed to be equal. In this scenario, Equation 9.4 reduces to

$$P_{tc} = \Phi\!\left(\frac{\mu_t - \mu_c}{\sigma\sqrt{2}}\right), \tag{9.5}$$

and one only needs to substitute the observed proportion ptc for Ptc, set the unit of measurement to be σ = 1, and then take the inverse of the normal cdf to get an estimate for μt − μc. And then if the location of a single stimulus on this new scale were fixed, independent locations of all the rest could be established in relationship to it. In all, Thurstone (1927b) described five different cases that corresponded to increasingly restrictive assumptions on his general expression of Equation 9.2. The first two cases maintained the full form of the equation but differed in the data being modeled, with Case 1 representing the repeated judgments of a single subject and Case 2 representing the judgments of a sample of subjects making single judgments.3 Implicitly, Cases 1 and 2 differ in the assumptions being made about the chance process that induces discriminal dispersion. In Case 1, it is assumed that the dispersion comes from (hypothetical) replications of a judgment within a single person. In Case 2, it is assumed that this comes from variability in judgments from a random sample of people asked to form judgments, and, implicitly, that between-subject variability is an equal or adequate substitute for within-subject variability. In Case 3, the expression is simplified by the assumption that the discriminal processes for all different stimuli pairings are uncorrelated. Case 4 also imposes the restriction of no correlation and in addition expresses the discriminal dispersion of any unique stimulus as a simple linear function of one value of the dispersion term (under the assumption that the differences in discriminal dispersions for any pair of stimuli would tend to be small). Finally, Case 5 represents the most constrained version of the expression in that not only is there no correlation between discriminal processes but all discriminal dispersions are equal as well. 
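To make the algebra concrete, here is a small sketch of Equations 9.4 and 9.5 using Python's standard normal distribution. The means, dispersions, and correlation fed to the general form are invented; the Case 5 inversion, by contrast, uses an actual proportion from the study discussed in Section 9.4 (93% of students judged a bank robber as deserving more severe punishment than a gambler):

```python
from math import sqrt
from statistics import NormalDist

phi = NormalDist()  # standard normal cdf / inverse cdf

def p_comparative(mu_t, mu_c, sigma_t, sigma_c, r_tc):
    """General law of comparative judgment (Eq. 9.4): P(t judged > c)."""
    comparatal = sqrt(sigma_t**2 + sigma_c**2 - 2 * r_tc * sigma_t * sigma_c)
    return phi.cdf((mu_t - mu_c) / comparatal)

# Invented parameter values: unequal, correlated dispersions (Cases 1-2).
p_general = p_comparative(7.0, 5.0, 2.0, 1.0, 0.3)

# Case 5: equal, uncorrelated dispersions. Setting the comparatal unit to 1
# and inverting Eq. 9.5 turns an observed proportion into a scale separation.
p_observed = 0.93                   # from Thurstone's crime data (Table 9.1)
separation = phi.inv_cdf(p_observed)
print(round(p_general, 3), round(separation, 2))  # separation is 1.48
```

The recovered separation of 1.48 is the same unit normal deviate that appears for this pair in Table 9.2, which is the sense in which the discriminal dispersion serves as the scale's unit.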
An immediate contribution of Thurstone’s derivation of the law of comparative judgment was to make certain assumptions of the methods of classical psychophysics more transparent. In the process, he was able to resolve some common misconceptions in the prevailing literature. For example, Fechner’s law and Weber’s law were often referred to interchangeably or combined as the “Weber-Fechner” law. Using the more general expression of the law of


comparative judgment, Thurstone was able to demonstrate that the two laws were independent of one another (Thurstone, 1927a, 1927c). If discriminal dispersions were constant, as in Thurstone’s Case 5, then Fechner’s Law and Weber’s Law will coincide. But when discriminal dispersions could vary (as in Cases 1–4), Fechner’s Law could apply even when Weber’s Law cannot be verified, and vice versa. More generally, Thurstone was able to show that, just as he had found in the context of educational scaling, a variety of results in classical psychophysics hinged upon assumptions being made about the unit of measurement through restrictions placed or relaxed upon variability in errors that took on the kind of interpretation he was giving to discriminal dispersions. These were assumptions that, with some effort and ingenuity, could be evaluated empirically. Importantly, Thurstone proposed an empirical check on the assumption common to all five cases of the law of comparative judgment, that discriminal dispersion could be derived from the variance of a normal distribution. Consider a situation in which we pick one stimulus xc to serve as a control, and judgments are made comparing two treatment stimuli to this control under the assumption of no correlation among discriminal processes (i.e., Case 4). Then since

$$\mu_{t_1} - \mu_c = z_{t_1 c}\sqrt{\sigma_{t_1}^2 + \sigma_c^2} \quad \text{and} \quad \mu_{t_2} - \mu_c = z_{t_2 c}\sqrt{\sigma_{t_2}^2 + \sigma_c^2},$$

this implies

$$\mu_{t_1} - \mu_{t_2} = z_{t_1 c}\sqrt{\sigma_{t_1}^2 + \sigma_c^2} - z_{t_2 c}\sqrt{\sigma_{t_2}^2 + \sigma_c^2}. \tag{9.6}$$

Now, if the normality assumption holds, then the difference between the two treatment stimuli cannot depend on the choice of stimulus being used as the control. That is, if we instead substituted other stimuli for xc, we should still find the same difference μt1 − μt2. Thurstone regarded this as a check on the internal consistency of the approach.
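The logic of this check can be sketched as follows, with invented locations and dispersions. In practice the z-values would come from observed proportions, so the two recovered distances would agree only approximately, and a large disagreement would indict the normality assumption:

```python
from math import sqrt

# Sketch of the internal consistency check built on Eq. 9.6 (Case 4: no
# correlation among discriminal processes). All parameter values invented.
mu = {"t1": 2.0, "t2": 1.2, "c1": 0.5, "c2": 0.9}
sigma = {"t1": 1.1, "t2": 0.9, "c1": 1.0, "c2": 1.3}

def distance_via_control(t1, t2, c):
    """Recover mu_t1 - mu_t2 from comparisons against a shared control."""
    s1 = sqrt(sigma[t1]**2 + sigma[c]**2)
    s2 = sqrt(sigma[t2]**2 + sigma[c]**2)
    z1 = (mu[t1] - mu[c]) / s1  # model-implied z; empirically, Phi^-1(p)
    z2 = (mu[t2] - mu[c]) / s2
    return z1 * s1 - z2 * s2

# Under the model, the recovered distance is identical for either control.
d_via_c1 = distance_via_control("t1", "t2", "c1")
d_via_c2 = distance_via_control("t1", "t2", "c2")
print(round(d_via_c1, 3), round(d_via_c2, 3))  # both 0.8
```

Here the equality holds by construction, since the z-values are computed from the model itself; with real judgment data the comparison becomes an empirical test rather than an identity.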

9.4 Constructing a Psychological Continuum

One of the more interesting illustrations of Thurstone’s approach to measurement can be found in a study in which Thurstone examined whether going to the movies (still a relatively new invention in the late 1920s) could be shown to have an effect on attitudes (Thurstone, 1931b). To pull off the study, Thurstone needed a way to construct a measure of a social attitude, so let’s see how he built from his law of comparative judgment in two different ways for this purpose. The first represents a direct application of Case 5 of the law of comparative judgment; the second represents a related approach that became known as the method of equal appearing intervals. The study Thurstone published involved experiments that revolved around two different films that came out in 1930, Street of Chance4 and Hide-Out,5 and two different samples of high school–aged students in the cities of Mendota and Princeton in Illinois. Each film had as a central component the portrayal of a


protagonist who was involved in what could have been considered at the time a form of criminal activity: gambling (Street of Chance) and bootlegging (Hide-Out; the term bootlegging refers to selling alcohol during Prohibition in the United States, which lasted from 1920 to 1933). The basic design of each study was a single-group pre-post experiment. Students’ attitudes toward crime and Prohibition, respectively, would be solicited one week before and after seeing each movie. The empirical question was whether there was evidence of a significant change in attitudes and whether this change could be attributed to seeing the movie. Ultimately Thurstone concluded that seeing Street of Chance did have an effect, while Hide-Out did not, but here we will put aside the validity of these causal inferences and instead focus on the methods Thurstone applied to measure the students’ attitudes toward crime on each occasion.

9.4.1 Applying the Law of Comparative Judgment in the Street of Chance Experiment

For the Street of Chance experiment, Thurstone generated a list of 13 offenses or activities that could be perceived as crimes. He then arranged the 13 offenses into all 78 possible pairwise comparisons and instructed the students participating (n = 240) to underline “the one crime of each pair that you think should be punished more severely.” The students were told that in the event they could not decide, they should still pick one choice, even if this had to be done at random. The results, indicating the proportion of time one offense was judged to be more severe than another, were recorded and tabulated, and this was done for each occasion before and after seeing the film. An example of the results from the first occasion, adapted from Thurstone (1931b), is shown in Table 9.1. In this context, the 13 offenses are analogous to the stimuli in a psychophysical experiment, and Thurstone’s (1927c) aim was to locate these stimuli on a psychological scale such that the distances between them could be expressed in terms of what he would refer to as a subjective unit of measurement. Notice that each column in Table 9.1 could be thought of as the results from implementing Fechner’s constant method after choosing any one of the 13 offenses as the control stimulus to which the other test stimuli would be compared. Thurstone proceeded by applying Case 5 of the law of comparative judgment to the data. His first step was to transform each of the observed proportions into a corresponding standard normal deviate by computing z_tc = Φ⁻¹(p_tc). So, for example, let each column in Table 9.1 from 1 to 13 represent a unique value for a control stimulus (i.e., c = 1, . . . , 13), and then each distinct cell within that column will represent a test stimulus (i.e., t = 1, . . . , 12). 
Since only 7% of students judged being a gambler to be an offense worthy of more severe punishment than being a bank robber, z21 = Φ⁻¹(0.07) = −1.48. With these values in place in Table 9.2, the next task is to locate the distances between each stimulus pair on a common scale. Applying Case 5 of the

TABLE 9.1 Proportion of the Schoolchildren in Mendota, Illinois, Who Said That the Offense at the Top of the Table Is More Serious Than the Offense at the Side of the Table

                      br     gam    pp     dr     qd     bl     beg    gang   tr     sp     pt     ki     sm
                      1      2      3      4      5      6      7      8      9      10     11     12     13
 1  bank robber              0.07   0.08   0.05   0.27   0.29   0.01   0.50   0.01   0.06   0.02   0.73   0.21
 2  gambler           0.93          0.71   0.52   0.76   0.92   0.07   0.92   0.05   0.41   0.49   0.90   0.81
 3  pickpocket        0.92   0.29          0.25   0.67   0.75   0.02   0.86   0.02   0.39   0.42   0.87   0.68
 4  drunkard          0.95   0.48   0.75          0.81   0.95   0.01   0.92   0.03   0.37   0.62   0.91   0.87
 5  quack doctor      0.73   0.24   0.33   0.19          0.49   0.02   0.70   0.02   0.12   0.22   0.64   0.55
 6  bootlegger        0.71   0.08   0.25   0.05   0.51          0.01   0.79   0.01   0.09   0.26   0.68   0.50
 7  beggar            0.99   0.93   0.98   0.99   0.98   0.99          0.96   0.42   0.86   0.96   0.99   0.99
 8  gangster          0.50   0.08   0.14   0.08   0.30   0.21   0.04          0.02   0.08   0.08   0.36   0.31
 9  tramp             0.99   0.95   0.98   0.97   0.98   0.99   0.58   0.98          0.91   0.97   0.99   0.99
10  speeder           0.94   0.59   0.61   0.63   0.88   0.91   0.14   0.92   0.09          0.58   0.90   0.92
11  petty thief       0.98   0.51   0.58   0.38   0.78   0.74   0.04   0.92   0.03   0.42          —      0.78
12  kidnapper         0.27   0.10   0.13   0.09   0.36   0.32   0.01   0.64   0.01   0.10   —             0.27
13  smuggler          0.79   0.19   0.32   0.13   0.45   0.50   0.01   0.69   0.01   0.08   0.22   0.73
    MEAN              0.81   0.38   0.49   0.36   0.65   0.67   0.08   0.82   0.06   0.32   0.44   0.79   0.66

Source: Thurstone (1931b). Note: Cells with “—” were left blank in Thurstone’s original publication with no explanation given. A few values of 0 and 1 were changed to .01 and .99 to make it possible to compute z-scores, and this brings to light one weakness of the approach when perfect discrimination is possible in a given comparison.

TABLE 9.2 Unit Normal Deviates Associated With the Proportions of the Schoolchildren in Mendota, Illinois, Who Said That the Offense at the Top of the Table Is More Serious Than the Offense at the Side of the Table

                      br      gam     pp      dr      qd      bl      beg     gang    tr      sp      pt      ki      sm
                      1       2       3       4       5       6       7       8       9       10      11      12      13
 1  bank robber               −1.48   −1.41   −1.64   −0.61   −0.55   −2.33    0.00   −2.33   −1.55   −2.05    0.61   −0.81
 2  gambler            1.48            0.55    0.05    0.71    1.41   −1.48    1.41   −1.64   −0.23   −0.03    1.28    0.88
 3  pickpocket         1.41   −0.55           −0.67    0.44    0.67   −2.05    1.08   −2.05   −0.28   −0.20    1.13    0.47
 4  drunkard           1.64   −0.05    0.67            0.88    1.64   −2.33    1.41   −1.88   −0.33    0.31    1.34    1.13
 5  quack doctor       0.61   −0.71   −0.44   −0.88           −0.03   −2.05    0.52   −2.05   −1.17   −0.77    0.36    0.13
 6  bootlegger         0.55   −1.41   −0.67   −1.64    0.03           −2.33    0.81   −2.33   −1.34   −0.64    0.47    0.00
 7  beggar             2.33    1.48    2.05    2.33    2.05    2.33            1.75   −0.20    1.08    1.75    2.33    2.33
 8  gangster           0.00   −1.41   −1.08   −1.41   −0.52   −0.81   −1.75           −2.05   −1.41   −1.41   −0.36   −0.50
 9  tramp              2.33    1.64    2.05    1.88    2.05    2.33    0.20    2.05            1.34    1.88    2.33    2.33
10  speeder            1.55    0.23    0.28    0.33    1.17    1.34   −1.08    1.41   −1.34            0.20    1.28    1.41
11  petty thief        2.05    0.03    0.20   −0.31    0.77    0.64   −1.75    1.41   −1.88   −0.20           —       0.77
12  kidnapper         −0.61   −1.28   −1.13   −1.34   −0.36   −0.47   −2.33    0.36   −2.33   −1.28   —               −0.61
13  smuggler           0.81   −0.88   −0.47   −1.13   −0.13    0.00   −2.33    0.50   −2.33   −1.41   −0.77    0.61
    SUM               12.67   −4.38    0.62   −4.43    6.48    8.51  −21.59   12.69  −22.42   −6.78   −1.73   11.38    7.51
    MEAN               1.18   −0.37    0.05   −0.37    0.54    0.71   −1.80    1.06   −1.87   −0.57   −0.16    1.03    0.63

Source: Thurstone (1931b). Note: Unit normal deviates for each cell computed as z_rc = Φ⁻¹(p_rc), where subscript r indexes a row from Table 9.1 and c indexes a column.

law of comparative judgment6 for two different control stimuli (c = 1 for bank robber and c = 2 for gambler),

$$\mu_1 - \mu_t = z_{t1} \tag{9.7}$$

and

$$\mu_2 - \mu_t = z_{t2}. \tag{9.8}$$

If we subtract Equation 9.8 from Equation 9.7, we get

$$\mu_1 - \mu_2 = z_{t1} - z_{t2}. \tag{9.9}$$

Writing this in summation form, so that we can see all the comparisons between variable stimuli and the bank robber condition and all comparisons between variable stimuli and the gambler condition, we get

$$(n-1)(\mu_1 - \mu_2) = \sum_t z_{t1} - \sum_t z_{t2}, \qquad (\mu_1 - \mu_2) = \frac{\sum_t z_{t1} - \sum_t z_{t2}}{n-1}, \tag{9.10}$$

where n is the total number of stimuli. The basic idea is to estimate the distance between bank robber and gambler by finding the average linear distance between gambler and all other crimes and then subtracting this from the average distance between bank robber and all other crimes. We could follow the same approach to find the distances between any other two crimes. The more that estimates of distances between a pair of crimes using (μ1 − μ2) = (Σt z_t1 − Σt z_t2)/(n − 1) differ from estimates based on z_tc = Φ⁻¹(p_tc), the more this strains the credibility of the assumption of a constant discriminal dispersion and hence the validity of the law of comparative judgment. Thurstone checked this by plotting the z-values across pairs of columns as in Table 9.2 and found that the two values fell along a straight line with a unit slope and a predictable difference in the intercept. He took this as evidence in favor of the internal consistency of the approach. Back to the attitude toward crime scale. What Thurstone had now was a method for finding the linear distances between any two crimes. What he needed next was to locate these distances on a common scale. As there is no obvious origin for such a scale, Thurstone set one by choosing the crime that was most frequently judged to be more severe than others (gangster) and setting its scale value to 0. He then expressed all other crimes as positive deviations from this origin so that higher values represented crimes for which attitudes

FIGURE 9.4 Seriousness of Crimes as Judged by 240 High School Students in Mendota, Illinois, Before and After Seeing the Film Street of Chance. Higher values represent more serious crimes. [Scatterplot of after-viewing scale values (vertical axis, 0 to 3) against before-viewing scale values (horizontal axis, 0 to 3), with a fitted line; the labeled point for “Gambler” sits well off that line.]

were more lenient. He followed the identical approach for the table of proportions gathered from students the day after they had watched Street of Chance. He then plotted the two sets of scale values for the 13 crimes and fit a line to the points. I have re-created this plot in Figure 9.4. Two things stood out. First, it was obvious that the location on the attitude scale for gambler had changed rather significantly, while the locations for all other crimes stayed about the same. Evidently, the students in this study found gambling to be worthy of more severe punishment relative to the other crimes after seeing the film.7 Second, Thurstone found that the slope of the line was less than 1 (it was .95), the consequence of a spread in scale values after the second administration of the paired comparisons that was smaller than the spread after the first administration. He inferred from this that the discriminal error had been larger on the second occasion, leading to slightly smaller scale separations. To his mind, this was most likely the result of students getting bored with the task and therefore putting less cognitive effort into each comparison. But because the unit of each scale on each occasion was defined by the average discriminal dispersion, and since the two differed, the scales were no longer comparable. To account for this, he added a “stretching factor” of .046 to each crime’s location on the second-occasion scale so that the same crime could be compared across occasions with respect to a common unit. This was an early example of what is now known in the educational measurement literature as test equating. Thurstone represented the final result of his efforts visually on the two scales shown in Figure 9.5.
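The common-unit adjustment can be sketched in miniature. The before/after scale values below are invented, and the adjustment shown is a simple multiplicative stretch based on a fitted slope; treat this as a hedged illustration of the general equating idea rather than Thurstone's exact computation (he reports an additive correction of .046):

```python
# Invented scale locations for the same six crimes on two occasions; the
# second occasion's values have a slightly compressed spread.
before = [0.0, 0.4, 1.1, 1.6, 2.4, 3.0]
after = [0.05, 0.42, 1.05, 1.50, 2.30, 2.85]

def fitted_slope(x, y):
    """Least-squares slope of y on x."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

b = fitted_slope(before, after)     # < 1: the after-scale is compressed
stretched = [v / b for v in after]  # rescale to the before-occasion unit
print(round(b, 3), round(fitted_slope(before, stretched), 3))
```

After the stretch, the fitted slope between the two occasions is exactly 1, so locations on the two scales can be compared in a common unit.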

FIGURE 9.5 Thurstone’s Item Map Showing the Changes in Scale Locations Before and After Students Viewed the Film Street of Chance. [Two parallel vertical scales running from 0 to 3.0. BEFORE, from 0 upward: gangster, bank robber, kidnapper; bootlegger, smuggler, quack doctor; pickpocket; petty thief, gambler, drunkard, speeder; beggar; tramp. AFTER: bank robber, gangster, kidnapper; smuggler, quack doctor, bootlegger; gambler, pickpocket; petty thief, drunkard; speeder; beggar; tramp.]

Source: Thurstone (1931b). © Taylor & Francis.
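The scaling computation behind these item maps can be sketched on a 4-offense subset of Table 9.1. The proportions below are taken directly from that table; because only a subset of the 13 offenses is used, the resulting scale values differ somewhat from Thurstone's full solution, and the sign convention (origin at the most severe offense, higher values judged more leniently) follows his:

```python
from statistics import NormalDist

# Case 5 scaling applied to a 4-offense subset of Thurstone's Table 9.1.
# P[a][b] is the proportion of students judging offense a as deserving
# more severe punishment than offense b (values from the source table).
P = {
    "bank robber": {"gambler": 0.93, "gangster": 0.50, "beggar": 0.99},
    "gambler":     {"bank robber": 0.07, "gangster": 0.08, "beggar": 0.93},
    "gangster":    {"bank robber": 0.50, "gambler": 0.92, "beggar": 0.96},
    "beggar":      {"bank robber": 0.01, "gambler": 0.07, "gangster": 0.04},
}

inv = NormalDist().inv_cdf
# Each offense's provisional value: its average z-separation from all the
# others (the averaging logic of Equation 9.10).
means = {a: sum(inv(p) for p in row.values()) / len(row) for a, row in P.items()}

# Fix an origin at the most severe offense; larger values then correspond
# to offenses judged less deserving of punishment.
origin = max(means.values())
scale = {a: round(origin - m, 2) for a, m in means.items()}
print(scale)
```

With this subset, bank robber anchors the origin and beggar lands roughly three discriminal-dispersion units away, echoing the wide separation visible at the top and bottom of Figure 9.5.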

9.4.2 Applying the Method of Equal Appearing Intervals in the Hide-Out Experiment

In the Street of Chance experiment, Thurstone had created a scale for attitude toward crime based on student judgments and located each of 13 crimes on that scale. What he did not do is attempt to locate the students themselves on this scale. In his second experiment, with 254 students in the town of Princeton, Illinois, he did both, this time employing a survey technique that was inspired


by the law of comparative judgment but involved some modifications, such that it became known as the method of equal appearing intervals. Three years earlier, Thurstone (1928a) had introduced this approach with the provocatively8 titled article “Attitudes Can Be Measured.” There are at least two practical drawbacks to the collection of pairwise judgments as a basis for measurement. The first is that as the number of stimuli increases, the number of pairwise comparisons that need to be made grows quadratically. Already with just 13 instances of crimes as stimuli, subjects were required to make 78 comparisons, and Thurstone suspected the onset of boredom by the second occasion. If this had been doubled to a scenario in which there were thought to be 26 unique stimuli, this would have required 325 comparisons. The second drawback is that the point of the psychophysical method going back to Fechner had been to locate stimuli on a new measurement scale, not to locate the subject or subjects providing the judgments on the scale. Indeed, notice that as a statistical model, the law of comparative judgment does not include a person-specific parameter (for additional details, see Andrich, 1978). Thurstone, in collaboration with some of his graduate students at the time, came up with what he would describe as a “cruder” approach that only indirectly invoked the law of comparative judgment but had the advantage that it could be used to locate both stimuli and people on the same scale. It was still a time-consuming approach, but the time commitment came from the up-front cost of establishing the scale. Once established, using a survey to locate people on the scale would be straightforward. In the movie Hide-Out, the focus was on a protagonist who was a bootlegger, and of interest was whether seeing the movie would have an effect on students’ attitudes toward Prohibition. How would this be measured?

The approach taken (as implemented by one of Thurstone’s graduate students, Hattie Smith) proceeded as follows. First, subjects would be gathered, and the literature would be searched to assemble a list of about 100 to 150 statements that represented a continuum ranging from those that were strongly against Prohibition to those that were neutral about it and to those that were strongly in favor of it. Examples included the following:

• Strongly in Favor: “Since the liquor traffic is a curse to the human family it must be dealt with by law.”
• Neutral: “Both good and bad results have come from the 18th Amendment.”
• Strongly Against: “The 18th Amendment should be repealed.”
Strongly in Favor: “Since the liquor trafc is a curse to the human family it must be dealt with by law.” Neutral: “Both good and bad results have come from the 18th Amendment.” Strongly Against: “The 18th Amendment should be repealed.”

Next, a panel of 200 to 300 raters would be gathered, and they would be presented with each of the 100 to 150 statements on small cards. Each rater’s task was to sort the cards into 11 piles, with only the middle pile (neutral) and the piles on the two ends labeled (strongly negative and strongly affirmative). Raters were asked to sort the cards so that they seem to be “fairly


evenly spaced” among the 11 piles (this explains the naming of this technique as “equal appearing” intervals). What was Thurstone after here? Well, what he was trying to simulate was a process that would produce, for each stimulus (i.e., statement about Prohibition), something akin to a discriminal process. If, for example, he looked at the results of 300 ratings for the item “Both good and bad results have come from the 18th Amendment,” he would find some frequency distribution of ratings. To most raters, this statement would be interpreted as neutral and be placed into pile 6, but other raters might see it as at least somewhat negative or somewhat positive and place it into piles 3–5 or 7–9. The center of this distribution would provide a basis for inferences about the location of the item on a scale, and the spread would provide a basis for inferences about the discriminal dispersion. Thurstone modeled the cumulative category response frequencies for each item with a normal ogive response function and set the location of the item on the 1-to-11 scale as the value where the curve crossed the .50 threshold. Here Thurstone was borrowing directly from the concept of the jnd in classical psychophysics, since an item at this location of the scale would be the point at which we could predict that an individual would endorse the statement across 50% of repeated occasions, or where 50% of individuals with this attitude level would endorse the statement. At this stage, one would now have somewhere in the range of 100 to 150 items that could be located on an attitude scale. The next challenge was to choose a subset of these items that could be administered as a survey. Here Thurstone proposed forms of item analysis that could be used to group the items in terms of their ability to adequately discriminate between individuals at different locations of the scale. 
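The pile-sorting logic can be sketched with an invented distribution of 300 ratings for a single statement. The location rule below is a discrete stand-in for reading off where the fitted normal ogive crosses .50, and the spread of the rating distribution stands in for the discriminal dispersion:

```python
# Invented pile counts (on the 1-to-11 scale) for one statement, 300 raters.
counts = {5: 30, 6: 180, 7: 60, 8: 30}  # pile number -> number of raters
n = sum(counts.values())

# Item location: the pile at which the cumulative proportion of ratings
# first reaches .50 (a discrete stand-in for the normal-ogive threshold).
cum = 0
for pile in sorted(counts):
    cum += counts[pile]
    if cum / n >= 0.5:
        location = pile
        break

# Spread of the rating distribution, a stand-in for discriminal dispersion;
# a wide spread marks the statement as comparatively ambiguous.
mean = sum(p * c for p, c in counts.items()) / n
spread = (sum(c * (p - mean) ** 2 for p, c in counts.items()) / n) ** 0.5
print(location, round(spread, 2))  # location 6, spread 0.78
```

Here the statement lands in the neutral region of the scale with a fairly tight spread, so it would survive a screen against ambiguous items.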
The easiest analysis to implement involved an examination of the slopes of each item’s cumulative distribution function (or, equivalently, the spread of each item’s frequency distribution). The flatter the curve, the more ambiguous the statement, and ambiguous statements were best excluded from the survey. Thurstone called this the “objective criterion of ambiguity.” In a more involved analysis that required administering the full set of candidate items as a survey, one would examine for each item the spread of the locations along the final scale for all those who had endorsed the item. Thurstone called this the “objective criterion of irrelevance.” A final set of about 20 to 30 items would be selected with locations that spanned the full continuum of the intended scale and that were neither ambiguous nor irrelevant according to Thurstone’s statistical criteria. For the Hide-Out experiment, Thurstone administered a 28-item survey with a range of opinion statements about prohibition both before and after students saw the movie. Students were asked simply to put a check mark next to each statement (item) if they agreed with it or a cross next to it if they disagreed. A student’s location on the scale was then estimated by taking the mean of the scale locations for all endorsed items. The uncertainty of a student’s location could be estimated by the standard deviation of the scale locations for all endorsed items. Figure 9.6

[Figure: two frequency histograms of student scale scores on the attitude-toward-Prohibition continuum, running from favorable (2) to unfavorable (9): one panel labeled “Before” with a mean of 4.27 and one labeled “After” with a mean of 4.21.]

FIGURE 9.6 Thurstone’s Comparison of the Distributions of Students’ Attitudes Toward Prohibition Before and After Seeing the Movie Hide-Out.

Source: Thurstone (1931b). © Taylor & Francis.

shows the graphics that Thurstone used to compare the distribution of student-specific scale scores before and after the movie. Although there was some evidence that attitudes toward prohibition became more skewed after seeing the film, any apparent effect on the mean looked negligible. Why did Thurstone consider this method to be crude relative to the approach that involved all paired comparisons and the direct application of the law of comparative judgment? Because the scale only established locations without having an internal experimental basis for establishing and validating the unit. In the survey approach, the unit of measurement is specified in a mostly ad hoc way by the developer when deciding on the number of categories or bins into which each item was sorted during the survey development phase. For the
Prohibition survey, the unit was set in advance to be 1/10 of the distance from the most negative to the most positive opinion about prohibition. In contrast, for the more general attitude toward crime scale, the unit was set empirically with respect to the comparatal dispersion term, σ_c. To Thurstone, claims of measurement in psychology, much like claims of measurement in the physical sciences, were to be evaluated in terms of the empirical warrant that could be mustered to justify the measurement unit.
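The survey scoring rule described earlier — a respondent’s attitude is the mean scale location of the endorsed statements, with the standard deviation of those locations as an index of uncertainty — can be sketched as follows (hypothetical scale values and responses; the function name is my own):

```python
from statistics import mean, stdev

def score_respondent(scale_values, endorsed):
    """Mean scale location of all endorsed statements, with the standard
    deviation of those locations as an index of uncertainty."""
    locs = [v for v, checked in zip(scale_values, endorsed) if checked]
    return mean(locs), stdev(locs)

# hypothetical scale values for five statements on the 1-to-11 continuum
values = [2.0, 3.5, 4.5, 5.0, 8.6]
# one student's check marks (True = agree) against those statements
checks = [True, True, True, True, False]
m, s = score_respondent(values, checks)
print(m)  # 3.75 — toward the favorable end of the scale
```

A respondent who endorses statements scattered across the whole continuum would receive a large standard deviation, signaling that the scale placement is unreliable for that person.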

9.5 Thurstone’s Conception of Measurement

9.5.1 Subjective Measurement Units

Perhaps the only frustrating aspect of reading Thurstone’s work, which is, on the whole, quite remarkable for the clarity and lucidity of his prose, is his failure to provide much of any insight into the influences on his thinking.9 In Thurstone’s contributions to educational scaling, it is obvious he was building on a foundation already established by Galton and Thorndike in particular, yet citations and references to them (or anyone else for that matter) are conspicuously absent. Similarly, his reconceptualization of psychophysics must have drawn from Fechner, but there is no evidence that Thurstone was familiar with Fechner’s original research program as opposed to the secondary or tertiary characterizations of it through Titchener’s (1905) textbook. Certainly relative to Fechner, Thurstone’s approach to measurement greatly broadened the scope of psychological attributes that became amenable to quantitative treatment, and his theoretical rationale struck some familiar chords relative to Fechner, Galton, and Spearman in that he equated the challenge of psychological measurement with the challenge of measuring intensive attributes in the physical sciences:

In almost every situation involving measurement there is postulated an abstract continuum such as volume or temperature, and the allocation of the thing measured to that continuum is accomplished usually by indirect means through one or more indexes. Truth is inferred only from the relative consistency of the several indices, since it is never directly known. We are dealing with the same type of situation in attempting to measure attitude. We must postulate an attitude variable which is like practically all other measurable attributes in the nature of an abstract continuum, and we must find one or more indexes which will satisfy us to the extent that they are internally consistent.
(Thurstone, 1929, 217)

To Thurstone, psychological measurement had to begin with the postulation of an attribute that could be located on an “abstract” continuum. We can see
the exact moments that postulation was enacted in his presentation of discriminal processes and dispersion in Figures 9.2 and 9.3: it happens in three steps, first with the assumption that qualitative stimuli can be ordered, second in assuming the existence of a parallel series of qualitatively ordered discrete internal processes, and third in assuming the existence of a continuous distribution of the processes with a unit of measurement defined by a standard deviation. Measurement in the human and physical sciences alike aspires to be an activity that transforms something initially observed as an order into that which is subject to quantitative evaluation. What made Thurstone’s approach unique was that the transformation from order to quantity was premised upon error, or what S. S. Stevens (1966b) later referred to as the “confusion” of subjects when they were asked to make discriminations. Consider an individual exposed to pairs of stimuli and asked to compare them on repeated occasions. Now, if the same individual always placed stimuli into the same internal order when the process was replicated, Thurstone would have had to conclude that measurement was impossible. What opened the door was, ironically enough, the introduction of chance errors. The concept of discriminal dispersion gave Thurstone the tool he needed to construct a psychological continuum that was, quite literally, the result of stitching together the probability distributions evoked by observable stimuli. It provided him with what he would repeatedly refer to as a mental unit of measurement. But this is not to say that Thurstone saw no distinction between physical and psychological measurement. With respect to the psychological scale itself, he wrote, “The psychological scale is at best an artificial construct. If it has any physical reality, we certainly have not the remotest idea what it may be like” (Thurstone, 1927b, 44).
In my view, the difference Thurstone saw between psychological and physical measurement had everything to do with the meaning and interpretation of physical vs. mental units. Thurstone viewed physical measurement as premised on the availability of physically realizable standard units. Such units had a reality that could be physically observed and experimentally replicated; hence, they were objective units of measurement. By contrast, the standard units that could be established for a psychological scale, constructed from the chance variability in a stimulus-response process, might, under a best-case scenario, be experimentally replicable, but they could never be physically observed. Thurstone referred to them, quite intentionally, as subjective units of mental measurement—not subjective in the sense that the methods for establishing the units were subjective but subjective because they depended upon an individual’s internal response process that was governed, at least in part, by chance. There is a seeming paradox here. One way to evaluate the quality of a psychological scale crafted using Thurstone’s methods would be to look for evidence of sharp discrimination among the source stimuli being compared or rated. But if taken to the extreme, such that stimuli could be generated for which perfect
discrimination were possible, then this would mean that there would be no discriminal dispersion. And without discriminal dispersion, there is no unit of measurement, and the scale collapses to a set of ordered values. For the psychological continuum to be interpretable as a quantitative continuum, we need error. But is it not the point of good measurement to continuously find ways to reduce error? And as for error, can it really be modeled as a chance process? If chance comes from within, the prospects for evaluating its shape over replications are fraught with difficulty. This paradox is still a matter of contemporary debate, although it has more recently been framed around the application of the Rasch Model (Rasch, 1960). For details, see Sijtsma (2012), Humphry (2013, 2017), Michell (2014), and Sijtsma and Emons (2013).

9.5.2 The Role of Invariance

Although Thurstone’s method of measurement can be viewed as something of an amalgam of conceptualizations that had already been introduced by Fechner and Galton and therefore subject to the same sorts of criticisms related to the quantity objection (see Michell, 2012b), Thurstone placed much greater emphasis on making the assumptions of his methods transparent and subject to falsification. He expected psychological units of measurement to be subjective but for the measurer to be objective. For example, as we have seen, Thurstone proposed empirical checks that could be conducted to test the adequacy of his assumption of normally distributed discriminal errors. Although the analyses he was suggesting were probably insufficient to settle the issue (see Luce, 1994), the principle that the validity of psychological measurement should be subject to strong empirical checks was important in keeping the question of measurability as a hypothesis rather than a given. Thurstone took this idea of using experimental evidence to test modeling assumptions even further by stressing the principle of invariance as the essential criterion for generalizing from an underlying model (e.g., the law of comparative judgment) to the application of the model to build a measuring instrument. In Attitudes Can Be Measured, he wrote:

The scale must transcend the group measured. One crucial experimental test must be applied to our method of measuring attitudes before it can be accepted as valid. A measuring instrument must not be seriously affected in its measuring function by the object of measurement. To the extent that its measuring function is so affected, the validity of the instrument is impaired or limited. If a yardstick measured differently because of the fact that it was a rug, a picture, or a piece of paper that was being measured, then to that extent the trustworthiness of that yardstick as a measuring device would be impaired.
Within the range of objects for which the measuring instrument is intended its function must be independent of the object of measurement.


We must ascertain similarly the range of applicability of our method of measuring attitude. It will be noticed that the construction and the application of a scale for measuring attitude are two different tasks. If the scale is to be regarded as valid, the scale values of the statements should not be affected by the opinions of the people who help to construct it. This may turn out to be a severe test in practice, but the scaling method must stand such a test before it can be accepted as being more than a description of the people who construct the scale. (Thurstone, 1928a, 547)

But if this emphasis on modeling and experimentation as a necessary basis for generalization was the teeth behind Thurstone’s approach to measurement, in his applied work he was always pragmatic. For example, although Thurstone viewed the approach to attitude measurement using a set of survey items administered in sequence as a crude approach relative to a paired comparisons experiment, for the practical purposes of making social attitudes something that was subject to quantitative evaluation, this was a sacrifice he was prepared to make.
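The paired-comparisons alternative can be illustrated with a small sketch of what is conventionally called Case V of the law of comparative judgment, the simplification in which all discriminal dispersions are equal and the correlation terms are zero. The proportion matrix below is hypothetical, and the estimation shown (column means of unit normal deviates) is the standard textbook procedure rather than Thurstone’s exact computation:

```python
from statistics import NormalDist

def case_v_scale(p):
    """Estimate scale values under Case V: equal discriminal dispersions
    and zero correlations, so S_j - S_i equals the unit normal deviate
    z_ij for the proportion of judgments preferring j to i. Scale values
    are the column means of the z matrix (the scale mean is set to 0)."""
    nd = NormalDist()
    n = len(p)
    z = [[nd.inv_cdf(p[i][j]) if i != j else 0.0 for j in range(n)]
         for i in range(n)]
    return [sum(z[i][j] for i in range(n)) / n for j in range(n)]

# hypothetical proportion matrix: p[i][j] = share of judges preferring j to i
p = [[0.5, 0.7, 0.9],
     [0.3, 0.5, 0.8],
     [0.1, 0.2, 0.5]]
s = case_v_scale(p)
print(s[0] < s[1] < s[2])  # True: stimulus 3 sits highest on the scale
```

Note that the unit here is the (common) discriminal dispersion itself, which is exactly the internal, experimentally grounded unit that the cruder survey method lacks.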

9.6 Likert Scales

By the early 1930s, it seems that not just psychologists but also a wide variety of social and behavioral researchers were happy to accept Thurstone’s assertion that attitudes could be measured. But in making this case, Thurstone may have won the battle but lost the war, because what social and behavioral researchers appeared less willing to do was pay the price in research diligence that Thurstone expected. That is, they wanted the practical utility of expedient methods and analyses, and they surely wanted the prestige that claims of measurement ensured but not at the cost of undertaking long-term programs of experimentation with invariance as a condition for generalizable measurement. Thurstone’s (1952) irritation along these lines was easy enough to perceive in the culminating pages of his autobiography:

There was heavy correspondence with people who were interested in attitude measurement, but they were concerned mostly with the selection of attitude scales on particular issues to be used on particular groups of people. There seemed to be very little interest in developing the theory of the subject. The construction of more and more attitude scales seemed to be unproductive, and I decided to stop any further work of this kind. Incomplete material for a dozen more attitude scales was thrown in the wastebasket and I discouraged any further work of that kind in my laboratory. I wanted to clear the place for work in developing multiple factor analysis. . . . The excuse is often made that social phenomena are so complex that the relatively simple methods of the older sciences do not
apply. This argument is probably false. The analytical study of social phenomena is probably not so difficult as is commonly believed. The principal difficulty is that the experts in social studies are frequently hostile to science. They try to describe the totality of a situation and their orientation is often to the market place or the election next week. They do not understand the thrill of discovering an invariance of some kind which never covers the totality of any situation. Social studies will not become science until students of social phenomena learn to appreciate this essential aspect of science. (312)

The case of Rensis Likert seems like a good example of what Thurstone was alluding to when he wrote of those more concerned with the selection of attitude scales for particular issues to be used with particular people than with the development of the “theory of the subject.” Likert (pronounced LICK-urt) had just completed a doctoral dissertation (1932) and then a publication that summarized key findings from the dissertation (Likert, Roslow, & Murphy, 1934) as Thurstone was pausing his work using the law of comparative judgment and its extensions as a basis for attitude measurement. In his dissertation, Likert argued that one could avoid the time-consuming steps of constructing a survey scale using Thurstone’s method of equal-appearing intervals. Recall that this approach required a convening of raters to group opinion statements into ordered categories and then retaining the items with the best discriminatory properties. The alternative Likert had in mind was to pose positively and negatively worded opinion statements and let subjects locate themselves, item by item, on a response scale continuum of five graded categories from strongly disagree to strongly agree (with neutral as the middle category).
One need only reverse score the negatively worded items and then take the sum or average of all survey item responses to place all respondents onto an attitude scale. Using this approach with subsets of the opinion statements Thurstone and his colleagues had previously created and tested, Likert was able to show that scale values for respondents were strongly correlated irrespective of the approach used to score the different styles of opinion statement responses. Meanwhile, the Likert scale scores were more reliable and took far less time to construct and administer. The Likert approach eventually became the dominant paradigm in the construction of attitude surveys, and this continues to be the case to this day. Had Likert actually read Thurstone’s work in any detail? Did he understand any of the theoretical issues involved in the measurement of psychological attributes? If he did, he gave no indication in his published work. After a four-year stint as the director for research at the Life Insurance Sales Research Bureau from 1935 to 1939, Likert went on to organize and direct the Division of Program Surveys for the American federal government’s Bureau of Agriculture Statistics. He stayed in this role through 1946, at which point he and some of
the associates he had recruited to the bureau founded the Survey Research Center at the University of Michigan. Likert held a faculty position as a professor of psychology at the university until his retirement in 1970. During this time, while he built up the survey operations of his new center, Likert’s publications tended to focus predominantly on the topic of human management. Somewhat remarkably, it appears Likert never followed up on his initial research contrasting Thurstone’s methodological approach to his own in the context of attitude measurement, nor was this a topic explored by any of his associates at the University of Michigan. Likert was known and praised for his pragmatism, and his “Likert Scale” survey has remained the most enduring instantiation of this pragmatism.10
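Likert’s scoring rule is simple enough to state in a few lines. The sketch below (hypothetical responses; the function name is my own) codes each item 1 to 5 and reverse scores the negatively worded items before averaging:

```python
def likert_score(responses, negatively_worded):
    """Average item score after reverse scoring negatively worded items.
    Responses are coded 1 (strongly disagree) through 5 (strongly agree),
    so reverse scoring maps r to 6 - r (1<->5, 2<->4, 3 unchanged)."""
    adjusted = [6 - r if neg else r
                for r, neg in zip(responses, negatively_worded)]
    return sum(adjusted) / len(adjusted)

# hypothetical respondent: four items, the second and fourth negatively worded
resp = [4, 2, 5, 1]
neg = [False, True, False, True]
print(likert_score(resp, neg))  # 4.5 — a consistently favorable respondent
```

The contrast with Thurstone’s procedure is plain: no panel of raters, no item scale values, and no empirically grounded unit — just an arithmetic summary of the graded responses themselves.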

9.7 Thurstone’s Legacy

In the psychophysical methods that started with Fechner in the 19th century and were elaborated by Thurstone in the first half of the 20th century, we can find almost all the foundational seeds that would turn into the psychometrics of the 21st century: the modeling of item responses as a probabilistic function of item characteristics, the concepts of item difficulty and discrimination, the invariance of parameters in a statistical model, the restrictiveness of assumptions imposed on a statistical model, scaling decisions and scale comparability, issues at the heart of what it means for a measure to be valid and reliable. Interestingly, Thurstone’s reconceptualization of psychophysics may have had its most notable impact in spurring methodological innovations for the prediction of choice behavior in behavioral economics. This was an application that occurred to Thurstone (1945) almost 20 years after his initial formulation of the law of comparative judgment. Instead of constructing a psychological scale to compare the distances of stimuli, he could focus on the “obverse psychophysical problem” (Thurstone, 1952, 309) of using the scale to make inferences about the probabilities of known choices. This had obvious applications to the study of consumer preferences, and Thurstone would demonstrate how the method could be used to predict food preferences from hypothetical menu combinations (Thurstone, 1959, 161–169). Others picked up this thread and wove it into a fairly intricate tapestry over the next half a century (Bradley & Terry, 1952; Luce, 1959; Krantz et al., 1971). Thurstone’s most important legacy related to educational measurement may come from the influence he had on his students and their students down through generations. There is a very impressive academic lineage that can be traced to Thurstone (Wijsen, Borsboom, Cabaço, & Heiser, 2019).
Four of his advisees—Ledyard Tucker, Harold Gulliksen, Paul Horst, and Clyde Coombs—would go on to become presidents of the Psychometric Society (Horst had actually cofounded the organization with Thurstone and John Stalnaker), and all made significant contributions to educational measurement (Tucker, Gulliksen), factor
analysis (Horst), and psychological scaling (Coombs). Dorothy Adkins, whom Thurstone had hired as a research associate in 1938, would go on to become the first female president of the Psychometric Society in 1949. Another notable advisee of Thurstone’s was Lyle Jones, who would go on to direct the Thurstone laboratory at the University of North Carolina, Chapel Hill, and play a prominent role in the design of the National Assessment of Educational Progress. Harold Gulliksen, who wrote the first comprehensive treatment on classical test theory (Gulliksen, 1950), had used Thurstone’s unpublished textbook The Reliability and Validity of Tests as his starting point, and the influence of Thurstone on his narrative style is hard to miss. Thurstone’s academic “grandchildren” can be said to include Samuel Messick, Bert Green, Warren Torgerson, and Frederic Lord. Perhaps the clearest token of Thurstone’s influence is that he so frequently serves as a bridge between different theories of measurement and psychometric schools of thought. Disagreements about the nature of measurement in the human sciences and the methods by which it should be practiced have raged in the past and will surely continue (cf. Michell, 1997; Wright, 1997; Sijtsma, 2012). But ask any self-identified psychometrician to read Thurstone’s publications, and they are likely to emerge with the belief that Thurstone is part of their intellectual lineage.

9.8 Sources and Further Reading

For biographical information about Thurstone, I have relied primarily on Thurstone (1952), Wolfe (1956), Guilford (1957), Jones (2007), and Bock (2007). For biographical information on Thelma Gwinn Thurstone, see Jones (1996). There are three publications by Thurstone related to the content of this chapter that I would recommend as required reading for anyone interested in the study and practice of psychometrics: Thurstone (1925, 1927a, 1928a). In reflecting on his career for his 1952 autobiography, it is interesting to see that Thurstone expresses the greatest sense of pride in his 1927 publication reconceptualizing psychophysics as the basis for his broadened approach to psychological measurement, as opposed to his more well-known contributions to factor analysis. For those interested in further reading on the developments of educational and psychological scaling that followed Thurstone, I recommend Warren Torgerson’s 1958 book Theory and Methods of Scaling and a chapter I wrote for the edited volume History of Educational Measurement entitled “A History of Scaling and Its Relationship to Measurement” (Briggs, 2021).

Notes

1 In his letter, he sketched the outlines of a proposal for a dam as a solution to the energy needs near Niagara Falls. Thurstone made the proposal with an eye toward the constraint that the dam could not cause a significant blight on the environment.


2 Using Thurstone’s original notation, he initially presented the full equation for his law of comparative judgment as either S_k − S_a = X_ka √(σ_k² + σ_a² − 2 r_ka σ_k σ_a) (1927a) or S_1 − S_2 = x_12 √(σ_1² + σ_2² − 2 r σ_1 σ_2) (1927b). The first expression was meant to be consistent with a scenario of the constant method, where in any pairwise comparison, it was important to distinguish between the standard stimulus (indexed with the subscript a) and the variable stimulus (indexed with the subscript k). In the second expression, he dropped this distinction in the subscripts because he wanted the equation to be general to a situation in which any stimulus could be considered either the standard or the variable. The terms X_ka and x_12 represent the value of the unit normal deviate that corresponds with the probability of stimulus k > a or stimulus 1 > 2.

3 Interestingly, Thurstone regarded the requirement in Case 1 that the discriminal processes of a single observer follow a normal distribution as a “definitional” feature of the psychological scale he was constructing, whereas normality represented an assumption that may or may not hold for a group of observers under Case 2.

4 “A big-time, but honest gambler has to prevent his younger brother from following in his footsteps, and taking up gambling.” See www.imdb.com/title/tt0021420/.

5 “A bootlegger on the run from the law hides out on a college campus. He disguises himself as a student, and soon becomes the school’s star athlete and most popular man on campus.” www.imdb.com/title/tt0020971/

6 There is an inconsistency in Thurstone (1931b) regarding the choice of unit for the common discriminal dispersion. Thurstone’s equations show a common comparatal dispersion with a unit of √2 (implying that the unit of measurement was fixed to be 1 for an individual discriminal dispersion), but the results of his actual computation are clearly based on a comparatal dispersion of 1 (implying discriminal dispersions of 1/√2). Thurstone is actually quite consistent in preferring to set the unit in terms of a target discriminal dispersion as opposed to the comparatal dispersion, so it’s a significant inconsistency. Nonetheless, to make the computations match the equations, I present equations that are consistent with the computations Thurstone actually reported, so I exclude the √2 factor.

7 This was actually something of a surprise to Thurstone and his research team since in the movie the gambler had been portrayed in a very sympathetic light. They had expected students to see gambling as a less severe offense following the film.

8 Thurstone offered limited details regarding opposition or disagreements with his proposed methods. He cites no published criticisms, or the grounds for these criticisms, beyond an occasional footnote. But it is clear that his approach was considered controversial and at the very least a major departure from conventional methods for studying social psychology. Perhaps in response, Thurstone offers a prickly disclaimer in the second paragraph of his 1928a article: “In promising to measure attitudes, I shall make several common-sense assumptions that will be stated here at the outset so that subsequent discussions may not be fogged by confusion regarding them. If the reader is unwilling to grant these assumptions, then I shall have nothing to offer him. If they are granted, we can proceed with some measuring methods that ought to yield interesting results” (529).

9 As Luce (1994) writes in a review of Thurstone’s original publications on psychophysics, “[o]ne curiosity in all these articles, which was typical of many but by no means all journal articles of that era, is how little cross referencing there is among his articles or, for that matter, to any other literature. Other people are sometimes mentioned, but only rarely are specific references provided. It is as if all readers were
assumed to be familiar with the entire body of relevant publications. By today’s standards, Thurstone’s articles seem unscholarly” (271).

10 In a reflection on Likert’s career following his death, titled “Rensis Likert: Social Scientist and Entrepreneur,” his colleague Leslie Kish, himself a very accomplished statistician with a specialization in sampling methods (e.g., Kish, 1965), described Likert’s contribution to attitude surveys in the following way: “His scale exemplifies Likert’s pragmatic, engineering approach to problems. He showed by empirical comparisons that his simple 5-point scale—the Likert Scale—gave statistical results very similar to those from the much more cumbersome, though theoretically more elegant, Thurstone procedure” (Kish, 1990, 36).

10 REPRESENTATION, OPERATIONS, AND THE SCALE TAXONOMY OF S. S. STEVENS

DOI: 10.1201/9780429275326-10

10.1 Overview

At the outset of this book, one of the four prospective answers provided to the question, “What is measurement?” was what I labeled as the psychological definition: that measurement is the “assignment of numerals to objects or events according to rules.” In this concluding chapter of the book, we now turn to the origin story of this definition. To some extent, we are coming full circle. It was in Gustav Fechner’s psychophysics of 1860 that we began with a formal attempt to offer up a theory of measurement that could encompass both physical and psychological attributes, and it was in S. S. Stevens’s revival of psychophysics between 1930 and 1970 that an even broader theory would be articulated. As we saw in Chapter 2, Fechner’s program of psychophysical measurement was not without controversy, precipitating the “quantity objection”. Among the pioneers we have encountered in these pages, only Binet seems to have directly acknowledged the quantity objection when conceding that his measuring scale of intelligence could not constitute measurement in a “mathematical” sense because “intellectual qualities are not super-posable” (Binet, 1975 [1909], 102; Binet & Simon, 1916 [1905b], 40–41). In contrast, to the extent that Galton and Spearman were familiar with the quantity objection, it was either not something that concerned them or it was at least not something they felt much need to address in their writing. Galton seems to have believed that since all human attributes are predominantly inherited, it was plausible to conceive of them as if each was the outcome from a sum of random variables. It followed that the cumulative distribution of this sum across individuals would be well approximated by the normal ogive. But the basis for Galton’s assertion
was little more than fiat. Missing altogether was a method for testing such an assumption. Spearman’s general model of human cognition had also featured the hypothesis that mental energy was a quantitative attribute, but the testable aspect of Spearman’s theory of two factors had been in the identification of this attribute as the general factor that explained the intercorrelation among test scores. There was nothing in Spearman’s methods that allowed for the corroboration that g was a quantity—it was simply treated as such. Thurstone went the farthest in sketching out a falsifiable methodological approach for the measurement of psychological attributes. But it largely followed the same tack: that if it was possible to conceptualize order for a psychological attribute—that if one could imagine a person having more or less of some attribute—then with a bit of creative fiddling with the cumulative normal distribution function, it would be possible to construct a quantitative scale of measurement for the attribute. Nonetheless, even as they may have been stretching the boundaries of measurement to include psychological attributes, Fechner, Galton, Spearman, Binet, and Thurstone recognized that the methods they were introducing for the measurement of psychological attributes provided only a first-order approximation to the quality of measurement possible in the physical sciences. Of course, part of the reason they called what they were doing measurement was to give their methods the stamp of good science. “Good science” required mathematical laws, and mathematical laws required measurement. But they also appear to have convinced themselves that there were some parallels to the challenge of measuring intensive attributes in both physics and psychology.
They would point to the early attempts to measure attributes such as time, temperature, and electric charge and make the connection that such efforts tended to begin with some observation of order and assumption of an underlying quantity responsible for that order. As the need for measurement of physical attributes was never in the abstract but to satisfy a particular need, imperfect measurement could always be justified for practical reasons. Why not apply the same justification to the measurement of psychological attributes? The drumbeat of measurement terminology that accompanied some of their remarkable methodological innovations surely contributed to a widespread impression that the quantity objection had been circumvented. Galton wrote of measuring a man’s character (Galton, 1884), Binet described his approach to diagnosing children as the application of a measuring scale (Binet & Simon, 1905/1916), and many of Spearman and Thurstone’s most famous publications gave measurement a prominent role in their titles (e.g., Spearman, 1904b, 1904c, 1927a; Thurstone, 1927c, 1928a, 1959). By the time we reach the intelligence testing movement in the United States during the 1920s, the only true common denominator between the sorts of activities being described as measurement in the human sciences was to a great extent the involvement of numeric
assignment according to rules. In this sense, Stevens’s definition served to codify what was already becoming the de facto conventional wisdom of practice. The controversial features of Stevens’s definition become apparent when it is compared with the classical understanding that had been articulated by Fechner: that the measurement of a quantity consists in ascertaining how often a unit quantity of the same kind is contained in it. Relative to this, measurement according to Stevens seems to be a different and much more general activity. The comparison of a quantity to a unit has been replaced by the assignment of numerals according to rules. And it is “numerals” that are “assigned” directly to objects or events. Yet in physical measurement, it is never objects or events that we measure but some attribute of the object or event, and it is not a numeral that we assign but rather a number that we discern. Perhaps most importantly, without any further qualification as to the “rules” that govern numeric assignment, the Stevens definition would appear to include virtually any activity in which a number can be attached to an object or event as a case of measurement. This is, in fact, exactly what Stevens had in mind—the only rule that was inadmissible for measurement was random assignment. If the Stevens definition is taken at face value, classifying people by eye color or ranking them by test scores would both be instances of measurement in the same sense as taking their height with a stadiometer or their temperature with a thermometer. But as we will see in this chapter, what Stevens was proposing was not so much a definition of measurement as a theory of scales of measurement. Stevens defined all numeric assignment as measurement, but measurements could differ in their strength of scale, and this could limit their subsequent utility as tools for mathematical and statistical analyses.
According to Stevens’s theory, when, for example, objects have been measured on an ordinal scale, it is not meaningful to compute and compare the arithmetic mean for different subsets of objects. The reason is that if the numbers attached to objects only convey something about their order, then the magnitude communicated in a comparison of means will be arbitrary: different numbers could be assigned to the objects that would convey the same thing about their order but not the same thing about magnitude. Prior to Stevens’s theory, a fundamental question of any measurement enterprise involving a psychological attribute had been whether and how the attribute could be measured. Now, it would seem, the question had changed to, “How does one construct and evaluate a rule for numeric assignment that produces a measurement scale with ratio or interval properties, as opposed to one with only ordinal or nominal properties?” Stevens, as it turns out, had worked out an answer to this question through a program of experimental research that revolved around an operational psychophysical procedure known as direct magnitude estimation. To make sense of Stevens’s approach, we will need to first understand the strategy that Stevens was taking in his efforts to stitch together a theory for measurement that included
a combination of elements from representationalism, operationalism, and pragmatism (Michell, 1993). Stevens’s operationalism and concept of number have received a thorough treatment in Michell (1999), and I delve into these issues here as well. The new ground that I cover comes in illustrating Stevens’s epistemology as embedded within his research context in psychophysics. It is through this that we are best able to appreciate both the strengths and weaknesses of the empirical commitments that underwrote his definition of measurement and accompanying scale taxonomy. We will see how Stevens went about building an argument for the validity of his psychophysical measurement approach, and how this argument changed as it was subjected to critical scrutiny. One thing we will discover in the process is the emphasis Stevens came to place on the concept of measurement invariance—“Do the results remain the same once the experimenter’s back is turned?” This was a thread that linked Stevens backward in time to Thurstone and Fechner and connects him forward to traditions that emerged outside of psychophysics in the item response theory traditions of Rasch (1960) and Lord (1980). In the next section, I provide some background on Stevens’s academic path and the unique features of his personality and circumstances that put him in a position to popularize a broadened conceptualization of measurement in the human sciences. Following this, we will examine the proximal event that led to the Stevens definition: a reemergence of the quantity objection to psychophysical measurement in the 1930s, personified by the physicists on the Ferguson Committee, most notably Norman Campbell. Campbell developed a theory of measurement as numeric representation that featured formal laws marking a boundary between what was and what was not measurable. Under Campbell’s theory, psychophysical measurement was impossible. Campbell was at once the influence and the foil that led Stevens to propose his own theory.
From here we turn to a closer look at the way Stevens used operationalism to arrive at his preferred method for measuring: direct magnitude estimation. We then take up the controversial nature of Stevens’s approach in the form of the major criticisms it received from contemporaries (Garner, 1958; Warren & Warren, 1963; Prytulak, 1975) and, more recently, from Michell (1999) and McGrane (2015). The chapter concludes with a reflection on Stevens’s legacy.

10.2 Stevens’s Background

The path that led Stevens to a professorship at Harvard and then kept him there was neither linear nor predictable. Stevens was born and raised in Utah and spent the majority of his youth first in Salt Lake City and then in Logan, where he was able to enjoy a frontier lifestyle through the extended polygamous family of his grandfather. Neither of his parents had a college education, and he appears

FIGURE 10.1 Stanley Smith Stevens (1906–1973). Source: © Emilio Segrè Visual Archives.

to have been a relatively uninterested student through high school, although already at a young age he showed a clear interest in debate. Shortly after his high school graduation, both of his parents died unexpectedly (his mother from a stroke and his father, at age 42, from a car accident), and the life insurance money from his father’s death was used first to fund a three-year Mormon mission to Belgium and then his enrollment at the University of Utah. Details about his time at the University of Utah are scarce, but he must have enjoyed some academic success there, as within two years, in 1929, he was able to transfer to Stanford University. His ambition had been to become a writer or an artist, but while at Stanford, he was never quite able to decide on a major that suited him. He tended to take courses in philosophy and the humanities and took only a single course in psychology (one that made a mostly negative impression on him). On something of a lark he enrolled in freshman physics, chemistry, and biology courses during his senior year, and this convinced him that his “previous studies were fitting me to talk about anything, but to do nothing” (Stevens, 1974, 430). Somehow, Stevens was able both to graduate from Stanford without ever having declared a major and to gain admission to Harvard Medical School for the fall of 1931 under the conditions that he pay a registration fee of $50 and take a course in organic chemistry over the summer. Stevens would write that “neither
condition seemed attractive” and, while vacillating over the opportunity, he missed the registration deadline. The results of a vocational interest test and a conversation with a Stanford professor gave him the push to consider experimental psychology as a candidate for an advanced degree, so after a summer taking three courses in philosophy, psychology, and statistics at the University of Southern California, Stevens made his way to Cambridge with his wife, Maxine, whom he had married the year before. Stevens concocted a plan to get himself into Harvard, targeting admission at the School of Education on the grounds that it had the cheapest tuition. Stevens showed up at the education building to register and was told this would be impossible because he had not applied for admission. Somehow, Stevens was able to talk his way in. During his first year of study, Stevens enrolled in a psychology course on perception that was being taught by E. G. Boring. Stevens basically flunked the course but was given a passing grade because Boring had lower expectations for graduate students outside of psychology. It was between enrolling in the course and failing it that Stevens seems to have discovered the interest in psychophysics that would come to define his career. He had asked Boring for a problem he could work on outside of class, and Boring had suggested he explore the impact on visual perception of mixing paint colors and then varying the spatial distance from which the colors were to be viewed. Stevens took this on, and after finding some evidence that supported a functional relationship, he had his first genuine sense of scientific discovery. Stevens ultimately rebounded from his failure in Boring’s class by passing a three-hour preliminary exam that served as the gateway for students wanting to pursue a PhD in experimental psychology.
With this success in hand, he transferred from the School of Education to the Department of Philosophy in 1932 (where psychology was at the time a subfield) and became Boring’s advisee. One year later, he had three manuscripts in press and had already defended a thesis, but he had not yet met the requirements for a PhD, having completed only a single year in the same department (two were needed). But once again, Stevens’s power of persuasion seems to have won the day,1 and by the spring of 1933, he had his PhD. In the years at Harvard that followed his PhD in 1933, Stevens spent two years as a postdoctoral fellow studying physiology at the Harvard Medical School and then became a research fellow in the physics department. From there he rejoined the psychology department (which had by then separated itself from philosophy) as an instructor and was promoted to an assistant professorship by 1938. By 1946, he had become a full professor, and by 1949, he was serving as director of both a specialized psycho-acoustic laboratory and the more general psychological laboratories, a role he maintained through 1962. In 1962, at his own request, he assumed the title of “Professor of Psychophysics,” the world’s first by his own estimation. It was during the period from 1934 through 1940 that Stevens became preoccupied with the nature of measurement and measurement scales (Stevens, 1974, 436). The motivation for this seems to have come from a variety of sources:
his exposure (through Boring) to psychophysics, his awareness of the contested status of psychophysical methods as producing measures of sensation intensity (i.e., the convening of the Ferguson Committee), and his ongoing participation in the meetings, lectures, and discussions on the philosophy of science since his arrival at Harvard in 1931. By 1935, Stevens had begun to carve out a philosophical rationale for measurement in psychology that was premised on his reading of operationalism (Bridgman, 1927), and by 1936, he had published one of his first applications of the rationale toward the measurement of loudness (Stevens, 1936). This and other experimental studies on the measurement of auditory sensation led to the sone scale for loudness (Stevens & Davis, 1938), something Stevens would later claim as an example of measurement on a ratio scale. By all accounts, “Smitty” Stevens was a difficult man to know (Miller, 1975; Stevens, 1974). An introvert, he was happiest when engaged in the often solitary pursuits that fired his imagination. In his interactions with others as a student, instructor, professor, and administrator, he was frequently arrogant and lacking in empathy. The courses he taught were unpopular because he put little effort into making them accessible—he was not a good teacher in a classroom setting and had no interest in changing his ways. He was socially conservative, monetarily stingy, and universally demanding. But in the right setting, he could also be charming and charismatic. If the topic was of intellectual interest, he was giving of his time to a fault. Those who approached him with a problem and could weather the storms of his temperament found themselves in a stronger position in the aftermath. He led by example, typically working 14-hour days, and was more than willing to get his hands dirty with the intimate details of experimentation and data analysis.
He was dogged and opportunistic in a way that got him from point A to point B as expediently as possible. It was this single-mindedness that may explain how Stevens was successful not only in charting a new course for psychophysics but also in promoting and popularizing a new definition of measurement and scale taxonomy. Stevens also had a wonderful way with words, and to a great extent, it seems that beyond the strength of his actual program of research, it was the combination of preparation, strategy, and rhetorical skill that explains the breadth of his impact. Much like Spearman, Stevens seemed to relish academic combat, which is perhaps why his contributions to the theory and practice of measurement were controversial at the time and remain so many decades later.

10.3 Norman Campbell and the Representational Approach to Measurement

10.3.1 Fundamental and Derived Measurement

By the early 1930s, the British physicist Norman Robert Campbell had begun to exert considerable influence in emphasizing additivity as the requirement for measurement (Campbell, 1920, 1928). Campbell’s view was that there had to
be more to an understanding of measurement than the comparison of some target magnitude to a unit of the same magnitude. If this was all there was to measurement, Campbell argued, then all that was left to discuss were different proposals for the unit to be used as a basis for comparison. Campbell saw the latter as primarily a concern about proper instrumentation, and this left unanswered the question of the necessary and sufficient conditions that had to be in place for an attribute to be measurable in the first place. In his desire to establish measurement as a distinct activity within the province of physics, Campbell was trying to reverse engineer a formal understanding of measurement from the physical laws that had been previously established and applied to demonstrably successful ends. These laws took on a numerical form, typically involving multiplicative relationships between two or more independent variables that predicted a dependent variable. If measurement was to be the precondition for the use of mathematics to model and explain phenomena in the natural world, what was central to the measurability of the variables that entered into these laws? To understand what motivated and influenced Stevens, we need to first understand the basics of Campbell’s theory of fundamental and derived measurement. Campbell’s theory established addition as the necessary and sufficient condition for the “fundamental” measurement of a physical property. In this sense, what Campbell was proposing was well in line with the classical literature on the measurement of extensive attributes (Michell, 1999; Tal, 2020). Campbell asserted that because the magnitudes of physical attributes were not inherently numerical, measurement must occur through a two-stage process in which, during a first stage, we ascertain that an attribute of interest can be ordered and added and, then, in a second stage, we assign numerals to distinct magnitudes of the attribute.
To this end, Campbell would posit the formal conditions for the order and addition of magnitudes as “laws” and the process for assigning numerals to these magnitudes as “rules.” Campbell viewed a law as a factual statement about things that could be observed as part of an experimental investigation, while a rule was the process used to assign numerals to objects conditional on the validity of the law. To make this more concrete, let A, B, and C represent three different objects that are subject to scientific investigation.2 Examples of objects could include mineral specimens such as gypsum, quartz, and topaz; liquids such as honey, water, and vegetable oil; or even different electrical currents being transferred through a metallic conductor. For each of the objects A, B, and C, there is an attribute that we may wish to measure. Insights about this attribute come from the relations that can be observed between the objects. For at least some physical attributes, such as length, the relation to be observed between objects is itself a direct instance of the attribute. This is evident when a collection of iron rods is being compared and we observe whether one is longer than another. For other attributes of interest, the connection between attribute and relation is less
direct. For example, if the attribute of hardness is of interest, a possible relation to be observed is the act of scratching or being scratched. If mineral A is scratched by B, does it leave a visible mark? If the attribute is density, a possible relation to be observed might be buoyancy—when one liquid is added to another, does it sink or float? Campbell argued that even when a relation appeared to provide direct evidence of an attribute, this evidence could only be judged through the operational means by which it was being gathered. This was a notion that would also become central to Stevens’s theory. Campbell’s requirement for order among relations was transitivity:

If A > B and B > C,  (10.1)

then A > C.  (10.2)

The two propositions in Equation 10.1 state that if A stands in a certain relation to B and B stands in the same relation to C, then this implies the result in Equation 10.2, that A stands in that relation to C. The specific relation shown here can be read as “greater than,” but only as an observational shorthand; we could have specified the relation with respect to < instead. Together, (10.1) and (10.2) impose the requirement that order is both transitive and asymmetric. There is an important exception, and that is the case in which two substances are indistinguishable. In this case,

If A ≯ B and A ≮ B,  (10.3)

then A = B.  (10.4)
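Campbell’s order conditions lend themselves to a mechanical check. The sketch below is my own illustration, not a procedure from the text: invented “hidden” magnitudes stand in for what an experiment (scratching minerals, floating liquids) would reveal, and the pairwise observations are tested against the order conditions stated above.

```python
# A sketch: check whether pairwise observations among objects satisfy the
# order conditions of Equations 10.1-10.4. Objects and magnitudes invented.
from itertools import permutations

hidden = {"A": 3.0, "B": 2.0, "C": 1.0}

def observed_gt(x, y):
    """True when object x is observed to stand in the > relation to y."""
    return hidden[x] > hidden[y]

def order_holds(objs):
    # Transitivity (10.1-10.2): if A > B and B > C, then A > C must be observed.
    for a, b, c in permutations(objs, 3):
        if observed_gt(a, b) and observed_gt(b, c) and not observed_gt(a, c):
            return False
    # Asymmetry, with equality (10.3-10.4) as the case where neither direction holds.
    for a, b in permutations(objs, 2):
        if observed_gt(a, b) and observed_gt(b, a):
            return False
    return True

print(order_holds(["A", "B", "C"]))  # True: these observations can be ordered
```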

Any set of objects can be ordered if it is possible to establish by observation, per Equations 10.1 through 10.4, the relations > or < (greater vs. less, higher vs. lower) as well as the relation = (equality). So far, numbers have not entered the picture. For Campbell, this happened in a second step according to a specified rule of representation in which numerals were assigned to distinct objects to represent a targeted attribute. Now, if all that could be demonstrated was order, Campbell was not terribly interested in a subsequent rule for numeric assignment, since it would make the use of the resulting numbers in support of physical laws unacceptably ambiguous. That is, if A, B, and C can only be ordered, the numerical assignment of 1, 2, 3 would be no more or less valid than 3, 2, 1 or 1.1, 1.2, 1.3. To resolve this ambiguity, Campbell emphasized that for fundamental measurement to occur, it must be possible not only to order objects by their relations but also to combine them physically in a manner that accords with the mathematical practice of addition. Hence, if A and B are objects with some common property, it must be possible to form a third object such that the property
of the third object is the combination of the two properties of A and B. Campbell expressed this symbolically as A + B = (A + B), where “+” indicates a combination and (A + B) represents a new body that is distinct from A and B in isolation. He also introduced the symbols A′ and B′ to denote distinct bodies that are equal to A and B, respectively. With this in place, Campbell’s second law necessary for the measurement of a magnitude consisted of four conditions:

A + B = B + A,  (10.5)

A + B > A′,  (10.6)

A + B = A′ + B′,  (10.7)

(A + B) + C = A + (B + C).  (10.8)

Campbell’s choice of notation for his conditions for order and addition, >, A′ should be read as stating that given body A and A′, if A is combined with object B, the new object that results will be diferent from A′ with respect to the property being observed. Along similar lines, the symbolic relationship (A + B) + C = A + (B + C) means that if object A and B are combined frst and then this is combined with C, the resulting body will not be distinguishable from one created by frst combining B and C and then combining A. Just as with his law for order, with the law for addition, there was an associated rule for numeric assignment, and to this rule, Campbell devoted much more attention since it was to be the culminating event of fundamental measurement. This rule required the formation of a standard series, established by choosing an arbitrary object as the standard and assigning it a numeral, and then creating a series by duplicating the standard, combining the duplicates, and assigning a new numeral. Conventionally, the standard might be assigned a 1, then the frst duplicate a 2, and so on. To avoid gaps in the standard series, partial standard series are formed by choosing an object that is smaller than the standard and then repeating the process of duplication and numeric assignment within that standard and so on. The fneness of the standard sequence could be chosen relative to the need for accuracy in the intended use. The practice of measurement, then, becomes a matter of collecting a new series of magnitudes for a particular attribute and then equating any magnitude in the series to a magnitude in the standard series. Campbell used the term numeral instead of number because he viewed a number itself as the physical attribute that is manifested whenever objects of the same kind are aggregated (e.g., the count of 100 vs. 150 cofee beans), while a numeral was the symbol associated with this number. 
Campbell’s complete list of magnitudes for which fundamental measurement was possible by meeting the conditions in his two laws (which he called
“A magnitudes”) was a short one: number, mass, length, duration of time, electrical resistance, and angle. The limiting factor was the ability to demonstrate that the conditions in Equations 10.5 through 10.8 held. The three examples Campbell would use as illustrative of fundamental measurement through the process of physical addition included length (“placing end to end in a straight line”), mass (“connecting so as to form a single rigid body”), and electrical resistance (“connecting in series”). In contrast, while one could demonstrate order when comparing liquid substances according to the relation floating on top of, only certain liquids can be combined to create a new liquid; others, such as oil and water, will remain distinct. Campbell’s theory did allow for the derived measurement of magnitudes (which he called “B magnitudes”), the latter being possible for properties that could be shown to have a numerical relationship with one or more fundamental magnitudes. For example, density would be an example of a derived measure defined by the ratio between mass and volume, a ratio that will always equal a constant for any two substances that are equal in density, even if they differ in mass and volume. Campbell’s theory of measurement made an important distinction between the task of establishing that an attribute is measurable and the task of assigning numerals as part of a calibrated standard series. Also important, although Campbell regarded the choice of the standard used to define a unit for measurement within a standard series as arbitrary, the actual meaning of the unit could not be arbitrary. That is, it had to be possible to trace the rationale for the unit back to a physical law. The law could either come directly from an experiment demonstrating that the attribute in question could be both ordered and added, or it could come from a derived numerical law between other fundamental measures.
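The density example can be made concrete with a short sketch. The sample values below are my own (roughly those of gold): two samples differing in mass and volume agree in the derived magnitude, the ratio of mass to volume.

```python
# Sketch of derived measurement: density as a numerical law relating two
# fundamental magnitudes. Sample values are illustrative only.

def density(mass_g, volume_cm3):
    return mass_g / volume_cm3

small_sample = density(mass_g=19.3, volume_cm3=1.0)   # 1 cm^3 sample
large_sample = density(mass_g=96.5, volume_cm3=5.0)   # 5 cm^3 of the same substance

# The ratio is constant for equally dense substances, however much they
# differ in mass and volume taken separately.
print(abs(small_sample - large_sample) < 1e-9)  # True
```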
The implication of this position was that the common practice of measuring temperature using either the Fahrenheit scale (established in 1724) or the Celsius scale (established in 1742) was not a case in which the resulting magnitudes could be considered examples of derived measurement. To be clear, Campbell did regard temperature as something that could be cast as a derived measure using Boyle’s Law, but this had nothing to do with the actual units defined on the Fahrenheit and Celsius scales. These units had been defined for mathematical convenience after fixed points were identified (e.g., 1/100 of the distance between the freezing and boiling points of water for the Celsius scale). Thus Campbell viewed the thermometers of the time as “arbitrary” measuring devices (McGrane, 2015). In his 1920 book, Physics: The Elements, Campbell (1920) had defined measurement as the assignment of numerals to represent properties. He amended this definition in later work with the clause “in accordance with scientific laws” (1928) or “to represent facts and conventions about them” (1940) to make clear that not all properties (i.e., attributes) were necessarily measurable:

If this proviso were omitted, it would not be obvious why any property should not be measured; but actually some properties are measurable and others are not. . . . Measurable properties, or magnitudes as we shall call
them, are those things about which certain laws are true, the laws that remove the arbitrariness from the assignment of numerals and enable us to truly measure.
(1928, 1–2)

Campbell had made no attempt to apply his theory of measurement to attributes outside the realm of physics, and since it was reverse engineered with the success of physical laws in mind, it should have come as no surprise to find that psychological measurement seemed impossible to justify as either fundamental or derived. In most applications that involved the administration of test or survey instruments to produce “measures,” there was clearly no physical process of addition that could be applied to the psychological attribute of interest, and hypothetical laws that brought the psychological property into contact with observable variables were few and far between.

10.3.2 The Ferguson Committee

If there were laws to be found in psychology, the first, or at the very least the oldest, place to look for them was in psychophysics. The matter became a subject for formal and public academic debate in 1932 when the British Association for the Advancement of Science appointed a committee (the “Ferguson Committee,” since it was chaired by the physicist A. Ferguson) to “consider and report upon the possibility of Quantitative Estimates of Sensory Events.” The committee was composed of 19 members, a mixture of psychologists and physicists, including Campbell, and it released two reports, an interim report (Ferguson et al., 1938) and a final report (Ferguson et al., 1940). From the outset and for the duration of the meetings held by the committee, Campbell and his colleagues were dismissive of the attempts of psychophysicists to stake a claim to measurement, arguing that while psychophysical experiments often demonstrated that sensory perceptions based on different physical stimuli were ordered (at least on average), they did not (and for that matter could not) establish that, as an attribute, sensory intensity was additive. As Michell (1999) argues in his account of the debate between the physicists and the psychologists of the Ferguson Committee, the psychologists were unwilling to accept that the measurement of sensation was impossible, yet they were unable to offer up a competing theory that could explain why and under what circumstances measurement was possible. Interestingly, the psychologists were willing to concede that Fechner’s approach to the measurement of sensation through the stacking of just noticeable differences was “fallacious” because it only established the potential finding that differences in sensation (i.e., sense-differences) were equal, as opposed to establishing units of sensation intensity itself.
The counter to the physics camp of the committee along these lines would have been to point to indirect methods that could be used to demonstrate
that smaller sense differences had additive relations to larger sense differences, and as Michell suggests, one basis for this kind of counterargument was already available then (Hölder, 1901; Brown & Thomson, 1921), and another would be fully developed over the next three decades (Krantz et al., 1971). Instead, the defense of the measurability of sensation rested solely on the argument that Campbell’s requirements for measurement were too stringent and that, in essence, while what was being done might not be the same thing as physical measurement, it could nonetheless prove to be useful. In the end, the committee could not come to a consensus on the measurability of psychophysical sensations. The physicists on the committee were unified in their objections, objections rooted in Campbell’s orthodox position that measurement was unique to physics. In contrast, the psychologists on the committee maintained the conviction that psychological attributes could be measured, even if the theoretical basis for this conviction was something they had failed to adequately articulate. This stalemate of a result did not go unnoticed by Stevens, especially since the status of his sone scale of loudness had been used as a specific example in the deliberations of the committee.

10.4 Stevens’s Conceptualization of Measurement

10.4.1 On the Theory of Scales of Measurement

Broadening the Definition of Measurement

Although he would later claim that the ideas therein had been worked out as early as 1940, it was not until 1946, with the publication of On the Theory of Scales of Measurement in the journal Science, that Stevens set out to at once revive and resolve the debate of the Ferguson Committee. On the Theory of Scales of Measurement is really best appreciated as a four-page rhetorical essay. It begins with Stevens laying out the problem as he sees it (the failure of the Ferguson Committee to come to consensus about the meaning of measurement) and his proposed solution (to broaden the definition of measurement and instead focus attention on different classes of scales). For the remainder of the essay, Stevens presents and illustrates the central features of his scale taxonomy. The novel idea he introduces is the concept of mathematical group structure and the transformations that would keep comparisons of numbers in a given group structure invariant in their meaning. Some notable excerpts from the opening paragraphs of On the Theory of Scales of Measurement give the reader an appreciation for both the substance of Stevens’s (1946) argument and the style of his rhetoric:

For seven years a committee of the British Association for the Advancement of Science debated the problem of measurement. Appointed in 1932 to represent Section A (Mathematical and Physical Sciences) and Section

Representation, Operations, and Taxonomy

305

J (Psychology), the committee was instructed to consider and report upon the possibility of "quantitative estimates of sensory events" meaning simply: Is it possible to measure human sensation? Deliberation led only to disagreement, mainly about what is meant by the term measurement. An interim report in 1938 found one member complaining that his colleagues "came out by that same door as they went in," and in order to have another try at agreement, the committee begged to be continued for another year. . . . Paraphrasing N. R. Campbell (Final Report, p. 340), we may say that measurement, in the broadest sense, is defined as the assignment of numerals to objects or events according to rules. The fact that numerals can be assigned under different rules leads to different kinds of measurement. The problem then becomes that of making explicit (a) the various rules for the assignment of numerals, (b) the mathematical properties (or group structure) of the resulting scales, and (c) the statistical operations applicable to measurements made with each type of scale. (677)

Stevens would later present a more detailed form of this basic argument in a chapter written for the Handbook of Experimental Psychology in 1951 and in a chapter in his book Psychophysics, published posthumously in 1975. Between the two, very little about the theory he was proposing changed. His canny rendition of the deliberations of the Ferguson Committee in Stevens (1946) gave the impression that there was disagreement among both physicists and psychologists on the committee about "what is meant by the term measurement." In fact, there had been almost lockstep agreement among the physicists.
Within the opening page of the article, Stevens was already shifting the terms of debate from the meaning of measurement to the differences in scale properties that result once one concedes that "measurement exists in a variety of forms." That measurement must exist in a variety of forms followed from Stevens's "paraphrasing" of Campbell's definition of measurement to emphasize ("in the broadest sense") that measurement consists of numerical representation by rule. This was indeed an area where Stevens and Campbell shared a similar view; both men distinguished between the concept of number as the counting of aggregates and the coding of numerals. Since both Campbell and Stevens also saw the purpose of measurement as facilitating the mathematical comparisons of objects, the point of measurement was to find a way to represent nonnumeric properties of objects with numerals. However, as we have seen, Campbell had been clear that the rules for numeric assignment followed from scientific laws about the relations between objects, and that in this view true measurement existed only for a restricted class of physical attributes. Whether Stevens's rhetorical slipperiness in this regard was recognized by contemporary readers of Science at the time is unclear, but it was quite the sleight of hand to imply that what he was proposing was consistent with Campbell's theory.3


The Stevens Scale Taxonomy

The four main rows of Table 10.1 list each of the four distinct scale types that Stevens was introducing: Nominal, Ordinal, Interval, and Ratio. For each scale type, the next three columns present (1) the rule to be used to assign numbers4 to objects, (2) the mathematical group structure supported by the scale, and (3) the "permissible" statistics that could be used to summarize and compare objects or events with respect to each scale type. Let's take each of these scales in turn. A nominal scale is used to categorize objects into numerical values such that any fixed value represents the same attribute of the object; in other words, the numbers are used to determine whether one object is equal to another with respect to some designated attribute. The numbers have only symbolic value in characterizing different categories, and the point of such scales is usually to compare frequency distributions for each category. For example, people can be grouped by gender or country of birth, and specific numbers are then assigned to each group for convenience. Such comparisons will remain invariant under any one-to-one transformation of the original scale, making the numbers of a nominal scale the most flexible in their mathematical group structure. In contrast, a nominal scale is the most restrictive in the kinds of statistical comparisons it can support, as it will only

TABLE 10.1 The Stevens Taxonomy for Measurement Scales

Scale | Basic Empirical Operations ("rules for numeric assignment") | Mathematical Group Structure | Permissible Statistics (for invariant comparisons)
Nominal | Determination of equality | Permutation group: x′ = f(x), where f(x) is any one-to-one substitution | Number of cases; Mode; Contingency correlation
Ordinal | Determination of greater or less | Isotonic group: x′ = f(x), where f(x) is any monotonic increasing function | Median; Percentiles
Interval | Determination of equality of intervals or differences | General linear group: x′ = ax + b | Mean; Standard deviation; Product-moment correlation
Ratio | Determination of equality of ratios | Similarity group: x′ = ax | Coefficient of variation

Note: "Measurement is the assignment of numbers to objects or events according to rule. The rules and the resulting kinds of scales are tabulated [above]. The basic operations needed to create a given scale are all those listed in the second column, down to and including the operation listed opposite the scale. The third column [mathematical group structure] gives the mathematical transformations that leave the scale form invariant. Any number x on a scale can be replaced by another number x′, where x′ is the function of x listed for a given row. The fourth column lists, cumulatively downward, examples of statistics that show invariance under the transformations of the third column" (Stevens, 1958, 385).


be meaningful to compare the frequency distributions of objects marked by each unique number. An ordinal scale results from assigning numbers to objects to represent a common attribute among the objects that can be ordered. The numbers on an ordinal scale convey information about both equality and order, but differences in magnitudes among the numbers are not readily interpretable. The mathematical group structure associated with an ordinal scale is still quite flexible but more restrictive than that of a nominal scale. Any order-preserving transformation applied to the numbers of the scale will retain the same information about order. Given this, only statistics such as the median and the frequency distributions are sensible ways to compare two or more sets of objects. An interval scale is one for which differences in magnitudes along the scale can be shown to be equal, but for which the choice of a zero point is arbitrary. In this sense, an interval scale is one that allows for a ratio scale of differences (the difference between any two points on the scale can be compared as a ratio to a reference distance) but not a ratio scale of magnitudes. The numbers on an interval scale have a mathematical group structure much more limited than those from an ordinal scale in that only linear transformations of the scale leave the information conveyed about differences between any two points along the scale with the same meaning. So long as only linear transformations are entertained, statistics that are premised on magnitudes, such as the mean, the standard deviation, and the product-moment correlation, remain meaningful. Stevens (1946) described the interval scale as the form of a scale "that is 'quantitative' in the ordinary sense of the word" (679); however, he was never clear about what this "ordinary sense" entailed. Finally, the ratio scale is cast as the hallmark of measurement in the physical sciences.
Under the classical understanding of measurement, a ratio scale is the natural outcome any time that the magnitude of a quantity is expressed as some multiple of a standard unit of the same quantity. All of Campbell's fundamental magnitudes were instances of ratio scales. Numbers on a ratio scale have the most restrictive mathematical group structure. For such scales, only mathematical transformations based on a multiplicative constant are acceptable. In contrast, all types of statistical summaries are fair game. Note that there is an inverse relationship between the restrictiveness of a scale's mathematical group structure and the toolbox of statistical procedures that can be applied for the purpose of making numerical comparisons among objects. The idea of mathematical group structures for numbers, and the distinctions to be made between them in terms of the transformations permissible within a specific group structure, can be traced to Stevens's interactions with the mathematician G. D. Birkhoff. The application of this concept to make distinctions among scales of measurement was a genuinely novel contribution to representational theory. From a mathematical perspective, there was little to criticize, and there was an important message his taxonomy conveyed to physical and


social scientists alike: garbage in, garbage out. The numeric properties of the "variables" in a data set cannot be taken for granted. Without an understanding of how the variables have been measured, and the constraints on their interpretation, the inferences that result when these variables are used as the inputs or outputs of a statistical model have great potential to be misleading. In promoting his taxonomy in the ensuing years, Stevens provided a variety of specific examples of variables that fell under each of the four scale types. The examples contained a mix of variables associated with physical and nonphysical attributes, shown in the separate columns of Table 10.2. The physical example of a nominal scale was generic, characterized as the assignment of numerals to distinguish between different models and classes. The nonphysical example was more specific and involved assigning numbers to football players on the same team. The only role of the scale is to simplify the identification of players, and so long as each player has a unique two-digit number that fits on their uniform, the value of that number is of no consequence. As such, it would be meaningless to make inferences about the order, difference, or ratio between the numbers on any two football uniforms. This example would later prompt an acerbic response from Lord (1953, 1954), who invented a scenario in which the comparison of the means of football uniform numbers would serve to settle a dispute, even though the numbers were only on a nominal scale. Lord would famously declare that "the numbers don't remember where they came from." The debate that ensued was a nice example of how the focus placed on scales and statistics in Stevens's theory could serve to obfuscate the more fundamental questions about the target object and attributes of measurement. Once this is clarified, Lord's entertaining critique loses most of its force (see Zand Scholten & Borsboom [2009] for details).
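The practical force of the "permissible statistics" idea is easy to demonstrate with a short sketch. The following Python fragment uses arbitrary, invented scores (not data from Stevens or Lord) to show that the ordering of two group means can flip under a transformation an ordinal scale permits, while the median commutes with any monotone transformation:

```python
import math
import statistics

# Hypothetical ordinal scores for two groups of objects; the values are
# chosen only to expose the problem.
group_a = [0, 10]    # mean 5.0
group_b = [4, 4]     # mean 4.0 -- group A's mean is larger

# The square root is a monotonic increasing function, so for an ordinal
# scale it is a permissible transformation: every order comparison
# between individual objects is preserved.
ta = [math.sqrt(x) for x in group_a]    # mean ~1.58
tb = [math.sqrt(x) for x in group_b]    # mean 2.0 -- the ordering of means flips

assert statistics.mean(group_a) > statistics.mean(group_b)
assert statistics.mean(ta) < statistics.mean(tb)

# The median, by contrast, commutes with any monotone transformation,
# which is why Stevens lists it among the statistics permissible for
# ordinal scales while reserving the mean for interval scales and above.
scores = [1.0, 2.0, 3.0, 4.0, 10.0]
assert statistics.median([math.sqrt(x) for x in scores]) == math.sqrt(statistics.median(scores))
```

In other words, a mean-based comparison of ordinal scores makes a claim that an admissible rescaling of the same data can reverse, which is exactly the sense in which such a statistic is "impermissible" in Stevens's scheme.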
Stevens's physical examples of ordinal scales were the hardness of minerals and the grades of leather, lumber, and wool. For his nonphysical example, Stevens provided the raw score coming from an intelligence test. This example might have been jarring for psychologists at the time, since by the

TABLE 10.2 Examples Provided by Stevens of Measures With Different Scale Properties

Scale | Physical Examples | Nonphysical Examples
Nominal | Assignment of type or model number to classes | Numbering of football players
Ordinal | Hardness of minerals; Grades of leather, lumber & wool | Intelligence test raw score
Interval | Temperature (Celsius); Position on a line; Calendar time; Potential energy | Intelligence test standard scores
Ratio | Length, density, numerosity, duration, Temperature (Kelvin) | Loudness (sones); Brightness (brils)


mid-20th century, most of them were treating the scores that resulted from the administration of tests and surveys as though they communicated something more than order. It was already common for psychologists to employ descriptive and inferential statistics that involved the computation of means, standard deviations, and Pearson correlation coefficients. Were comparisons among objects on this basis meaningless? Strictly speaking, the answer might have been yes, but Stevens left some wiggle room, writing that this kind of "illegal statisticizing" could be invoked with the "pragmatic sanction" that it leads to useful results. Taking this one step further, Stevens's nonphysical example of a measure on an interval scale was an intelligence test score expressed in standard deviation units, presumably following the normalization approaches that had been introduced and applied by Galton, Thorndike, and Thurstone. According to Stevens (1946), "most psychological measurement aspires to create interval scales, and it sometimes succeeds" (679). I will return to what Stevens seems to have meant by this statement in the next section. When it came to ratio scales, Stevens would provide instances of Campbell's fundamental and derived magnitudes from the physical sciences. Weight, length, and resistance are examples of the former; density, force, and elasticity are examples of the latter. It is really at this point in Stevens (1946), in the presentation of nonphysical examples of ratio scale measures, that Stevens most pointedly parted ways with Campbell. He first rejected additivity as the sufficient condition for fundamental measurement:

Physical addition, even though it is sometimes possible, is not necessarily the basis of all measurement. Too much measuring goes on where resort can never be had to the process of laying things end to end or piling them up in a heap.
(Stevens, 1946, 680)

Next, he put his cards on the table:

Ratio scales of psychological magnitudes are rare but not entirely unknown. The Sone scale discussed by the British committee is an example founded on the deliberate attempt to have human observers judge the loudness ratios of pairs of tones. The judgment of equal intervals had long been established as a legitimate method, and with the work on sensory ratios, started independently in several laboratories, the final step was taken to assign numerals to sensations of loudness in such a way that relations among the sensations are reflected by the ordinary arithmetical relations in the numeral series. As in all measurement, there are limits imposed by error and variability, but within these limits the Sone scale ought properly to be classed as a ratio scale. (Stevens, 1946, 680)


This was a bold claim. Campbell had put forth specific laws that had to hold before it could be argued that an attribute of an object was measurable via numeric assignment. These were laws that applied to experimental observations of objects, and in the case of fundamental magnitudes, the attributes themselves were physically manipulable. Did Stevens have some analog to this kind of experimental manipulation for a psychological attribute? How had Stevens devised new psychophysical methods that produced measurement on a ratio scale without recourse to some indirect demonstration of additivity? This had, after all, been the intent of Fechner's methods, culminating in the concatenation of just noticeable differences (jnds) to form a ratio scale. Yet even the psychologists on the Ferguson Committee had conceded that Fechner was at best only successfully measuring sense differences on an interval scale. What put Stevens's sone scale on the same footing as the other canonical ratio scales of the physical sciences?
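Before turning to how Stevens defended the claim, the interval/ratio distinction it trades on can be checked numerically with the temperature examples from Table 10.2. A minimal sketch (the temperature values themselves are arbitrary):

```python
# Temperatures on the Celsius scale (an interval scale) and their images
# under the permissible affine transformation to Fahrenheit, F = 1.8*C + 32.
c1, c2, c3 = 10.0, 20.0, 40.0
f1, f2, f3 = (1.8 * c + 32 for c in (c1, c2, c3))   # 50.0, 68.0, 104.0

# Ratios of magnitudes are NOT invariant: "20 degrees is twice 10 degrees"
# becomes false after an admissible rescaling, so the claim is meaningless
# on an interval scale.
assert c2 / c1 == 2.0
assert f2 / f1 == 1.36           # not 2.0

# Ratios of DIFFERENCES are invariant under any affine transformation --
# the sense in which an interval scale is "a ratio scale of differences."
assert (c3 - c2) / (c2 - c1) == (f3 - f2) / (f2 - f1) == 2.0

# On the Kelvin scale (a ratio scale) only similarity transformations
# x' = a*x are permissible, and these do preserve ratios of magnitudes.
k1, k2 = 300.0, 600.0
a = 1.8                          # e.g., rescaling to the Rankine scale
assert (a * k2) / (a * k1) == k2 / k1 == 2.0
```

So the question about the sone scale is precisely whether statements of the form "this tone sounds twice as loud as that one" survive every admissible change of units, as they do for Kelvin but not for Celsius.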

10.4.2 Operationalism

One reason that Stevens may have thought it defensible to somewhat blithely assert that intelligence tests were instances of either an ordinal or an interval scale was that he, in contrast to Spearman, did not regard intelligence as an attribute of a human being that exists independently of the test used to measure it. Instead, if students were tested and assigned scores based on the number of items answered correctly and then ranked, this constituted a de facto procedure that resulted in, at a minimum, an ordinal scale. If the scores were transformed according to a different operational procedure (e.g., Thorndike's and Thurstone's normalization approaches), this resulted in an interval scale. In taking this position, Stevens was synthesizing and adapting three developments in the philosophy of science with which he had become familiar as part of his weekly discussions with his local "philosopher's roundtable" at Harvard in the 1930s: behaviorism, operationalism, and logical positivism. Most influential was the newly proposed philosophy of science known as operationalism (also sometimes called operationism; Bridgman, 1927). Bridgman had argued that when one speaks of a scientific attribute to be measured, the attribute is entirely synonymous with the set of specific operations and/or instrumental procedures that are used to measure it. So, as an oft-used example, the attribute of length has no meaning beyond the specific lines and numbers designated on a ruler. Stevens articulated his own version of operationalism in two of his earliest publications (Stevens, 1935a, 1935b), and these laid the philosophical foundation for his definition of measurement and theory of scale types (Hardcastle, 1995). To count as an operational procedure for psychological measurement, two things had to be present. First, all psychological concepts involved would have to be defined solely in terms of the concrete behaviors and operations


that humans routinely execute. Second, the purpose of the procedure must be to elicit some anticipated human act of differential response (i.e., discrimination). As applied to the context of the psychophysical research that Stevens was just starting to conduct on hearing in the 1930s, operationalism implied a belief that the measurement of auditory sensation is made meaningful through experimental procedures that elicit concrete behaviors from exposure to varying physical stimulus magnitudes. Furthermore, these behaviors were to come in the form of judgments when making comparisons about these stimulus magnitudes. While both Stevens and Campbell appeared to agree that measurement involved numerical representation, the things eligible for representation under their two theories were different. For Campbell, the impetus for measurement was the desire to represent facts and conventions about an attribute through the relations between objects that could be experimentally manipulated to have more or less of the attribute. The basis for the manipulation was a theoretical understanding of a prospective scientific law. Even when the attribute itself was not physically manipulable (as in the case of a derived magnitude), it was one whose existence was acknowledged as independent of efforts to measure it. In Stevens's operational measurement, since psychological attributes are not directly manipulable, what we can know about them is limited by the discriminatory responses we observe, and these depend on the operational procedure we use to elicit them. In this, Stevens's argument was not so dissimilar to Fechner's observation that all measurement, whether of extensive or intensive attributes, requires the specification of a measurement formula that relates the attribute to spatial extension. But Stevens took this much further in rejecting all metaphysical speculation about underlying psychological attributes.
Fechner had developed procedures for outer psychophysics in the hope of eventually discovering the process of inner psychophysics. In contrast, to the extent that inner psychophysics was inaccessible to observation, Stevens saw little point in contemplating it. A problem with operationalism, at least in its strictest rendition, and one of the reasons it fell out of favor as a philosophy of science (see Chang, 2019), is that in the absence of any way to independently observe some attribute of interest, there will be as many measures of the attribute as there are unique operational procedures being applied. This would seem to move us backward to a time when all measurement was a local affair, contingent on decisions about standard units that were often the province of the ruling class. Stevens did not view the potential for a proliferation of operational measures of the same attribute as a problem, primarily, it seems, because a key tenet of his brand of operationalism was that "science is knowledge agreed upon by members of society" and agreement was fostered through critique and debate. In this vision, when multiple operational measurements of the same property or attribute are available, the one with the better


argument behind it would win out over time. Nor is it entirely clear that Stevens rejected the idea that there was, in fact, an attribute of an object or event (e.g., the sensation of loudness) that was the common focus of multiple operational measurement procedures. Campbell's objection to the possibility of psychological measurement was rooted in the need to demonstrate both order and physical additivity. Stevens's counter to this was that both order and additivity could be operationally manufactured. For example, the initial psychophysical experiments that led to the sone scale involved the use of the method of fractionation. Subjects were presented with different tones of constant frequency that varied in their intensity, and then for each tone they would be asked to manipulate the intensity of the tone until they perceived it to be half as loud (Stevens, 1936). As Stevens (1936) would describe it,

[w]ith such a scale the operation of addition consists of changing the stimulus until the observer gives a particular response which indicates that a given relation of magnitudes has been achieved. . . . Obviously, in the application of this criterion we are limited by our ability to devise operations for the determination of the fractional magnitudes of sensation. (407)

Stevens (1951) offered another example, this time in the context of measuring weight, whereby a ratio scale could be similarly manufactured, and this time without recourse to direct physical addition. In this instance he invented a scenario in which the objects to be weighed would explode if they came into contact, thus precluding their combination in the same pan of a balance beam. Given the availability of three different balance beams, each of which could be used to provide nonnumerical insights about equality, order, differences, and ratios in succession, one could build up to a ratio scale by first arranging one set of objects by order, then another by difference, and finally another by ratio.
In this manner, it would be possible to demonstrate additivity without ever physically combining two or more objects. By implication, according to Stevens's theory, if an attribute could be measured on a ratio scale, it must also be the case that equal intervals along the scale convey the same information about differences in the magnitude of the attribute. It also follows that if one could build up to a ratio scale, then more generally it must always be the case that one can also build down to an interval scale. This argument would eventually cause Stevens some problems. But before we get to that, let's take a closer look at how Stevens went about enacting his operational approach to measurement in psychophysics.

10.5 The Process of Operational Measurement

10.5.1 The Method of Magnitude Estimation

At its origins, psychophysics had been premised on a theory that the measurement of sensory intensity could only be accomplished indirectly. For example, Fechner's use of the constant method (see Chapter 2) involved what was essentially a two-step process. In a first step, one would collect the results from an experiment in which a subject was asked to discriminate between two distinct magnitudes of a physical stimulus. In a second step, these results would be analyzed using a normal cumulative distribution function to locate thresholds on the physical continuum that corresponded to jnds, and these jnds would become the units of a measurement scale for sensation. Thurstone (see Chapter 9) had generalized this classical approach in a number of ways, most notably by applying it to situations in which the stimuli were not physical magnitudes but were plausibly orderable. If viewed through the lens of Stevens's theory of scale types, what Fechner had attempted to establish with jnd units would correspond to a ratio scale, while what Thurstone had proposed with his discriminal dispersion units was an interval scale.5 In contrast, Stevens came to view the operational process of measurement in psychophysics as the mapping or equating of stimuli that a person observes onto an internal continuum defined by the number system. Because Stevens took this to be something that was ingrained in human beings by an early age, he believed one could skip the unnecessary step of inferring numeric magnitude from judgments about order and instead simply ask subjects to produce these estimates directly. Table 10.3 summarizes the different operational procedures that Stevens would either describe or apply as methods for constructing interval or ratio scales for the subjective sensations thought to be produced by exposure to physical stimuli. Stevens had initially developed the sone scale in 1938 using the "ratio production" approach he called fractionation.
However, by the mid-1950s, he was convinced he had developed a better operational approach, one he referred to as the method of magnitude estimation. In Stevens's experiments, the physical stimulus of interest was the intensity of a tone with a fixed frequency, whereby intensity is determined by varying sound energy, which is measurable as sound pressure in decibels (sound pressure is a logarithmic function of sound energy). A subject in the experiment is given headphones and sits in front of a console that has two switches. When the left switch is turned on, a standard tone is passed into the subject's headphones, and the intensity of this tone remains the same throughout the experiment. When the right switch is turned on, a variable tone is passed through the subject's headphones, and the intensity of this tone is under the control of the experimenter. The crux of the experiment was to ask subjects to "directly" estimate the magnitude of the variable sound intensity, as they perceived it, on a scale in which only a reference unit (the standard, e.g., tone of the

TABLE 10.3 Direct Response Methods for Operational Measurement

Operational Procedure | Description in Context of Sensation of Loudness

To Create an Interval Scale
Method of Bisection: Observer hears two tones that differ in their intensity (i.e., sound energy) and is asked to locate a third tone that is in between the two, creating two intervals that are equal in their perceived change in loudness.
Method of Equisection: Same as above, but observer is asked to locate multiple intervals between the two tones.
Categorization: Observer is exposed to a random series of tones that vary in intensity and asked to place them into discrete, predetermined numeric categories.

To Create a Ratio Scale
Ratio Estimation: Observer is exposed to pairs of tones that differ in their intensity and is asked to estimate the ratio of the tone with the higher perceived intensity to that of the lower perceived intensity.
Ratio Production: Fractionation: Observer hears a tone and then is asked to produce a second tone that is some fraction of the first tone (1/2 is the most common option). Multiplication: Same as above, but observer is asked to produce a second tone that is some designated multiple of the first (twice is the most common option).
Magnitude Estimation: Standard Fixed by Experimenter: Observer is exposed to a tone that is designated as the standard and identified with a fixed number (e.g., 10). Observer is then exposed to a series of tones of random intensity and asked to assign the tones numbers relative to the standard. Standard Picked by Observer: Observer decides on the tone to use as the standard as well as the number associated with the standard. The rest of the procedure stays the same.
Magnitude Production: Inverse of magnitude estimation. Observer is given magnitude values and then asked to produce tones that are perceived to be equal to the magnitude.

standard stimulus triggered by the left switch) had been established in advance. The instructions given to the subject are reproduced here:

Instructions. The left key presents the standard tone and the right key presents the variable. We are going to call the loudness of the standard 10 and your task is to estimate the loudness of the variable. In other words, the question is: if the standard is called 10, what would you call the variable? Use whatever numbers seem to you appropriate—fractions, decimals, or whole numbers. For example, if the variable sounds 7 times as loud as the standard, say 70. If it sounds one fifth as loud, say 2; if a twentieth as loud, say 0.5, etc.

• Try not to worry about being consistent; try to give the appropriate number to each tone regardless of what you may have called some previous stimulus.
• Press the 'standard' key for 1 or 2 sec. and listen carefully. Then press the 'variable' for 1 or 2 sec. and make your judgment. You may repeat this process if you care to before deciding on your estimate. (Stevens, 1956, 3)

Stevens would typically conduct an experiment such as this with between 10 and 20 subjects. The variable tones would be randomly presented (without replacement) both within and across subjects. Stevens would also typically have subjects repeat the experiment two times using different tones as the standard tone but with the rest of the instructions kept exactly the same. The results from one implementation of this experiment as reported by Stevens (1956) are displayed in Figure 10.2, which plots the sound pressure levels of the variable stimulus tones on the x-axis against the numeric magnitude estimates made by

[Figure 10.2: magnitude estimates (medians shown as circles and squares, with interquartile ranges) from 18 observers, plotted against sound pressure level in decibels for two different standards; equation of the fitted lines: L = kI^0.3.]

FIGURE 10.2 Results From Direct Magnitude Estimation of Loudness Experiment (Stevens, 1956). From the American Journal of Psychology

Source: © 1956 by the Board of Trustees of the University of Illinois. Used with permission of the University of Illinois Press.


18 observers on the y-axis. The circles and squares represent the median values, while the lines extending from them represent the interquartile range; each line is intended to represent the best fit for magnitude estimates relative to two different source stimuli (80 and 90 decibels) that were given to observers as the standard (and assigned a numeric value of 10). Between about 1954 and 1966, Stevens and his students conducted experiments using a variety of direct response scaling techniques not just for loudness, but also for the subjective senses of duration, brightness, vibration, and electric shock, among others. In these experiments, he typically went to great lengths to control for sources that he worried might exert biases on the responses of his subjects. For example, in Stevens (1956), he lists nine different factors he had discovered to be essential to the success of his direct magnitude estimation approach.6 He also considered many different variants of direct response scaling. In his original work that led to the Sone scale using the method of fractionation, subjects had been asked either to produce designated ratios by varying the intensity of one tone by a fraction of another or to estimate designated ratios when presented with tones of varying intensities. In his later experiments with variants of direct magnitude estimation, he eventually developed an approach in which subjects were free to select their own tone as a standard, as well as the number they could assign to this standard (Stevens, 1957).
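The analysis behind results like those in Figure 10.2 amounts to fitting a straight line in log–log coordinates, since a power law becomes linear after taking logarithms. The sketch below simulates hypothetical magnitude estimates (invented data, not Stevens's) from a power law with exponent 0.3 plus multiplicative noise, and recovers the exponent by ordinary least squares:

```python
import math
import random

random.seed(0)

# Hypothetical magnitude-estimation data: responses follow the power law
# L = k * I**beta with multiplicative lognormal noise.  beta = 0.3 is the
# exponent Stevens reported for loudness; everything else is invented.
beta_true, k = 0.3, 1.0
spl_db = [40, 50, 60, 70, 80, 90, 100, 110]        # sound pressure levels
intensity = [10 ** (db / 10) for db in spl_db]     # relative sound energy

estimates = [k * i ** beta_true * math.exp(random.gauss(0, 0.1))
             for i in intensity]

# Because log L = log k + beta * log I, ordinary least squares in
# log-log coordinates recovers the exponent -- the slope of the straight
# lines in plots like Figure 10.2.
x = [math.log(i) for i in intensity]
y = [math.log(e) for e in estimates]
n = len(x)
mx, my = sum(x) / n, sum(y) / n
beta_hat = (sum((a - mx) * (b - my) for a, b in zip(x, y))
            / sum((a - mx) ** 2 for a in x))

print(f"recovered exponent: {beta_hat:.3f}")       # close to 0.3
```

A straight line on such a plot, with slope near 0.3, is the signature Stevens read as evidence for the power law discussed in the next section.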

10.5.2 The Power Law

Much like Fechner, Stevens took results such as the ones depicted in Figure 10.2 as evidence of a law-like relationship between stimulus and response. However, in place of the logarithmic relationship Fechner had derived, Stevens proposed a more general power law of the form

Y = kX^β,  (10.9)

where k and X are defined as in Fechner's Law as a scaling constant and a physical magnitude, respectively, and Y represents a behavioral response in numeric code. The coefficient β represents the nature of the proportional relationship between stimulus and response. When β < 1 the curve relating stimulus to response will be concave, such that increasingly large changes in X are necessary to produce the same change in Y. As β approaches 1 the curve becomes increasingly linear, and when β > 1 it becomes convex, such that smaller and smaller changes in X will produce the same incremental change in Y. While Fechner's law could be derived from the assumption that a unit change in Y (i.e., the jnd) was a constant proportion of X, Stevens would argue that the more general principle entailed by the power law was that equal stimulus ratios tend to produce equal sensation ratios, not equal sensation differences. This is easy enough to illustrate by considering the two equal 2:1 ratios on the sound energy scale of 20 to 10 and 40 to 20. Setting k = 1 for convenience and letting β = .3 (the value Stevens typically found in his auditory experiments), the resulting sensation ratios are 20^0.3/10^0.3 = 40^0.3/20^0.3 ≈ 1.23. In contrast, if the relationship between stimulus magnitude and sensation magnitude followed Fechner's Law, log(20)/log(10) ≠ log(40)/log(20). Hence, under Fechner's law, equal stimulus ratios would produce different sensation ratios.

The validity of the power law in the context of the operational procedure of direct scaling formed the crux of Stevens's argument that it was possible to measure psychological attributes on a ratio scale. In the example shown earlier in Figure 10.2 for the sensory perception of loudness, what was being plotted was a logarithmic transformation of sound energy onto the decibel scale, which, relative to the underlying power law formulation of Equation 10.9, leads to

log Y = log k + β log X,  (10.10)

such that the slope of the line is equal to the exponent of the power function. Although Stevens would often plot results using Equation 10.10 primarily for convenience (since it is easier to visually distinguish differences in the slope of a linear function relative to the changing slope of a nonlinear function), this particular functional relationship expresses differences in numerical magnitudes on an interval scale. However, because the scale derives from an operational procedure in which subjects are asked to estimate ratios (how many times bigger or smaller a variable tone sounds relative to a standard tone) as opposed to differences (how much bigger or smaller a variable tone sounds compared to a standard tone), Stevens (1957) came to refer to this as a "logarithmic interval scale." By Stevens's logic, if it were true that the subjective magnitudes of human subjects followed a power function of the stimulus magnitudes to which they were exposed, then if stimulus magnitudes were on a ratio scale, so were subjective magnitudes. Because Stevens believed he had amassed evidence that the power law held whenever magnitude methods of scaling were employed, he concluded that one could think of people as instances of measuring devices. Once "calibrated" (given careful sets of instructions in a controlled experiment), they could be used to transduce physical stimuli onto a psychologically relevant scale.
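The contrast between the power law and Fechner's logarithmic law can be checked numerically. A minimal sketch, with k and β set to the illustrative values used in the text:

```python
import math

def power_law(x, k=1.0, beta=0.3):
    """Stevens-style power law: Y = k * X**beta (illustrative parameters)."""
    return k * x ** beta

# Equal 2:1 stimulus ratios yield equal sensation ratios under the power law.
r1 = power_law(20) / power_law(10)
r2 = power_law(40) / power_law(20)
print(round(r1, 2), round(r2, 2))  # 1.23 1.23

# Under a Fechner-style logarithmic law, the same stimulus ratios yield
# different sensation ratios.
f_ratio_1 = math.log(20) / math.log(10)
f_ratio_2 = math.log(40) / math.log(20)
print(round(f_ratio_1, 2), round(f_ratio_2, 2))  # 1.3 1.23

# In log-log coordinates (Equation 10.10), the power law is a straight
# line whose slope recovers the exponent beta.
slope = (math.log(power_law(40)) - math.log(power_law(10))) / (
    math.log(40) - math.log(10)
)
print(round(slope, 2))  # 0.3
```

The last computation is why Stevens could read the exponent directly off the linear fits in Figure 10.2.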

10.5.3 Cross-Modality Matching

By the mid-1950s, Stevens had begun to place greater emphasis on what he took to be the universality of the power function relationship between direct magnitude estimates and physical stimuli across a wide range of different sensory modalities. Although it was not always clear how many times and in what exact sense his experiments with different modalities were being reproduced, he took the stability of the exponents estimated for many of these modalities to be evidence in support of the law's invariance. His most interesting argument along these lines came from cross-modality matching studies (Stevens, 1959, 1966a; Stevens, Mack, & Stevens, 1960). In such studies, subjects would be exposed to a stimulus magnitude such as sound pressure, but then asked to match their perception of the magnitude of that stimulus with respect to a different physical stimulus, such as light energy, vibration, force, and the like. Stevens's logic was that if two different physical stimulus variables, X1 and X2, each follow a power function, each with a specific exponent and scaling factor with respect to the subjective magnitudes Y1 and Y2, then

Y1 = k1 X1^β1,  (10.11)

Y2 = k2 X2^β2.  (10.12)

Asking subjects to make judgments about the magnitude of X1 in terms of X2 was tantamount to the following substitution:

k2 X2^β2 = k1 X1^β1.  (10.13)

Taking the log of both sides gives log(k2) + β2 log(X2) = log(k1) + β1 log(X1). From this it follows that if Equations 10.11 and 10.12 hold, and if the parameter values for each equation are known (i.e., having been previously estimated through direct scaling procedures), then if a cross-modality matching experiment is performed in which the variable X1 is the stimulus (e.g., sound, manipulated by the experimenter) and the variable X2 is the response (e.g., brightness, manipulated by the subject), the result should have a predictable functional form:

log(X2) = [log(k1) − log(k2)]/β2 + (β1/β2) log(X1),  (10.14)

which is the equation of a straight line, with an intercept of [log(k1) − log(k2)]/β2 and a slope of β1/β2. In performing a number of these experiments, Stevens was able to compare the observed slope to the slope that was expected based on previously estimated values and found that they tended to correspond. He took this as his most compelling evidence that he had discovered an invariant lawful relationship between stimulus and sensation.
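The slope prediction in Equation 10.14 can be sketched numerically. In this hypothetical example the exponents and scaling constants are illustrative values, not Stevens's published estimates:

```python
import math

# Hypothetical power-law parameters for two modalities (illustrative only).
k1, b1 = 1.0, 0.6   # modality 1, the stimulus continuum
k2, b2 = 1.0, 0.3   # modality 2, the response continuum

def match(x1):
    """Ideal cross-modality match: x2 such that k2 * x2**b2 == k1 * x1**b1."""
    return ((k1 / k2) * x1 ** b1) ** (1.0 / b2)

# Regress log(x2) on log(x1) across a range of stimulus magnitudes.
xs = [10.0, 20.0, 40.0, 80.0, 160.0]
lx = [math.log(x) for x in xs]
ly = [math.log(match(x)) for x in xs]
mx, my = sum(lx) / len(lx), sum(ly) / len(ly)
slope = sum((a - mx) * (b - my) for a, b in zip(lx, ly)) / sum(
    (a - mx) ** 2 for a in lx
)
print(round(slope, 3))  # 2.0, i.e., b1 / b2
```

Comparing an observed matching slope against this predicted ratio of previously estimated exponents is exactly the check Stevens reported performing.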


10.5.4 The Role of Argument and Pragmatism

Stevens's argument that a psychological attribute (subjective sensation) could be measured on a ratio scale rested on three pillars. The first pillar was the "discovery" of an operational procedure (direct magnitude estimation) for which the numerical estimates of sensation elicited from subjects could be mathematically related to objective physical magnitudes. The second pillar was the demonstration that this mathematical relationship, a power function, was reproducible for estimates across experiments within the same modality, and predictable for estimates across modalities within the same experiment. The first pillar was the postulation of a psychophysical law; the second pillar was proof that the law was invariant. The third pillar, which to Stevens was clearly just as important as the first two, was the ability to make a compelling argument that the measure was in some sense better than competing alternatives. Figure 10.3 is typical of the sort of comparison Stevens would make between three competing approaches to psychophysical measurement: (1) category scales, (2) ratio scales, and (3) discriminability scales (e.g., Stevens & Galanter, 1957). A category scale is essentially something akin to the Likert approach, in which

[Figure: three curves for apparent duration plotted against duration in seconds, one each for the jnd scale (number of jnds), the category scale, and the magnitude estimation scale.]

FIGURE 10.3 Jnd Scale, Category Scale, and Magnitude Estimation Scale for Apparent Duration. Source: Stevens (1959).


observers are asked to locate stimuli in predetermined ordinal categories (see the third row of Table 10.3). The ratio scales corresponded to Stevens's direct scaling approaches (e.g., the last six rows of Table 10.3). Finally, discriminability scales were premised on the classical methods of psychophysical measurement using jnds as originally proposed by Fechner and generalized by Thurstone. In Figure 10.3 the sensory modality in question was the subjective perception of the duration of time. Stevens would make similar comparisons for the perception of the apparent thickness of an object, vibration frequency, and white noise. In each case, it always appeared that the category and jnd scales were concave relative to the stimulus magnitudes, while the relationship between the ratio scales based on direct magnitude estimation and the stimulus magnitudes was usually convex. Which scale offered the more valid measure of sensation?

Since the three kinds of scales are nonlinearly related, it seems clear that they must measure different things. Each is probably a valid scale of something. From our present point of view the interesting question is which, if any, of these scales seems to measure what we would like to mean by subjective magnitude? Which scale best describes how a sensory impression grows with stimulus input? . . . For a simple continuum like apparent duration, the answer is probably not too difficult. The results of magnitude estimation show that a stimulus lasting 2 sec seems psychologically about half as long as a stimulus lasting 4 sec. This seems reasonable. But the jnd scale suggests that a stimulus lasting less than 1 sec should appear to be half as long as one lasting 4 sec. This seems less reasonable. (Stevens, 1959, 998)

What we can see in Stevens's argument here are several paradigmatic features of his operational approach to measurement.
On one hand, Stevens had convinced himself, through successive experimentation and investigation,7 that he had homed in on the optimal procedure for realizing a psychophysical law: that any given stimulus X could be related to behavioral response Y according to a power function with exponent β. However, if other operational procedures led to curves with different values of β, or even to a logarithmic relationship (which itself can often be well approximated by a power function), and if these functions also demonstrated invariance when replicated, then they were equally valid operational measures. Each simply measured something different, even though they were premised on the same stimulus. Stevens's magnitude estimation procedure was, he would willingly concede, not the only way to operationally measure some targeted sensory impression, but it was the better way to measure, at least in part because it produced results that were more "reasonable." The pragmatic extension of this logic was that the better measure would be the one that proved most useful in practice.

10.6 Criticisms

10.6.1 A Logical Inconsistency and an Operational Problem

Notwithstanding Stevens's confident pronouncement in Stevens (1946) that his sone scale of loudness offered measurement on a ratio scale, by the early 1950s problems with the validity of the sone had begun to emerge. The most prominent of these came from a series of experiments and studies led by a former student of Stevens, W. R. Garner (Garner & Hake, 1951; Garner, 1954, 1958, 1959). Garner had demonstrated that different methods of direct response scaling led to scales of loudness that had noticeably different functional relationships with sound energy. Garner was one of the first to point out a logical inconsistency in Stevens's methods of establishing a ratio scale, one that had been foreshadowed in his claim that a ratio scale could be built up without the need for physical addition (Stevens, 1951). Consider a scale of subjective loudness that has been established by asking subjects to produce either ratios (as in the method of fractionation) or magnitudes (as in the magnitude estimation approach with fixed standard described in Section 10.5.1). If it is, in fact, true that equal physical magnitude ratios produce equal subjective sense ratios, it should also be true that equal physical magnitude differences produce equal subjective sense differences. This follows directly from the logic of the scale taxonomy Stevens had proposed, since it is claimed that a ratio scale inherits all the relational distinctions of the less restrictive scale types. If a sone is a unit of a ratio scale, then it must be the case that the difference between 5 and 10 sones has the same meaning in terms of subjective loudness as the difference between 10 and 15 sones. And this was a testable proposition. For example, subjects could be exposed to pairs of tones that represented equal intervals of sound pressure (e.g., 10 and 14 dB, 14 and 18 dB, etc.) and then asked to identify a tone with a loudness level in between the two tones (the method of "bisection").
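The testable proposition can be made concrete with a short sketch. Assuming, purely for illustration, that loudness follows a power law L = k·I^β, a bisection experiment has a definite prediction for the tone that should be judged midway between two others:

```python
# Illustrative sketch: if loudness L = k * I**beta is a genuine ratio
# (hence interval) scale, the subjectively "halfway" tone between two
# intensities is pinned down, and that prediction can be compared with
# what subjects actually produce in a bisection experiment.
k, beta = 1.0, 0.3  # hypothetical parameter values

def loudness(intensity):
    return k * intensity ** beta

def predicted_bisection(i_lo, i_hi):
    """Intensity whose loudness is midway between loudness(i_lo) and loudness(i_hi)."""
    target = (loudness(i_lo) + loudness(i_hi)) / 2.0
    return (target / k) ** (1.0 / beta)

i_mid = predicted_bisection(10.0, 100.0)
print(round(i_mid, 1))  # ≈ 38.4
```

Discrepancies between such predictions and subjects' actual bisection settings are precisely what Garner's experiments turned up.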
Other variants of this approach instructed subjects to identify multiple equal intervals for a series of tones (the method of "equisection"). However, in conducting these experiments, Garner and others discovered that scales constructed using methods that required subjects to make judgments about equal intervals disagreed, often substantially, with those requiring subjects to make judgments about equal ratios. Garner (1958) would ultimately argue that indirect methods of psychophysical scaling based on comparative judgments were preferable to Stevens's direct methods because they led to more reliable results, even if they only produced a measure with interval scale properties. Again, this is precisely the scenario that one might anticipate from a theory of measurement premised on operationalism. Since there is no underlying attribute independent of the operational procedure, two measures of loudness based on two different operational procedures need not agree. Each one operationalizes a different measure; hence, in arguing for the validity of one over the other, one must rely on external criteria, such as reliability or correlations with other variables, to make a good case. But the problem here was that the two operational procedures being compared were logically related. If a first procedure were successful in generating a ratio scale by asking subjects to invoke ratio judgments, the intervals within the resulting scale should agree with ones that would have resulted from an alternate procedure that asked subjects to make interval judgments. Garner and others had not only shown that psychophysical scaling using the methods of bisection and fractionation led to inconsistent results but also that the methods were easy to bias with slight changes in the setup of the experiment and that they could produce results that varied dramatically from one human subject to another. In one line of response, Stevens would essentially concede that these biases were present in the experimental designs of both these methods but argue that he had found a better experimental procedure for creating a ratio scale, namely, his method of magnitude estimation (Stevens, 1959).8 Having found that ratio production using the method of fractionation was problematic, Garner (1958) and Warren and Warren (1963) had criticized Stevens for treating the sone scale (initially created using the fractionation method) as the gold standard against which all other scaling approaches were to be compared. In pointing to the newly devised method of magnitude estimation as an improvement over the method of fractionation, Stevens had, in effect, moved the goalposts. He had discovered a new gold standard, and hence, the critiques that pertained to the old one no longer applied.9

10.6.2 An Axiomatic Critique

For all the political and cultural turbulence experienced in the United States during the 1960s, the decade was also a time of considerable theoretical and methodological innovation in quantitative psychology. In particular, the work of Patrick Suppes, Duncan Luce, David Krantz, and Amos Tversky led to the development of an axiomatic approach to representational measurement. A hallmark of this approach was the use of set theory and formal proofs to establish the connection between an empirical relation system, on one hand, and a numeric representation system, on the other. The logical framework that this approach established led to a number of formal expositions and critiques of Stevens's method of direct magnitude estimation. The critiques of Stevens's approach to measurement that emerged were respectful of the insights provided by his body of experimental research but came to more modest conclusions about the generalizations concerning measurement that this research supported (Luce & Galanter, 1963; Luce, 1972). In particular, Krantz et al. (1971) and Krantz (1972) would frame Stevens's psychophysical research as having established the makings of testable hypotheses, but ones that had not yet been explicated in a manner that allowed for a systematic test.


Of all these critiques, my favorite is the one by Shepard (1981). It is well worth reviewing some of his central points here, as none of them hinge on questions that could be raised about the quality and replicability of Stevens's experimental findings. Here we can accept as a starting point that when an experiment is conducted in which subjects are exposed to a physical stimulus, X, and asked to produce associated numerical magnitude estimates, Y, the averaged results will fit the power function Y = kX^β, and the value estimated for β is more or less invariant to the choice of subjects and the specific stimulus values to which they are exposed in the experiment. Now, granting all this to be true, Shepard poses three questions: (a) What is being measured? (b) Has a psychophysical law been established? and (c) What type of scale is being constructed? Stevens would have answered such questions with a single sentence: In psychophysics, we measure subjective magnitudes on a ratio scale because physical continua are related to perceptual continua according to the simple law that equal stimulus ratios generate equal subjective ratios. Shepard came to an entirely different conclusion, arguing that it was only really sensible to speak of measuring the "transduction" parameter β, that it was premature to speculate about the functional form of a psychophysical "law," and that to the extent that a scale was being created for the value of Y, the scale was at best ordinal, as opposed to interval or ratio. Importantly, Shepard would argue that the theory underlying Stevens's power function was underspecified, in that, at best, it only related an observed input, a physical magnitude, to an observed output, a number. While the status of the physical magnitude as a number on a ratio scale was well established, the status of the numbers reported by the subject (or subjects) participating in a magnitude estimation experiment was not.
That the median numeric estimate corresponding to each of a set of stimulus magnitudes can be predicted using a power function is an interesting finding, but it establishes nothing about the scale properties of the subjective sensory magnitudes that are of interest. Stevens had taken it on face value that if a subject assigns, say, the number 30 to one stimulus and the number 60 to a second, the second stimulus stood in a ratio relationship to the first as twice as much. But beyond the fact that they had been instructed to assign numbers in this manner, there was no way to verify that the difference between 30 and 60 conveyed a difference in magnitude as opposed to a difference in order. As Shepard (1981) would write,

[w]ithout a theory, then, how can we assume that the numbers proffered by a subject—any more than the numbers indicated on the arbitrary scale of the thermoscope—are proportional to any underlying quantity? It serves no purpose, here, to insist that the subject is expressly instructed to give a number that is proportional to the underlying psychological magnitude. For, in the absence of any independent access to that psychological magnitude, how could we be certain that the subject is following our instruction? How, indeed, could we ever have taught the subject to make such reports correctly in the first place? Surely, it would be a risky business to assume, just because an instruction was issued, that it was followed. (What if we were to instruct the subject to repeat back a 24-digit number, or to report the direction of a weak magnetic field?) (30)

Although Shepard (and Treisman, 1964, before him) did not frame it in these terms, to a great extent the point being made was that the input–output conceptualization of the psychophysical process was indeterminate without the specification of a psychological attribute, some underlying psychological magnitude, that could be used to model the mental process that leads to a numeric outcome. The problem was that once this was granted, the functional form of the psychophysical law was no longer as obvious as Stevens wished to believe. Shepard's argument recalls Fechner's original distinction between inner and outer psychophysics. That is, the psychophysical process in a magnitude estimation experiment can be represented by the following two equations that were first discussed in Chapter 2:

θ = f1(X),  (10.15)

Y = f2(θ),  (10.16)

where X and Y are the observable inputs and outputs of a psychophysical experiment involving direct magnitude estimation and θ is a latent psychological magnitude. Within the context of Shepard's critique, Equation 10.15 stipulates an initial mental process by which a subject maps a physical magnitude onto a psychological magnitude, and Equation 10.16 stipulates a second mental process in which the subject maps this psychological magnitude onto a number. The forms of the two functions involved in this mapping, f1 and f2, are left unspecified. Now θ is unobservable, but if we substitute Equation 10.15 into Equation 10.16, we get

Y = f3(X) = f2{f1(X)}.  (10.17)

It follows, then, that what we can observe after a psychophysical experiment is the combined result, f3, of two different functional mappings, f1 and f2. If Stevens's reported results are taken as accurate, and f3 is a power function, then a possible explanation for this would be that f1 and f2 are also power functions. But unless f2 has an exponent of 1, it would be a mistake to interpret the exponent for the f3 power function as characterizing the sensation or subjective perception of a physical stimulus. Instead, it jointly characterizes both sensation and the method, specific to the experiment, that a subject uses to transduce the stimulus into a numeric value. It would be possible, for example, that f1 represents a logarithmic function while f2 is an exponential function. One can argue, as Stevens surely would have, that two power functions combining to form a third represents the most parsimonious and plausible explanation, but such an argument cannot be settled solely on empirical grounds.

As shown previously, Stevens eventually argued that the strongest case for the validity of his power law came from the results of cross-modality matching studies. Shepard would apply a similar logic to show that even in this context there was an inherent indeterminacy in the resulting functional form. In a cross-modality matching study, only the stimulus input to the psychological magnitude stage of the psychophysical process is relevant. Consider two different physical stimuli, X and X′, the first of which (e.g., sound pressure) is to be matched to the second (e.g., force). Each stimulus is mapped to a psychological magnitude by

θ = f(X),  (10.18)

θ′ = f′(X′).  (10.19)

A benefit of this approach is that it avoids the need to represent the process from psychological magnitude to numerical magnitude. The instruction to a subject in a cross-modality matching study is to match θ with θ′, which, based on Equations 10.18 and 10.19, is equivalent to asking for f(X) = f′(X′). It follows that

X = f^(−1){f′(X′)},  (10.20)

X′ = f′^(−1){f(X)}.  (10.21)

If two power functions are being matched, then

αX^β = α′X′^β′.  (10.22)

But this is mathematically equivalent to

log α + β log X = log α′ + β′ log X′.  (10.23)

The implication is that if both physical stimuli invoke the same functional law in the way they get mapped to a psychological magnitude, then the actual form of the law remains arbitrary: in this case, it could be either the power function (θ = αX^β) or the logarithmic function (θ = log α + β log X). Shepard and others took greatest issue with a lack of conceptual clarity regarding the theoretical primitives that figured into Stevens's psychophysical law. If measurement was the process of matching objects or events to numbers, as Stevens claimed, it was not always obvious what constituted a unique "object" or "event." Hence, if there was a scientific consensus being reached by the early 1970s, it was that Stevens had presented experimental findings that lent themselves to interesting hypotheses about the process of sensation, yet these findings fell short of establishing a new and comprehensive paradigm for psychological measurement (Luce, 1972; Krantz, 1972; Krantz et al., 1971).
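Shepard's indeterminacy argument can be illustrated numerically. The sketch below, with illustrative parameter values, shows two very different decompositions of the inner and outer mappings producing identical observable responses:

```python
import math

BETA = 0.3  # illustrative exponent for the observable mapping f3

# Decomposition (a): both stages are power functions.
def f1_power(x):
    return x ** 0.6               # stimulus -> psychological magnitude

def f2_power(theta):
    return theta ** (BETA / 0.6)  # psychological magnitude -> reported number

# Decomposition (b): a logarithmic stage followed by an exponential stage.
def f1_log(x):
    return math.log(x)

def f2_exp(theta):
    return math.exp(BETA * theta)

# Both compositions reproduce the same observable f3(X) = X**BETA, so the
# observed data alone cannot distinguish between them.
for x in (10.0, 20.0, 40.0):
    assert abs(f2_power(f1_power(x)) - x ** BETA) < 1e-9
    assert abs(f2_exp(f1_log(x)) - x ** BETA) < 1e-9
print("indistinguishable")
```

Since exp(β·log X) = X^β exactly, a logarithmic inner psychophysics is observationally equivalent to a power-function one here, which is the crux of the indeterminacy.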

10.6.3 Michell's Realist Critique

The most comprehensive critique of Stevens's theory of measurement can be found in Joel Michell's Measurement in Psychology. Stevens is the archvillain in Michell's critical history, and the gist of Michell's very compelling argument goes something like this. Up through the mid-20th century, the claim that psychological attributes could be measured (the "measurability thesis") was a claim that would need to be defended with at least a tacit understanding of measurement in its classical sense. Under the classical definition, measurement is no more or less than the assessment of the quantitative structure of some attribute of interest. If the attribute is quantitative, it is measurable; otherwise, it is not. The quantity objection held that psychological attributes, because they lack homogeneity and are therefore not additive, do not have quantitative structure. This objection, first articulated in response to Fechner's psychophysics, posed an obstacle to the efforts of Galton, Spearman, Binet, and Thurstone to claim that they could measure psychological attributes, and while few of them acknowledged this explicitly, it was still the case that their aspirations for the measurement of human attributes were to meet or at least approximate the same requirements as the measurement of physical attributes. In the telling of both Michell (1999) and McGrane (2015), the conflict between the measurability thesis and the quantity objection came to an important crossroads with the convening of the Ferguson Committee between 1932 and 1940. There was a choice between two paths. On path A, the challenge of overcoming the quantity objection would be embraced, and untapped possibilities related to the scientific study of mental attributes would be uncovered. On path B, the quantity objection would be obscured by offering up a broadened definition of measurement, thereby shifting the focus away from the classical conception of quantity.
Stevens, in this sense, can be described as the personification of path B. As Michell (1999) puts it,

[o]perationism commits an elementary confusion: it confuses 'the act or process of measuring with the object of the act, namely the quantity in question' (Byerly, 1974, 376). Once this confusion is exposed, Stevens' definition of measurement is revealed for the charade it is. In general, psychologists have declined to acknowledge this. Stevens had given them what they wanted: a definition which, if accepted, made the quantity objection magically invisible. Mainstream interest in the definition of measurement effectively ceased with receipt of that 'gift'. (177)


At the heart of Michell's objection to Stevens's approach to measurement is the question of whether measurement is an activity that involves some discovery about things in the world that exist independent of our attempt to measure them, and whether numbers are a natural part of the world or merely abstract symbols humans invented in service of mathematical reasoning. Michell (2005) characterizes himself as a scientific realist, and although this can mean many different things (e.g., see Chakravartty, 2017), it involves a commitment to the idea that science is the attempt to gain knowledge about both observable and unobservable aspects of the world. Hence, if measurement is to be scientific, it must be about the discovery not only of things that we can observe directly but also of things that we can only speculate about. Therefore, scientific realism and any strict adherence to operationalism are incompatible. Beyond this, Michell rejects what he considers the "amateurish" and "deeply ignorant" formalist interpretation of number Stevens had advanced, in which mathematics is characterized as a game of signs and rules, with numbers a human invention to facilitate playing of the game (Michell, 1999, 176; Stevens, 1951, 1958). In this sense, Michell's verdict on Stevens's theory of measurement is understandable, given that under this theory the specification of an operational procedure is in and of itself considered a sufficient condition for measurement (i.e., operationalism) and because it holds that numbers come into existence through symbolic assignment (i.e., a representational theory of measurement). There are other reasons for taking issue with the path B that Stevens took on the heels of the Ferguson Committee.
That is, even if we grant the representationalism of Campbell and the operationalism of Bridgman their status as alternatives to a more realist classical conception of measurement, Michell makes a strong case that the mixing and matching Stevens was doing in promoting his definition of measurement was done for self-serving reasons. The version of representationalism in the Stevens definition had highlighted Campbell's assignment "rules" while conveniently dropping all mention of Campbell's "laws." And the actual role of operationalism was easy to miss when just reading the Stevens definition or looking over the presentation of his scale taxonomy. The operationalist signal in the Stevens definition came primarily in what was missing: any mention of an attribute of measurement or the structure of the attribute. Only those who paid attention to how Stevens enacted his theory in practice would have understood the job his definition of measurement was meant to do. Absent this deeper understanding, the broadened definition of measurement Stevens was promoting, taken in isolation, provided a convenient justification for quantitative psychologists to claim that what they were doing was measurement, without ever having to take up the quantity objection. Michell (1997) sees the Stevens definition (and its positive reception) as symptomatic of a deeper pathology among quantitative psychologists: that they not only neglect the "scientific task" of empirically evaluating the measurability thesis in favor of the "instrumental task" that takes the thesis as a given, but that they also actively erect barriers that would keep people from recognizing that the quantitative status of psychological attributes is an open question in need of investigation.

Michell's critique is more nuanced than my rough summary, and it is well worth reading. However, I think it is also the case that Michell is a bit too hard on Stevens. To begin with, the "scientific" approach that Michell (e.g., 1990, 1999, 2020b) has in mind to evaluate a measurability thesis will always depend on limits in our ability to make fine distinctions between empirical relationships, and in many (if not most) cases, our observations will be mediated by some operational procedure for realizing the relationships. In the context of measuring temperature, Chang (2004) points to the problem of nomic measurement. The only way to discover if there is a law that explains the relationship between the expansion of some fluid, Y (such as mercury), and a change in temperature, X, is to conduct experiments in which we systematically vary X and then observe Y. But such an experiment presupposes the availability of an instrument capable of transducing X onto Y. Therefore, in the absence of good instrumentation, the precise nature of the functional relationship X = f(Y), and whether it supports a quantitative structure for the attribute in question, will remain a mystery. The failure to detect a relationship could point to a problematic theory, a problematic instrument, or both. This is really quite similar to the problem Shepard pointed out in his critique of Stevens's operational approach to measurement, but it applies with equal force to the scientific realist. (See Sherry, 2011, for a much more complete exposition of this point and for a compelling critique of the particulars of Michell's classical realist perspective on measurement.) In this sense, the scientific and instrumental tasks of measurement are always commingled, which is what makes measurement such a challenging activity.
Michell rightly objects that the Stevens definition rules in almost everything as measurement. But Michell gives his readers little sense of the sustained program of research Stevens undertook to make the case, within his operational theory, that human sensations could be measured on a ratio scale as opposed to an interval or ordinal one. Stevens had, at the very least, put forward a falsifiable theory when he claimed to measure, for example, sones on a ratio scale with his method of fractionation. When Garner falsified this claim, Stevens proposed what he considered a superior method (direct magnitude estimation). Stevens comes across as a charlatan in Michell’s critique, and there may be some truth to this: although his theory of measurement was a bit flimsy when submitted to close inspection, he went to great lengths to promote it with pomp and circumstance in his rhetorical delivery. But if Stevens’s definition itself could be likened to the “benefit of theft over honest toil” (Michell, 1999, 174), his experimental research program could not. And it was the deficiencies in Stevens’s theory that played a distinct role in motivating Suppes and Zinnes (1963), Luce and Tukey (1964), and Krantz et al. (1971). The contribution of additive conjoint measurement by Luce and Tukey, in particular, is what Michell
regards as accomplishing what Stevens failed to do when faced with the quantity challenge posed by Campbell and the Ferguson Committee—provide an empirical framework that can be used to test a measurability thesis.
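To give a sense of the empirical bite of that framework, here is a sketch of the core idea of additive conjoint measurement; the notation is mine, following the standard presentation in Krantz et al. (1971). An ordering over pairs (a, x) of levels of two factors admits an additive representation only if directly testable ordinal conditions hold in the data:

```latex
% Additive representation sought by conjoint measurement, for some
% real-valued functions f and g on the two factors:
(a, x) \succsim (b, y) \iff f(a) + g(x) \ge f(b) + g(y)
% A necessary, directly testable ordinal condition (double cancellation):
(a, y) \succsim (b, z) \ \text{and} \ (b, x) \succsim (c, y)
  \implies (a, x) \succsim (c, z)
```

Observed violations of conditions like double cancellation count as evidence against quantitative structure, which is the kind of empirical test of a measurability thesis that Michell has in mind.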

10.7 Stevens’s Legacy to Measurement

On January 18, 1973, at the age of 66, Stevens passed away unexpectedly in his sleep while attending a conference in Vail, Colorado. He had, without a doubt, succeeded in rejuvenating both interest and research in psychophysical measurement (Galanter, 1974). At the time of his passing, numerous scales of subjective magnitude for physical modalities had been named and established. These included the sone scale of loudness, the veg scale of heaviness, the mak scale of visual length of lines, the chron scale of time intervals, the gust scale of taste, the bril scale of brightness, the var scale of visual area, the numer scale of visual numerousness of groups of spots, the flut scale of auditory flutter, the samp scale of electric shock, the pak scale of subjective finger span, the mel scale of subjective pitch of tones, and the enc scale of subjective inclination of a line (Stevens & Galanter, 1957; Warren & Warren, 1963). Stevens (1966b) had also been convinced that the methods he had championed for direct magnitude estimation and cross-modality matching could be extended to develop measures for affective constructs of “social consciousness” along the lines initially suggested by Thurstone, but now with scales that would have ratio properties. His methods of direct scaling had become part of the canon of psychophysics,10 and he was delighted that students and colleagues were starting to refer to his power function relating physical stimulus to subjective response as “Stevens’s Law.”

Yet by the late 1960s, it was already becoming clear that neither his scales nor his methods were being met with the universal acclaim he might have expected. Operationalism had fallen out of favor, and psychological research was moving out of the laboratory and into the messier reality of social settings. Stevens had already lamented in his 1974 autobiography that “psychology has deserted psychophysics,” and the decades that followed his death have mostly borne this out.
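The power function just mentioned can be stated compactly; as a sketch (the notation is mine, not Stevens’s):

```latex
% Stevens's power law: subjective magnitude \psi grows as a power of
% physical stimulus intensity \varphi, with unit constant k and an
% exponent \beta that Stevens estimated separately for each modality.
\psi = k\,\varphi^{\beta}
% Taking logarithms shows why magnitude-estimation data were checked
% for linearity in log-log coordinates, with slope \beta:
\log \psi = \log k + \beta \log \varphi
```

The invariance Stevens claimed was that different observers, standards, and procedures yielded magnitude estimates consistent with a single exponent for a given modality.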
In particular, the various subjective magnitude scales he and others had developed between 1950 and 1970 have seen no widespread application. And although (implicitly) direct-response scaling methods have become widespread throughout the human sciences, with subjects asked to rate subjective stimuli in terms of discrete categories (as in a survey with items that follow a Likert response scale), these category-based methods were the ones that Stevens himself regarded as a weak substitute for his magnitude estimation approach. Stevens (1946, 1951) had offered a simple and encompassing definition of measurement as the assignment of numerals to objects or events according to rules. Michell (1997) argues that the basic syntax of this definition, measurement is the assignment of X to Y according to Z, attained widespread acceptance in
psychology soon after Stevens proposed it, and little seems to have changed in the years that have followed Michell’s review. Similarly, Stevens’s hierarchy of ratio, interval, ordinal, and nominal scales has attained widescale usage, even in the physical sciences.

In this chapter, I have shown why and how Stevens thought it was possible to place some event that began as a physical stimulus onto a ratio scale that served as a measure of subjective sensation. Why he thought it was possible stemmed from his contention that all measurement was tantamount to numeric mapping and that this mapping could be defined operationally by the relations between the events induced by an experimental procedure. How he thought it was possible was through demonstration that his method of magnitude estimation produced invariant results in accordance with a power law. With these pieces in place, the basis for many of the claims Stevens was making in On the Theory of Scales of Measurement is easier to discern, even if the claims themselves do not necessarily generalize or hold up to critical scrutiny.

One thing that is worth appreciating is that in his psychophysical program of research Stevens never showed much interest in operations developed to attach numbers to represent individual differences among people. In using people as measuring instruments in his sensory experiments, the object of measurement was the sensory “event,” not the human. Although he had described intelligence tests as examples of measures on either ordinal or interval scales, if he gave the matter much thought, we see little evidence of it in his published work. After all, the analogy between “operations” when students sit for a test and “operations” when subjects sit for a psychophysical experiment is not so clear.
In a psychophysical experiment, the procedure requires the subjects themselves to produce a numeric coding or judgment for each stimulus, and it is the aggregation of these results that leads to the formation of a scale that serves to measure physical events in subjective magnitudes. What is the analogous operational procedure in a testing context? Surely, students are not themselves involved in a process of direct magnitude estimation. Is the operational procedure related to the “judgments” students make when selecting an answer to each multiple-choice test item? If so, the operations occur with respect to each element in a student’s string of item responses, and statements about order would need to pertain to the pattern of coded elements in this string, not just to a total score. And if this were the case, then it would be a considerable challenge just to establish that the sum of item scores could be characterized using an ordinal scale, let alone an interval one. Perhaps instead we are to regard the measurement operation as the process of ranking students according to the number of items answered correctly on a test, making the results ordinal by definition? Even putting to the side the problem, inherent to operationalism, that every intelligence test would represent a novel measure of intelligence, these are questions about his theory that Stevens never addressed in the context of measuring the psychological attributes of people, which is
how his definition is often applied, as opposed to measuring the psychological perceptions of events as transduced by people, which was what motivated him to develop the theory.

It is somewhat ironic that there are aspects of Stevens’s theory of measurement that leave people dissatisfied for opposite reasons. To the classically oriented measurement theorist, the Stevens definition is both vacuous and far too broad, but there is some respect for the importance of the group structures and conditions for meaningful scale transformation that he was introducing (e.g., Michell, 1986). To the more pragmatically oriented statistician, there is usually a willingness to embrace Stevens’s broadened definition, paired with a negative reaction against the imposition of restrictions on statistical computations based on the strength of the scale (e.g., Lord, 1953; Lord & Novick, 1968). Those in the latter camp who find themselves critical of the prescriptiveness of Stevens’s scale taxonomy would be well served to pay closer attention to the operational commitments that were implicit in his theory of measurement and the problems these commitments can entail. But this operationism was played down by Stevens (1946, 1951), and to this day, few who cite Stevens are familiar with it or its intended application in the psychophysical context.

Many of the criticisms of Stevens’s scale taxonomy are of the “have your cake and eat it too” variety (e.g., Velleman & Wilkinson, 1993). Critics are all too happy to accept the broadest possible definition of what constitutes measurement, and to justify it on the pragmatic grounds that a measurement result is “useful,” without clarifying what such a justification needs to entail. In many instances, the use of the measurement in question is to make comparative judgments about magnitude.
If so, then questions about the boundary between an attribute of an object for which it is only sensible to speak about order and an attribute for which it is also sensible to speak of magnitude with respect to differences or ratios become entirely relevant. In this regard, Stevens (1968) was not wrong when he remarked that

however much we may agree that the statistical test cannot be cognizant of the empirical meaning of the numbers, the same privilege can scarcely be extended to experimenters. . . . A statistician, like a computer, may perhaps feign indifference to the origin of the numbers that enter into a statistical computation, but that indifference is not likely to be shared by the scientist. (849)

There are many in the human sciences who practice measurement or make use of measurement results but have never given much thought to what measurement entails, yet who have some cursory awareness that variables differ in their numeric scale properties. This is, to be sure, a major part of Stevens’s legacy. But there is more to Stevens than meets the eye if one’s only
acquaintance with him is his 1946 essay introducing his theory of measurement. Stevens was no armchair philosopher but a man of action, and in his own work, the epistemological case he built for measurement on a ratio scale was made with considerable and sustained experimentation, analysis, and argument. This should properly be regarded as the more positive aspect of Stevens’s legacy, especially for those who by their practices are implicitly following in his operationalist footsteps. He concluded his autobiography with the sentiment, “What has there been except the joy of search and solution in the contest to decipher nature’s ways and the great good fun of carrying on?” One can debate the merits of the methods he employed for his search and solution, but there can be no doubt that Smitty Stevens relished the contest.
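The dispute over “permissible statistics” running through this section can be made concrete with a small example (mine, not drawn from Stevens or his critics). If scores carry only ordinal information, then any strictly increasing transformation preserves that information, yet such a transformation can reverse a comparison of group means:

```python
import math

def mean(xs):
    return sum(xs) / len(xs)

def median(xs):
    # Median of a list: midpoint of the two central order statistics.
    s = sorted(xs)
    n = len(s)
    return (s[n // 2 - 1] + s[n // 2]) / 2 if n % 2 == 0 else s[n // 2]

# Hypothetical "scores" for two groups; suppose they are only ordinal.
group_a = [1, 2, 2, 10]
group_b = [3, 3, 4, 4]

print(mean(group_a), mean(group_b))  # 3.75 3.5 -> group A's mean is higher

# A strictly increasing (order-preserving) transformation such as the
# square root is "permissible" for ordinal data, yet it reverses the
# comparison of means:
t_a = [math.sqrt(x) for x in group_a]
t_b = [math.sqrt(x) for x in group_b]
print(mean(t_a) < mean(t_b))  # True -> now group B's mean is higher

# An order-based statistic such as the median keeps its direction under
# any monotone transformation:
print(median(group_a) < median(group_b))  # True
print(median(t_a) < median(t_b))          # True (same direction)
```

This is the kind of example that motivates the prescriptive reading of Stevens’s taxonomy; Lord (1953) and others pushed back on how far such prescriptions should constrain statistical practice.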

10.8 Sources and Further Reading

My sources for Stevens’s biography are Stevens (1974) and Miller (1975). For good retrospectives on Stevens’s legacy to measurement and psychophysical scaling, see Luce (1972), Killeen (1976), Matheson (2006), Ward (2017), and the contributions in the edited book Sensation and Measurement assembled in his honor by Moskowitz, Scharf, and Stevens (1974). The most pointed critique of Stevens’s theory of measurement is in Michell (1999). For the reader interested in Stevens in his own words, I would recommend Stevens (1946, 1956, 1959, 1961, 1966a, 1966b, 1968) and Stevens and Galanter (1957), which collectively not only provide a good sense for his views on measurement, scaling, and psychophysics but also provide insights into his program of experimental research.

At the end of Section 10.6, I allude to a fully formalized representational theory of measurement (RTM) due primarily to the collective efforts of Patrick Suppes, Duncan Luce, David Krantz, and Amos Tversky. A good starting point for this work is Suppes and Zinnes (1963), but this can sometimes be hard to find. The magnum opus for RTM is the three-volume Foundations of Measurement series. Cliff (1992) famously called RTM the “revolution that never happened,” in part because it is highly abstract. However, I find that the first chapter of Krantz et al. (1971) provides a nice introduction and gives a very good sense of the general approach. For another “translation” of RTM for a broad audience, see Michell (1990).

Finally, in this and the previous chapter I have often alluded to pragmatism as a rationale that both Thurstone and Stevens used as a justification for measurement without going into any detail about what a pragmatic perspective entails. For a terrific exposition I highly recommend the book A Pragmatic Perspective of Measurement by David Torres Irribarra.


Notes

1 He wrote, “They finally let me have the degree, though, because administrators would rather change their rules than look silly” (Stevens, 1974, 434).
2 Campbell was inconsistent in the terms he would use to describe A, B, and C. He tended to describe them as either bodies, substances, or systems and sometimes alternated between these terms within neighboring sentences. To simplify matters, I use the more neutral term object throughout.
3 Campbell himself never had the opportunity to engage in debate with Stevens. By the time of Stevens’s publication, Campbell was in poor health following the destruction of his home in southern England near the end of World War II. He died a few years later in 1949.
4 In what follows, for ease of exposition, I drop the distinction Stevens was making between numerals and numbers.
5 However, Thurstone did explore, on more than one occasion, methods for establishing a zero point for an absolute magnitude scale.
6 Examples: Use a standard whose level does not impress the O [observer] as being either extremely loud or extremely soft; call the standard by a number like 10 that is easily multiplied and divided; randomize the order of presentation; make the experimental sessions short.
7 See Stevens (1955a, 1955b, 1956, 1957, 1958, 1959, 1961, 1964, 1966, 1968, 1971, 1975).
8 “In my own experience, the method called magnitude estimation has generally proved superior to fractionation; so much so that unless some unexpected evidence turns up, I would anticipate no further need to use the method of fractionation for scaling purposes” (Stevens, 1959, 996).
9 A different tack that Stevens could have taken would have been to back off the claim that loudness could be measured on a ratio scale, since the latter, as characterized by Stevens himself, required evidence of both the equality of ratios and the equality of differences. In contrast, the logarithmic interval scale (see Equation 10.14) would only require evidence related to equality of ratios, and this is what Stevens had been able to demonstrate through his experimental results. But this was a tack Stevens was unwilling to take because he regarded the establishment of a ratio scale as the goal of psychophysical measurement. Somewhat curiously, he would describe the logarithmic interval scale as “mathematically interesting” but “empirically useless” (Stevens, 1957).
10 And this is still the case into the 21st century; see Kingdom and Prins (2016).

REFERENCES

AERA, APA, & NCME. (2014). Standards for Educational and Psychological Testing. Washington, DC: American Psychological Association, Inc.
Aikens, H. A., Thorndike, E. L., & Hubbell, E. (1902). Correlations among perceptive and associative processes. Psychological Review, 9(4), 374–382.
Andrich, D. (1978). Relationships between the Thurstone and Rasch approaches to item scaling. Applied Psychological Measurement, 2(3), 451–462. https://doi.org/10.1177/014662167800200319
Ayres, L. P. (1911). The Binet-Simon measuring scale for intelligence: some criticisms and suggestions. Psychological Clinic, 5(6), 187–196.
Bagley, W. C. (1922a). Educational determinism; or democracy and the I.Q. School and Society, 15, 373–384.
Bagley, W. C. (1922b). Professor Terman’s determinism: a rejoinder. The Journal of Educational Research, 6(5), 371–385.
Ball, W. W. R. (1889). A History of the Study of Mathematics at Cambridge. Cambridge: Cambridge University Press.
Ballou, D. (2009). Test scaling and value-added measurement. Education Finance and Policy, 4, 351–383.
Bartholomew, D. J. (1995). Spearman and the origin and development of factor analysis. British Journal of Mathematical and Statistical Psychology, 48(2), 211–220. https://doi.org/10.1111/j.2044-8317.1995.tb01060.x
Bartholomew, D. J., Deary, I. J., & Lawn, M. (2009a). A new lease of life for Thomson’s bonds model of intelligence. Psychological Review, 116(3), 567–579. https://doi.org/10.1037/a0016262
Bartholomew, D. J., Deary, I. J., & Lawn, M. (2009b). The origin of factor scores: Spearman, Thomson and Bartlett. British Journal of Mathematical and Statistical Psychology, 62, 569–582.
Bell, J. (1912). Recent literature on the Binet tests. Journal of Educational Psychology, 3(2), 101–110.
Bell, J. (1916). The influence of Alfred Binet. Journal of Educational Psychology, 7(10), 611–612.


Binet, A. (1890a). The perceptions of lengths and numbers in some small children. In R. Pollack & M. Brenner (Eds.) (1969), The Experimental Psychology of Alfred Binet. New York: Springer.
Binet, A. (1890b). Children’s perceptions. In R. H. Pollack & M. W. Brenner (Eds.) (1969), The Experimental Psychology of Alfred Binet. New York: Springer.
Binet, A. (1890c). Studies of movements in some young children. In R. H. Pollack & M. W. Brenner (Eds.) (1969), The Experimental Psychology of Alfred Binet. New York: Springer.
Binet, A. (1909/1975). Modern Ideas about Children. Translated by S. Heisler.
Binet, A., & Henri, V. (1896). Psychologie individuelle. L’Année Psychologique, 2, 411–465.
Binet, A., & Simon, T. (1916). The Development of Intelligence in Children (the Binet-Simon Scale). Leopold Classic Library. Translated by Elizabeth S. Kite.
Boake, C. (2002). From the Binet-Simon to the Wechsler-Bellevue: Tracing the history of intelligence testing. Journal of Clinical and Experimental Neuropsychology, 24(3), 383–405. https://doi.org/10.1076/jcen.24.3.383.981
Bobertag, O. (1911). Über Intelligenzprüfungen (nach der Methode von Binet und Simon). Zeitschrift für angewandte Psychologie, 5(2), 105–203.
Bock, R. D. (1997). A brief history of item response theory. Educational Measurement: Issues and Practice, 16(4), 21–33.
Bock, R. D. (2007). Rethinking Thurstone. In R. Cudeck & R. C. MacCallum (Eds.), Factor Analysis at 100: Historical Developments and Future Directions. Mahwah, NJ: Lawrence Erlbaum Associates.
Bock, R. D., & Jones, L. V. (1968). The Measurement and Prediction of Judgment and Choice. San Francisco: Holden-Day.
Bonser, F. G. (1910). Reasoning Ability of Children in Grades 4, 5, 6. New York: Teachers College.
Boring, E. G. (1920). The logic of the normal law of error in mental measurement. American Journal of Psychology, 31(1), 1–33.
Boring, E. G. (1921). The stimulus-error. American Journal of Psychology, 32(4), 449–471.
Boring, E. G. (1923). Intelligence as the tests test it. New Republic, 35(6), 35–37.
Boring, E. G. (1950). A History of Experimental Psychology (2nd ed.). New York: Appleton-Century-Crofts.
Borsboom, D. (2005). Measuring the Mind: Conceptual Issues in Contemporary Psychometrics. Cambridge: Cambridge University Press.
Borsboom, D. (2008). Latent variable theory. Measurement, 6(1–2), 25–53. https://doi.org/10.1080/15366360802035497
Borsboom, D., Mellenbergh, G. J., & Van Heerden, J. (2004). The concept of validity. Psychological Review, 111, 1061–1071.
Bradley, R. A., & Terry, M. E. (1952). Rank analysis of incomplete block designs: I. The method of paired comparisons. Biometrika, 39, 324–345.
Brennan, R. L. (2001a). An essay on the history and future of reliability from the perspective of replications. Journal of Educational Measurement, 38(4), 295–317. https://doi.org/10.1111/j.1745-3984.2001.tb01129.x
Brennan, R. L. (2001b). Generalizability Theory. Springer. https://doi.org/10.1007/978-1-4757-3456-0
Bridgman, P. W. (1927). The Logic of Modern Physics. New York: Macmillan.


Briggs, D. C. (2013). Measuring growth with vertical scales. Journal of Educational Measurement, 50(2), 204–226.
Briggs, D. C. (2021). The history of scaling and its relationship to measurement. In B. Clauser (Ed.), The History of Educational Measurement. New York: Routledge.
Briggs, D. C., & Domingue, B. (2013). The gains from vertical scaling. Journal of Educational and Behavioral Statistics, 38(6), 551–576.
Briggs, D. C., Maul, A. M., & McGrane, J. (forthcoming). On the nature of measurement. In L. Cook & M. Pitoniak (Eds.), Educational Measurement (5th ed.).
Briggs, D. C., & Peck, F. A. (2015). Using learning progressions to design vertical scales that support coherent inferences about student growth. Measurement: Interdisciplinary Research & Perspectives, 13, 75–99.
Briggs, D. C., & Weeks, J. P. (2009a). The impact of vertical scaling decisions on growth interpretations. Educational Measurement: Issues & Practice, 28(4), 3–14.
Briggs, D. C., & Weeks, J. P. (2009b). The sensitivity of value-added modeling to the creation of a vertical scale. Education Finance & Policy, 4(4), 384–414.
Brigham, C. (1923). A Study of American Intelligence. Princeton, NJ: Princeton University Press.
Brigham, C. (1930). Intelligence tests of immigrant groups. Psychological Review, 37, 158–165.
Brookes, M. (2004). Extreme Measures: The Dark Visions and Bright Ideas of Francis Galton (1st U.S. ed.). New York: Bloomsbury.
Brown, W. (1910). Some experimental results in the correlation of mental abilities. British Journal of Psychology, 3, 296–322.
Brown, W. (1911). The Essentials of Mental Measurement. Cambridge: Cambridge University Press.
Brown, W., & Stephenson, W. (1933). A test of the theory of two factors. British Journal of Psychology. General Section, 23(4), 352–367. https://doi.org/10.1111/j.2044-8295.1933.tb00673.x
Brown, W., & Thomson, G. H. (1921). The Essentials of Mental Measurement (2nd ed.). Cambridge: Cambridge University Press.
Brown, W., & Thomson, G. H. (1940). The Essentials of Mental Measurement (4th ed.). Cambridge: Cambridge University Press.
Bulmer, M. G. (2003). Francis Galton: Pioneer of Heredity and Biometry. Baltimore, MD: Johns Hopkins University Press.
Burt, C. (1909). Experimental tests of general intelligence. British Journal of Psychology, 3, 94–177.
Burt, C. (1914a). The measurement of intelligence by the Binet tests: Part I. The Eugenics Review, 6(1), 36–50.
Burt, C. (1914b). The measurement of intelligence by the Binet tests: Part II. The Eugenics Review, 6(2), 140–152.
Burt, C. (1946). Charles Edward Spearman. The Psychological Review, 53(2), 67–71.
Burt, C. (1960). Gustav Theodor Fechner: Elemente der Psychophysik. British Journal of Statistical Psychology, 13(1), 1–10.
Burt, C. (1962). Francis Galton and his contributions to psychology. The British Journal of Statistical Psychology, 15(1), 1–49.
Bushell, W. F. (1960). The Cambridge mathematical tripos. The Mathematical Gazette, 44(349), 172–179.
Byerly, H. C. (1974). Realist foundations of measurement. In K. F. Schaffner & R. S. Cohen (Eds.), P.S.A. 1972. Dordrecht: Reidel, pp. 375–384.


Campbell, N. R. (1920). Physics, the Elements. Cambridge: Cambridge University Press.
Campbell, N. R. (1928). An Account of the Principles of Measurement and Calculation. London: Longman, Green & Co.
Campbell, N. R. (1940). Physics and psychology. British Association for the Advancement of Science, 2, 347–348.
Carson, J. (1993). Army alpha, and the search for army intelligence. History of Science Society, 84(2), 278–309.
Cattell, J. M. (1890). Mental tests and measurements. Mind, 15(59), 373–381.
Cattell, R. B. (1945). The life and work of Charles Spearman. Journal of Personality, 14(2), 85–92. https://doi.org/10.1111/j.1467-6494.1945.tb01040.x
Cattell, R. B. (1963). Theory for fluid and crystallized intelligence: A critical experiment. Journal of Educational Psychology, 54, 1–22.
Cattell, R. B. (1966). The scree test for the number of factors. Multivariate Behavioral Research, 1(2), 245–276. https://doi.org/10.1207/s15327906mbr0102_10
Chakravartty, A. (2017). Scientific realism. In Edward N. Zalta (Ed.), The Stanford Encyclopedia of Philosophy (Summer 2017 ed.). https://plato.stanford.edu/archives/sum2017/entries/scientific-realism/
Chang, H. (2004). Inventing Temperature: Measurement and Scientific Progress. Oxford: Oxford University Press.
Chang, H. (2019). Operationalism. In Edward N. Zalta (Ed.), The Stanford Encyclopedia of Philosophy (Winter 2019 ed.). https://plato.stanford.edu/archives/win2019/entries/operationalism/
Chapman, P. D. (1988). Schools as Sorters: Lewis M. Terman, Applied Psychology, and the Intelligence Testing Movement, 1890–1930. New York: New York University Press.
Chiang, H., Wellington, A., Hallgren, K., Speroni, C., Herrmann, M., Glazerman, S., & Constantine, J. (2015). Evaluation of the Teacher Incentive Fund: Implementation and Impacts of Pay-for-performance after Two Years (NCEE 2015–4020). Washington, DC: National Center for Education Evaluation and Regional Assistance, Institute of Education Sciences, U.S. Department of Education. https://ies.ed.gov/ncee/pubs/20154020/pdf/20154020.pdf
Cliff, N. (1992). Abstract measurement theory and the revolution that never happened. Psychological Science, 3, 186–190.
Cobb, P. W. (1932). Weber’s law and the Fechnerian muddle. Psychological Review, 39, 533–551.
Collingwood, R. G. (1923). Review of Spearman’s The Nature of Intelligence. Oxford Magazine, 42, 117–118.
Cowan, R. S. (1977). Nature and nurture: The interplay of biology and politics in the work of Francis Galton. Studies in the History of Biology, 1, 133–208.
Cravens, H. (1987). Applied science and public policy: The Ohio Bureau of Juvenile Research and the problem of juvenile delinquency, 1915–1930. In M. Sokal (Ed.), Psychological Testing and American Society, 1890–1930. New Brunswick: Rutgers University Press.
Crease, R. P. (2011). World in Balance: The Historical Quest for an Absolute System of Measurement. New York: W. W. Norton & Company, Inc.
Cronbach, L. J. (1947). Test “reliability”: Its meaning and determination. Psychometrika, 12, 1–16. https://doi.org/10.1007/BF02289289
Cronbach, L. J. (1951). Coefficient alpha and the internal structure of tests. Psychometrika, 16, 297–334.
Cronbach, L. J. (1975). Five decades of public controversy over mental testing. American Psychologist, 30(1), 1–14.


Cronbach, L. J., Gleser, G. C., Nanda, H., & Rajaratnam, N. (1972). The Dependability of Behavioral Measurements: Theory of Generalizability of Scores and Profiles. New York: John Wiley.
Cronbach, L. J., & Meehl, P. E. (1955). Construct validity and psychological tests. Psychological Bulletin, 52, 281–302.
Cudeck, R., & MacCallum, R. C. (Eds.). (2007). Factor Analysis at 100: Historical Developments and Future Directions. Mahwah, NJ: Lawrence Erlbaum Associates.
Decroly, S. O., & Degand, J. (1910). La mesure de l’intelligence chez des enfants normaux, d’après les tests de Binet et Simon. Archives de Psychologie, 9, 81–108.
Deary, I. J., Lawn, M., & Bartholomew, D. J. (2008). A conversation between Charles Spearman, Godfrey Thomson, and Edward L. Thorndike: The International Examinations Inquiry Meetings 1931–1938. History of Psychology, 11(2), 122–142.
Edgeworth, F. Y. (1888). The statistics of examinations. Journal of the Royal Statistical Society, 51, 598–635.
Fancher, R. E. (1985a). Spearman’s original computation of g: A model for Burt? British Journal of Psychology, 76, 341–352.
Fancher, R. E. (1985b). The Intelligence Men, Makers of the I.Q. Controversy (1st ed.). New York: Norton.
Fancher, R. E. (2009). Scientific cousins: The relationship between Charles Darwin and Francis Galton. The American Psychologist, 64(2), 84–92. https://doi.org/10.1037/a0013339
Fechner, G. T. (1860). Elemente der Psychophysik. Leipzig: Breitkopf and Hartel; English translation by H. E. Adler, 1966, Elements of Psychophysics, Vol. 1, D. H. Howes & E. G. Boring (Eds.), New York: Rinehart and Winston.
Fechner, G. T. (1887 [1987]). On the principles of measurement and on Weber’s law. Translated and edited by Eckart Scheerer. Psychological Research, 49, 213–219.
Ferguson, A., Myers, C. S., Bartlett, R. J., Banister, H., Bartlett, F. C., Brown, W., Campbell, N. R., Craik, K. J. W., Drever, J., Guild, J., Houstoun, R. A., Irwin, J. O., Kaye, G. W. C., Philpott, S. J. F., Richardson, L. F., Shaxby, J. H., Smith, T., Thouless, R. H., & Tucker, W. S. (1938). Quantitative estimates of sensory events: Interim report of the committee appointed to consider and report upon the possibility of quantitative estimates of sensory events. British Association for the Advancement of Science, 108, 277–334.
Ferguson, A., Myers, C. S., Bartlett, R. J., Banister, H., Bartlett, F. C., Brown, W., Campbell, N. R., Craik, K. J. W., Drever, J., Guild, J., Houstoun, R. A., Irwin, J. O., Kaye, G. W. C., Philpott, S. J. F., Richardson, L. F., Shaxby, J. H., Smith, T., Thouless, R. H., & Tucker, W. S. (1940). Quantitative estimates of sensory events: Final report of the committee appointed to consider and report upon the possibility of quantitative estimates of sensory events. Advancement of Science, 1, 331–349.
Forrest, D. W. (1974). Francis Galton: The Life and Work of a Victorian Genius. London: Elek.
Forsythe, A. R. (1935). Old Tripos days at Cambridge. The Mathematical Gazette, 19(234), 162–179.
Freedman, D., Pisani, R., & Purves, R. (2007). Statistics (4th ed.). New York: W. W. Norton & Company.
Galanter, E. (1974). Stanley Smith Stevens 1906–1973. Psychometrika, 39(1), 1–2. https://doi.org/10.1007/BF02291572
Galton, F. (1863). A development of the theory of cyclones. Proceedings of the Royal Society, 12(12), 385–386.


Galton, F. (1865). Hereditary talent and character. Macmillan’s Magazine, 12, 157–166, 318–327.
Galton, F. (1869). Hereditary Genius. London: Macmillan. Reprinted 1979, Friedmann, London.
Galton, F. (1872). Statistical inquiries into the efficacy of prayer. Fortnightly Review, 12, 125–135.
Galton, F. (1874a). English Men of Science: Their Nature and Nurture. London: Macmillan.
Galton, F. (1874b). Proposal to apply anthropological statistics from schools. Journal of the Anthropological Institute, 3, 308–311.
Galton, F. (1875). Statistics by intercomparison with remarks on the law of frequency of error. Philosophical Magazine, 49, 33–46.
Galton, F. (1877). Address to the Anthropological Department of the British Association. London: William Clowes and Sons.
Galton, F. (1879a). The geometric mean, in vital and social statistics. Proceedings of the Royal Society, 29, 365–367.
Galton, F. (1879b). Psychometric experiments. Brain, 2, 149–162.
Galton, F. (1880). The opportunities of science masters at schools. Nature, 22, 9–10.
Galton, F. (1883). Inquiries into Human Faculty and Its Development. London: Macmillan.
Galton, F. (1884). Measurement of character. Fortnightly Review, 36, 179–185.
Galton, F. (1885a). On the anthropometric laboratory at the late International Health Exhibition. Journal of the Anthropological Institute, 14, 205–218.
Galton, F. (1885b). Some results of the Anthropometric Laboratory. Journal of the Anthropological Institute, 14, 275–287.
Galton, F. (1886). Regression towards mediocrity in hereditary stature. The Journal of the Anthropological Institute of Great Britain and Ireland, 15, 246–263.
Galton, F. (1888). Co-relations and their measurement, chiefly from anthropometric data. Proceedings of the Royal Society of London, 45, 135–145.
Galton, F. (1889). Natural Inheritance. London: Macmillan.
Galton, F. (1890). Kinship and correlation. North American Review, 150, 419–431.
Galton, F. (1892). Finger Prints. London: Macmillan.
Galton, F. (1904). Eugenics: Its definition, scope, and aims. The American Journal of Sociology, 10(1), 1–25. https://doi.org/10.1086/211280
Galton, F. (1906). Anthropometry at schools. Journal of Preventive Medicine, 14, 93–98.
Galton, F. (1908). Memories of My Life. London: Methuen.
Garner, W. R. (1954). A technique and a scale for loudness measurement. The Journal of the Acoustical Society of America, 26(1), 73–88. https://doi.org/10.1121/1.1907294
Garner, W. R. (1958). Advantages of the discriminability criterion for a loudness scale. The Journal of the Acoustical Society of America, 30(11), 1005–1012. https://doi.org/10.1121/1.1909436
Garner, W. R. (1959). The development of context effects in half-loudness judgments. Journal of Experimental Psychology, 58(3), 212–219. https://doi.org/10.1037/h0041966
Garner, W. R., & Hake, H. W. (1951). The amount of information in absolute judgments. Psychological Review, 58(6), 446–459. https://doi.org/10.1037/h0054482
Garnett, J. C. M. (1919). General ability, cleverness, and purpose. British Journal of Psychology, 9, 345–366.
Garnett, J. C. M. (1920). The single general factor in dissimilar mental measurements. British Journal of Psychology, 10, 242–258.
Garnett, J. C. M. (1932). Further notes on the single general factor in mental measurements. British Journal of Psychology, 22, 364–372.


Gascoigne, J. (1984). Mathematics and meritocracy: The emergence of the Cambridge Mathematical Tripos. Social Studies of Science, 14, 547–584.
Goddard, H. (1910). Four hundred feeble-minded children classified by the Binet method. Pedagogical Seminary, 17, 387–397.
Goddard, H. (1911). Two thousand children measured by the Binet measuring scale of intelligence. Pedagogical Seminary, 18, 232–259.
Goddard, H. (1916). Introduction to A. Binet & T. Simon (Eds.), The Development of Intelligence in Children (the Binet-Simon Scale). Leopold Classic Library, pp. 5–8.
Gökyigit, E. A. (1994). The reception of Francis Galton's Hereditary Genius in the Victorian periodical press. Journal of the History of Biology, 27, 215–240.
Gorsuch, R. L. (1983). Factor Analysis (2nd ed.). Hillsdale, NJ: L. Erlbaum Associates.
Gould, S. J. (1981). The Mismeasure of Man (1st ed.). New York: Norton.
Guilford, J. P. (1936). Psychometric Methods (1st ed.). New York: McGraw-Hill.
Guilford, J. P. (1957). Louis Leon Thurstone 1887–1955: A Biographical Memoir. Washington, DC: National Academy of Sciences.
Gulliksen, H. O. (1950). Theory of Mental Tests. New York: Wiley.
Guttman, L. (1945). A basis for analyzing test-retest reliability. Psychometrika, 10, 255–282.
Guttman, L. (1953). Reliability formulas that do not assume experimental independence. Psychometrika, 18, 225–239.
Hardcastle, G. L. (1995). S. S. Stevens and the origins of operationism. Philosophy of Science, 62(3), 404–424. https://doi.org/10.1086/289875
Hart, B., & Spearman, C. (1912). General ability, its existence and nature. British Journal of Psychology, 5, 51–84.
Hearnshaw, L. S. (1964). A Short History of British Psychology, 1840–1940. London: Methuen.
Heidelberger, M. (2004). Nature from Within: Gustav Fechner and His Psychophysical Worldview. Translated by Cynthia Klohr. Pittsburgh: University of Pittsburgh Press.
Hölder, O. (1901). Die Axiome der Quantität und die Lehre vom Mass. Berichte über die Verhandlungen der Königlich Sächsischen Gesellschaft der Wissenschaften zu Leipzig, Mathematisch-Physische Klasse, 53, 1–46 [translated in Michell and Ernst, 1996, 1997].
Holland, P. W. (1990). On the sampling theory foundations of item response theory models. Psychometrika, 55(4), 577–601.
Holzinger, K. J. (1945). Spearman as I knew him. Psychometrika, 10, 231–235. https://doi.org/10.1007/BF02288890
Holzinger, K. J., & Swineford, F. (1937). The bi-factor method. Psychometrika, 2, 41–54.
Horn, J. L. (1965). A rationale and test for the number of factors in factor analysis. Psychometrika, 30(2), 179–185. https://doi.org/10.1007/BF02289447
Horn, J. L., & McArdle, J. J. (2007). Understanding human intelligence since Spearman. In R. Cudeck & R. C. MacCallum (Eds.), Factor Analysis at 100. Mahwah, NJ: Lawrence Erlbaum Associates, pp. 205–247.
Humphry, S. (2013). A middle path between abandoning measurement and measurement theory. Theory & Psychology, 23(6), 770–785. https://doi.org/10.1177/0959354313499638
Humphry, S. (2017). Psychological measurement: Theory, paradoxes, and prototypes. Theory & Psychology, 27(3), 407–418. https://doi.org/10.1177/0959354317699099
James, W. (1890). Principles of Psychology. New York: Holt, Rinehart and Winston.
JCGM. (2012). International Vocabulary of Metrology—Basic and General Concepts and Associated Terms (VIM) (3rd ed.) (2008 version with minor corrections), Joint Committee for Guides in Metrology. www.bipm.org/en/publications/guides/vim.html


Jeronutti, A. (1912). Ricerche psicologiche sperimentali sugli alunni molto intelligenti. Lab. di Psicol. Sperim. della Reg. Univ. Roma. https://www.google.com/books/edition/Journal_of_Educational_Psychology/P8RMAAAAYAAJ?hl=en&gbpv=1&dq=Jeronutti,+A.+(1912).+Ricerche+psicologiche+sperimentali+sugli+alunni+molto+intelligenti.+Lab.+di+Psicol.+Sperim.+della+Reg.+Univ.+Roma&pg=PA285&printsec=frontcover
Johnston, K. L. (1911). M. Binet's method for the measurement of intelligence—some results. The Journal of Experimental Pedagogy and Training College Record, 1(1), 24–31.
Jones, L. V. (1971). The nature of measurement. In R. L. Thorndike & W. H. Angoff (Eds.), Educational Measurement (2nd ed.). Washington, DC: American Council on Education, pp. 335–355.
Jones, L. V. (1996). Thelma Gwin Thurstone (1897–1993). American Psychologist, 51(4), 416–417.
Jones, L. V. (2007). Remembering L. L. Thurstone. In R. Cudeck & R. C. MacCallum (Eds.), Factor Analysis at 100: Historical Developments and Future Directions. Mahwah, NJ: Lawrence Erlbaum Associates.
Kahneman, D. (2011). Thinking, Fast and Slow (1st ed.). New York: Farrar, Straus and Giroux.
Kane, M. T. (2006). Validation. In R. L. Brennan (Ed.), Educational Measurement (4th ed.). Santa Barbara: Greenwood Publishing Group, pp. 17–64.
Kelley, T. L. (1921). The reliability of test scores. Journal of Educational Research, 3, 370–379.
Kelley, T. L. (1923). The principles and technique of mental measurement. American Journal of Psychology, 34(3), 408–432.
Kelley, T. L. (1928). Crossroads in the Mind of Man: A Study of Differentiable Mental Abilities. Stanford, CA: Stanford University Press.
Kelley, T. L. (1942). The reliability coefficient. Psychometrika, 7, 75–83.
Kevles, D. J. (1968). Testing the Army's intelligence: Psychologists and the military in World War I. Journal of American History, 55, 565–581.
Kevles, D. J. (1985). In the Name of Eugenics: Genetics and the Uses of Human Heredity (1st ed.). New York: Knopf.
Killeen, P. (1976). The schemapiric view: Notes on S. S. Stevens' philosophy and psychophysics. Journal of the Experimental Analysis of Behavior, 25, 123–128.
Kingdom, F. A. A., & Prins, N. (2016). Psychophysics: A Practical Introduction. Elsevier Academic Press. https://www.elsevier.com/books/psychophysics/kingdom/978-0-12-373656-7
Kish, L. (1965). Survey Sampling. New York: J. Wiley.
Kish, L. (1990). A choices profile: Rensis Likert: Social Scientist and Entrepreneur. Choices: The Magazine of Food, Farm, and Resource Issues, Agricultural and Applied Economics Association, 5(4), 36–39.
Kolen, M. J., & Brennan, R. L. (2004). Test Equating, Scaling, and Linking: Methods and Practices (2nd ed.). Springer. https://doi.org/10.1007/978-1-4757-4310-4
Kovacs, K., & Conway, A. R. (2016). Process overlap theory: A unified account of the general factor of intelligence. Psychological Inquiry, 27(3), 151–177. https://doi.org/10.1080/1047840X.2016.1153946
Krantz, D. H. (1972). Measurement structures and psychological laws. Science, New Series, 175(4029), 1427–1435. https://doi.org/10.2307/1732837
Krantz, D. H., Luce, R. D., Suppes, P., & Tversky, A. (1971). Foundations of Measurement, Vol. 1: Additive and Polynomial Representations. New York: Academic Press.


Krueger, F., & Spearman, C. (1906). Die Korrelation zwischen verschiedenen geistigen Leistungsfähigkeiten. Zeitschrift für Psychologie, XLIV, 50–114.
Kuder, G. F., & Richardson, M. W. (1937). The theory of the estimation of test reliability. Psychometrika, 2, 151–160.
Likert, R. (1932). A technique for the measurement of attitudes. Archives of Psychology, 22(140), 1–52.
Likert, R., Roslow, S., & Murphy, G. (1934). A simple and reliable method of scoring the Thurstone attitude scales. The Journal of Social Psychology, 5(2), 228–238.
Lindquist, E. F. (Ed.) (1951). Educational Measurement. Washington, DC: American Council on Education.
Lippmann, W. (1922). The Lippmann-Terman debate. In N. J. Block & G. Dworkin (Eds.) (1976), The IQ Controversy. New York: Pantheon Books, pp. 4–44.
Lombardo, P. A. (1985). Three generations, no imbeciles: New light on Buck v. Bell. New York University Law Review, 60(1), 30–62.
Lombardo, P. A. (2003). Taking eugenics seriously: Three generations of ??? are enough? Florida State University Law Review, 30(2), 191–218.
Lord, F. M. (1953). On the statistical treatment of football numbers. American Psychologist, 8(12), 750–751. https://doi.org/10.1037/h0063675
Lord, F. M. (1954). Further comment on "football numbers". American Psychologist, 9(6), 264–265. https://doi.org/10.1037/h0059284
Lord, F. M. (1980). Applications of Item Response Theory to Practical Testing Problems. Hillsdale, NJ: L. Erlbaum Associates.
Lord, F. M., & Novick, M. R. (1968). Statistical Theories of Mental Test Scores. Reading, MA: Addison-Wesley.
Lovie, A. D., & Lovie, P. (1993). Charles Spearman, Cyril Burt, and the origins of factor analysis. Journal of the History of the Behavioral Sciences, 29(4), 308–321.
Lovie, A. D., & Lovie, P. (1995). The cold equations: Spearman and Wilson on factor indeterminacy. British Journal of Mathematical and Statistical Psychology, 48, 237–253.
Lovie, P., & Lovie, A. D. (1996). Charles Edward Spearman, F.R.S. (1863–1945). Notes and Records: The Royal Society Journal of the History of Science, 50(1), 75–88.
Luce, R. D. (1959). Individual Choice Behavior. New York: Wiley.
Luce, R. D. (1972). What sort of measurement is psychophysical measurement? American Psychologist, 27(2), 96–106. https://doi.org/10.1037/h0032677
Luce, R. D. (1994). Thurstone and sensory scaling: Then and now. Psychological Review, 101(2), 271–277. https://doi.org/10.1037/0033-295X.101.2.271
Luce, R. D., & Edwards, W. (1958). The derivation of subjective scales from just noticeable differences. Psychological Review, 65, 222–237.
Luce, R. D., & Galanter, E. (1963). Psychophysical scaling. In R. D. Luce, R. Bush, & E. Galanter (Eds.), Handbook of Mathematical Psychology. New York: John Wiley and Sons, Inc., pp. 245–308.
Luce, R. D., & Tukey, J. W. (1964). Simultaneous conjoint measurement: A new type of fundamental measurement. Journal of Mathematical Psychology, 1(1), 1–27.
MacCorquodale, K., & Meehl, P. E. (1948). On a distinction between hypothetical constructs and intervening variables. Psychological Review, 55, 95–107.
Macfarlane, A. (1916). Lectures on Ten British Mathematicians of the Nineteenth Century. New York: John Wiley and Sons.
MacKenzie, D. (1976). Eugenics in Britain. Social Studies of Science, 6, 499–532.
Mari, L., Wilson, M., & Maul, A. (2021). Measurement Across the Sciences. Springer Series in Measurement Science and Technology. Cham: Springer.


Markus, K. A., & Borsboom, D. (2013). Frontiers of Test Validity Theory: Measurement, Causation, and Meaning. New York: Routledge.
Matheson, G. (2006). Intervals and ratios: The invariantive transformations of Stanley Smith Stevens. History of the Human Sciences, 19(3), 65–81. https://doi.org/10.1177/0952695106066542
Maul, A., Torres Irribarra, D., Mari, L., & Wilson, M. (2018). The quality of measurement results in terms of the structural features of the measurement process. Measurement, 116, 611–620.
Maxwell, J. C. (1873). A Treatise on Electricity and Magnetism. Oxford: Clarendon Press.
McGrane, J. A. (2015). Stevens' forgotten crossroads: The divergent measurement traditions in the physical and psychological sciences from the mid-twentieth century. Frontiers in Psychology, 6, 431. https://doi.org/10.3389/fpsyg.2015.00431
McNutt, S. (2013). "A Dangerous Man": Lewis Terman and George Stoddard, their Debates on Intelligence Testing, and the Legacy of the Iowa Child Welfare Research Station. The Annals of Iowa, The State Historical Society of Iowa.
Measurement. (n.d.). In Wikipedia. https://en.wikipedia.org/w/index.php?title=Measurement&oldid=995616873
Messick, S. (1989). Validity. In R. L. Linn (Ed.), Educational Measurement (3rd ed.). New York: American Council on Education/Macmillan, pp. 13–103.
Michell, J. (1986). Measurement scales and statistics: A clash of paradigms. Psychological Bulletin, 100, 398–407.
Michell, J. (1990). An Introduction to the Logic of Psychological Measurement. Hillsdale, NJ: Lawrence Erlbaum Associates, Inc.
Michell, J. (1993). The origins of the representational theory of measurement: Helmholtz, Hölder, and Russell. Studies in History and Philosophy of Science Part A, 24(2), 185–206. https://doi.org/10.1016/0039-3681(93)90045-L
Michell, J. (1997). Quantitative science and the definition of measurement in psychology. British Journal of Psychology, 88, 355–383.
Michell, J. (1999). Measurement in Psychology: Critical History of a Methodological Concept. New York: Cambridge University Press.
Michell, J. (2005). The logic of measurement: A realist overview. Measurement, 38(4), 285–294. https://doi.org/10.1016/j.measurement.2005.09.004
Michell, J. (2006). Psychophysics, intensive magnitudes, and the psychometricians' fallacy. Studies in History and Philosophy of Science Part C: Studies in History and Philosophy of Biological and Biomedical Sciences, 37(3), 414–432.
Michell, J. (2009). Invalidity in validity. In R. W. Lissitz (Ed.), The Concept of Validity: Revisions, New Directions and Applications. Charlotte, NC: Information Age Publishing, pp. 111–133.
Michell, J. (2012a). "The constantly recurring argument": Inferring quantity from order. Theory and Psychology, 22(3), 255–271.
Michell, J. (2012b). Alfred Binet and the concept of heterogeneous orders. Frontiers in Psychology, 3(261), 1–12. www.frontiersin.org/articles/10.3389/fpsyg.2012.00261/full
Michell, J. (2014). The Rasch paradox, conjoint measurement, and psychometrics: Response to Humphry and Sijtsma. Theory & Psychology, 24, 111–123. https://doi.org/10.1177/0959354313517524
Michell, J. (2020a). The fashionable scientific fraud: Collingwood's critique of psychometrics. History of the Human Sciences, 33(2), 3–21. https://doi.org/10.1177/0952695119872638


Michell, J. (2020b). Thorndike's Credo: Metaphysics in psychometrics. Theory & Psychology, 30(3), 309–328. https://doi.org/10.1177/0959354320916251
Michell, J., & Ernst, C. (1996). The axioms of quantity and the theory of measurement: Translated from Part I of Otto Hölder's German text 'Die Axiome der Quantität und die Lehre vom Mass'. Journal of Mathematical Psychology, 40, 235–252.
Michell, J., & Ernst, C. (1997). The axioms of quantity and the theory of measurement: Translated from Part II of Otto Hölder's German text 'Die Axiome der Quantität und die Lehre vom Mass'. Journal of Mathematical Psychology, 41, 345–356.
Miller, G. A. (1975). Stanley Smith Stevens, 1906–1973: A Biographical Memoir. Washington, DC: National Academy of Sciences.
Minton, H. L. (1988). Lewis M. Terman: Pioneer in Psychological Testing. New York: New York University Press.
Moskowitz, H. R., Scharf, B., & Stevens, J. C. (1974). Sensation and Measurement: Papers in Honor of S. S. Stevens. Netherlands: Springer. https://www.springer.com/gp/book/9789027704740
Müller, G. E. (1879). Über die Maßbestimmungen des Ortssinnes der Haut mittelst der Methode der richtigen und falschen Fälle [On measuring the spatial sense of the skin with the method of right and wrong cases]. Pflügers Archiv, 19, 191–235.
Niall, K. K. (1995). Conventions of measurement in psychophysics: Von Kries on the so-called psychophysical law. Spatial Vision, 9, 275–305.
Nicolas, S., & Sanitioso, R. B. (2012). Alfred Binet and experimental psychology at the Sorbonne laboratory. History of Psychology, 15(4), 328–363. https://doi.org/10.1037/a0028060
Nicolas, S., Coubart, A., & Lubart, T. (2014). The program of individual psychology (1895–1896) by Alfred Binet and Victor Henri. L'Année psychologique, 114, 5–60. https://doi.org/10.4074/S000350331400102
Nisbett, R. E., Aronson, J., Blair, C., Dickens, W., Flynn, J., Halpern, D. F., & Turkheimer, E. (2012). Intelligence: New findings and theoretical developments. The American Psychologist, 67(2), 130–159. https://doi.org/10.1037/a0026699
Norton, B. (1979). Charles Spearman and the general factor of intelligence: Genesis and interpretation in the light of sociopersonal considerations. Journal of the History of the Behavioral Sciences, 15(2), 142–154.
Novick, M. R. (1966). The axioms and principal results of classical test theory. Journal of Mathematical Psychology, 3, 1–18.
Otis, A. S. (1916). Some logical aspects of the Binet scale. The Psychological Review, 23(3), 165–179. https://doi.org/10.1037/h0073273
Otis, A. S. (1918). An absolute point scale for the group measurements of intelligence. Part 1. Journal of Educational Psychology, 9(5), 239–261. https://doi.org/10.1037/h0072885
Pearson, K. (1892). The Grammar of Science. London: W. Scott.
Pearson, K. (1904). On the laws of inheritance in man: II. On the inheritance of the mental and moral characters in man, and its comparison with the inheritance of the physical characters. Biometrika, 3(2/3), 131–190.
Pearson, K. (1907). Mathematical Contributions to the Theory of Evolution XVI: On Furthering Methods of Determining Correlation. Drapers' Company Research Memoirs, Biometric Series IV. Cambridge: Cambridge University Press.
Pearson, K. (1914). The Life, Letters and Labours of Francis Galton (Vol. 1). Cambridge: Cambridge University Press.
Pearson, K. (1924). The Life, Letters and Labours of Francis Galton (Vol. 2). Cambridge: Cambridge University Press.


Pearson, K. (1930). The Life, Letters and Labours of Francis Galton (Vols. 3a and 3b). Cambridge: Cambridge University Press.
Pedhazur, E. J., & Schmelkin, L. P. (1991). Measurement, Design, and Analysis: An Integrated Approach (1st ed.). Hillsdale, NJ: Lawrence Erlbaum Associates, pp. 15–29. ISBN 978-0-8058-1063-9.
Peirce, C. S., & Jastrow, J. (1885). On small differences of sensation. Memoirs of the National Academy of Sciences for 1884, 75–83. Retrieved July 25, 2018, from https://psychclassics.yorku.ca/Peirce/small-difs.htm
Piaget, J. (1926). The Language and Thought of the Child. New York: World Book.
Piaget, J. (1928). Judgment and Reasoning in the Child. New York: World.
Piaggio, H. T. H. (1933). Three sets of conditions necessary for the existence of a g that is real and unique except in sign. British Journal of Psychology, 24, 88–105.
Plateau, J. (1872). Sur la mesure des sensations physiques et sur la loi qui lie l'intensité de ces sensations à l'intensité de la cause excitante. Bulletin de l'Académie Royale des Sciences, des Lettres et des Beaux-Arts de Belgique, 2e série, 33, 376–388.
Pollack, R. H., & Brenner, M. W. (1969). The Experimental Psychology of Alfred Binet. New York: Springer.
Prytulak, L. S. (1975). Critique of S. S. Stevens' theory of measurement scale classification. Perceptual and Motor Skills, 41(1), 3–28. https://doi.org/10.2466/pms.1975.41.1.3
Rasch, G. (1960). Probabilistic Models for Some Intelligence and Attainment Tests. Chicago, IL: University of Chicago Press.
Reed, J. (1987). Robert M. Yerkes and the mental testing movement. In M. Sokal (Ed.), Psychological Testing and American Society, 1890–1930. New Brunswick: Rutgers University Press, pp. 75–94.
Reise, S. P. (2012). The rediscovery of bifactor measurement models. Multivariate Behavioral Research, 47(5), 667–696. https://doi.org/10.1080/00273171.2012.715555
Renwick, C. (2011). From political economy to sociology: Francis Galton and the social-scientific origins of eugenics. The British Journal for the History of Science, 44(3), 343–369. https://doi.org/10.1017/S0007087410001524
Rogers, A. L., & McIntyre, J. L. (1914). The measurement of intelligence in children by the Binet-Simon scale. British Journal of Psychology, 1904–1920, 7(3), 265–299. https://doi.org/10.1111/j.2044-8295.1914.tb00116.x
Samelson, F. (1977). World War I intelligence testing and the development of psychology. Journal of the History of the Behavioral Sciences, 13(3), 274–282.
Schmid, J., & Leiman, J. M. (1957). The development of hierarchical factor solutions. Psychometrika, 22, 53–61.
Series, C. (1997/98). And what became of the women? Mathematical Spectrum, 30, 49–52.
Sharp, S. (1899). Individual psychology: A study in psychological method. American Journal of Psychology, 10, 329–391.
Shepard, R. N. (1981). Psychological relations and psychophysical scales: On the status of "direct" psychophysical measurement. Journal of Mathematical Psychology, 24(1), 21–57. https://doi.org/10.1016/0022-2496(81)90034-1
Sherry, D. (2011). Thermoscopes, thermometers, and the foundations of measurement. Studies in History and Philosophy of Science Part A, 42(4), 509–524.
Siegler, R. S. (1992). The other Alfred Binet. Developmental Psychology, 28(2), 179–190. https://doi.org/10.1037/0012-1649.28.2.179
Sijtsma, K. (2012). Psychological measurement between physics and statistics. Theory & Psychology, 22, 786–809. https://doi.org/10.1177/0959354312454353


Sijtsma, K., & Emons, W. H. M. (2013). Separating models, ideas, and data to avoid a paradox: Rejoinder to Humphry. Theory & Psychology, 23, 786–796. https://doi.org/10.1177/0959354313503724
Slaney, K. (2017). Validating Psychological Constructs: Historical, Philosophical, and Practical Dimensions. Springer. https://link.springer.com/book/10.1057/978-1-137-38523-9
Sokal, M. M. (1987). Psychological Testing and American Society, 1890–1930. New Brunswick: Rutgers University Press.
Spearman, C. (1904a). Note on the First German Congress for experimental psychology. American Journal of Psychology, 15, 447–448.
Spearman, C. (1904b). Proof and measurement of association between two things. American Journal of Psychology, 15, 72–101.
Spearman, C. (1904c). 'General intelligence' objectively determined and measured. American Journal of Psychology, 15, 201–293.
Spearman, C. (1906). 'Footrule' for measuring correlation. British Journal of Psychology, 2, 89–108.
Spearman, C. (1907). Demonstration of formulae for true measurement of correlation. American Journal of Psychology, 18, 161–169.
Spearman, C. (1908). Method of 'right and wrong cases' (constant stimuli) without Gauss' formulae. British Journal of Psychology, 2, 227–242.
Spearman, C. (1910). Correlation calculated from faulty data. British Journal of Psychology, 3, 271–295.
Spearman, C. (1913). Correlations of sums and differences. British Journal of Psychology, 5, 417–426.
Spearman, C. (1914a). The heredity of abilities. Eugenics Review, 6, 219–237.
Spearman, C. (1914b). The theory of two factors. Psychological Review, 21, 101–115.
Spearman, C. (1916). Some comments on Mr. Thomson's paper. British Journal of Psychology, 8, 282–284.
Spearman, C. (1920). Manifold sub-theories of the 'two factors'. Psychological Review, 27, 159–172.
Spearman, C. (1922). Recent contributions to the theory of 'two factors'. British Journal of Psychology, 13, 26–30.
Spearman, C. (1923a). The Nature of 'Intelligence' and the Principles of Cognition. London & New York: Macmillan.
Spearman, C. (1923b). Further note on the 'theory of two factors'. British Journal of Psychology, 13, 266–270.
Spearman, C. (1924). A challenge still open. Journal of Educational Psychology, 15, 393.
Spearman, C. (1927a). The Abilities of Man: Their Nature and Measurement. London & New York: Macmillan.
Spearman, C. (1927b). Critical notice of 'The measurement of intelligence', by E. L. Thorndike, et al. British Journal of Psychology, 17, 365–369.
Spearman, C. (1928). Pearson's contribution to the theory of two factors. British Journal of Psychology, 19, 95–101.
Spearman, C. (1929a). The uniqueness of 'g'. Journal of Educational Psychology, 20, 212–216.
Spearman, C. (1929b). Response to T. Kelley. Journal of Educational Psychology, 20(8), 561–580.
Spearman, C. (1930). Autobiography. In C. Murchison (Ed.), A History of Psychology in Autobiography (Vol. 1). Worcester, MA: Clark University Press & London: Oxford University Press, pp. 229–333.


Spearman, C. (1931). Our need of some science in place of the word "intelligence". Journal of Educational Psychology, 22(6), 401–410.
Spearman, C. (1934). The factor theory and its troubles: IV. Uniqueness of G. Journal of Educational Psychology, 25(2), 142–153. https://doi.org/10.1037/h0074754
Spearman, C. (1937). Psychology Down the Ages, Vol. 1. New York: Macmillan.
Spearman, C. (1939). Thurstone's work re-worked. The Journal of Educational Psychology, 15(1), 1–16.
Spearman, C. (1940). Is ability random or organized? Journal of Educational Psychology, 30, 305–310.
Spearman, C. (1946). Theory of general factor. British Journal of Psychology, 36, 117–131.
Spearman, C., & Holzinger, K. (1924). The sampling error in the theory of two factors. British Journal of Psychology, 15, 17–20.
Spearman, C., & Holzinger, K. (1925). Note on the sampling error of tetrad differences. British Journal of Psychology, 16, 86–88.
Spearman, C., & Holzinger, K. (1930). The average value for the probable error of tetrad differences. British Journal of Psychology, 20, 368–370.
Spearman, C., & Jones, L. W. (1950). Human Ability: A Continuation of "The Abilities of Man". London: Macmillan.
Stadler, A. (1878). Über die Ableitung des psychophysischen Gesetzes. Philosophische Monatshefte, 14, 215–223.
Stahl, S. (2006). The evolution of the normal distribution. Mathematics Magazine, 79(2), 96–113.
Steiger, J. H. (1979). Factor indeterminacy in the 1930s and the 1970s: Some interesting parallels. Psychometrika, 44, 157–167.
Steiger, J. H., & Schönemann, P. (1978). A history of factor indeterminacy. In S. Shye (Ed.), Theory Construction and Data Analysis in the Behavioral Sciences. San Francisco: Jossey-Bass, pp. 136–178.
Stein, Z. (2016). Social Justice and Educational Measurement. New York: Routledge.
Stern, W. (1913). The Psychological Methods of Measuring Intelligence. Translated by G. M. Whipple. Warwick & York. https://doi.org/10.1037/11067-000
Sternberg, R. J. (1977). Component processes in analogical reasoning. Psychological Review, 84(4), 353–378. https://doi.org/10.1037/0033-295X.84.4.353
Stevens, J. C., Mack, J. D., & Stevens, S. S. (1960). Growth of sensation on seven continua as measured by force of handgrip. Journal of Experimental Psychology, 59(1), 60–67. https://doi.org/10.1037/h0040746
Stevens, S. S. (1935a). The operational basis of psychology. The American Journal of Psychology, 47(2), 323–330. https://doi.org/10.2307/1415841
Stevens, S. S. (1935b). The operational definition of psychological concepts. Psychological Review, 42(6), 517–527. https://doi.org/10.1037/h0056973
Stevens, S. S. (1936). A scale for the measurement of a psychological magnitude: Loudness. Psychological Review, 43(5), 405–416. https://doi.org/10.1037/h0058773
Stevens, S. S. (1946). On the theory of scales of measurement. Science, 103(2684), 677–680. https://doi.org/10.1126/science.103.2684.677
Stevens, S. S. (1951). Mathematics, measurement and psychophysics. In S. S. Stevens (Ed.), Handbook of Experimental Psychology. New York: Wiley, pp. 1–49.
Stevens, S. S. (1955a). On the averaging of data. Science, 121(3135), 113–116. https://doi.org/10.1126/science.121.3135.113
Stevens, S. S. (1955b). The measurement of loudness. The Journal of the Acoustical Society of America, 27(5), 815–829. https://doi.org/10.1121/1.1908048


Stevens, S. S. (1956). The direct estimation of sensory magnitudes: Loudness. The American Journal of Psychology, 69(1), 1–25. https://doi.org/10.2307/1418112
Stevens, S. S. (1957). On the psychophysical law. Psychological Review, 64(3), 153–181. https://doi.org/10.1037/h0046162
Stevens, S. S. (1958). Measurement and man. Science, 127(3295), 383–389. https://doi.org/10.1126/science.127.3295.383
Stevens, S. S. (1959). On the validity of the loudness scale. The Journal of the Acoustical Society of America, 31(7), 995–1003. https://doi.org/10.1121/1.1907827
Stevens, S. S. (1961). To honor Fechner and repeal his law. Science, 133(3446), 80–86. https://doi.org/10.1126/science.133.3446.80
Stevens, S. S. (1964). Concerning the psychophysical power law. Quarterly Journal of Experimental Psychology, 16(4), 383–385. https://doi.org/10.1080/17470216408416398
Stevens, S. S. (1966a). Matching functions between loudness and ten other continua. Perception & Psychophysics, 1(1), 5–8. https://doi.org/10.3758/BF03207813
Stevens, S. S. (1966b). A metric for the social consensus. Science, 151, 530–541. https://doi.org/10.1126/science.151.3710.530
Stevens, S. S. (1968). Measurement, statistics, and the schemapiric view. Science, 161(3844), 849–856. https://doi.org/10.1126/science.161.3844.849
Stevens, S. S. (1971). Issues in psychophysical measurement. Psychological Review, 78(5), 426–450. https://doi.org/10.1037/h0031324
Stevens, S. S. (1974). Notes for a life story. In H. R. Moskowitz et al. (Eds.), Sensation and Measurement. Dordrecht, Holland: D. Reidel Publishing Company, pp. 423–446.
Stevens, S. S. (1975). Psychophysics: Introduction to Its Perceptual, Neural, and Social Prospects. New York: Wiley. https://doi.org/10.2307/1421904
Stevens, S. S., & Davis, H. (1938). Hearing: Its Psychology and Physiology. New York: Wiley.
Stevens, S. S., & Galanter, E. H. (1957). Ratio scales and category scales for a dozen perceptual continua. Journal of Experimental Psychology, 54(6), 377–411. https://doi.org/10.1037/h0043680
Stigler, S. M. (1986). The History of Statistics: The Measurement of Uncertainty Before 1900. Cambridge, MA: Belknap Press of Harvard University Press.
Stigler, S. M. (1992). A historical view of statistical concepts in psychology and educational research. American Journal of Education, 101, 60–70.
Stigler, S. M. (1999). Statistics on the Table: The History of Statistical Concepts and Methods. Cambridge, MA: Harvard University Press.
Suppes, P., & Zinnes, J. (1963). Basic measurement theory. In R. D. Luce, R. R. Bush, & E. Galanter (Eds.), Handbook of Mathematical Psychology, Vol. 1. New York: Wiley, pp. 1–76.
Sweeney, G. (2001). "Fighting for the good cause": Reflections on Francis Galton's legacy to American hereditarian psychology. Transactions of the American Philosophical Society, 91, part 2.
Tal, E. (2020). Measurement in science. In E. N. Zalta (Ed.), The Stanford Encyclopedia of Philosophy. https://plato.stanford.edu/archives/fall2020/entries/measurement-science/
Teets, D., & Whitehead, K. (1999). The discovery of Ceres: How Gauss became famous. Mathematics Magazine, 72(2), 83–93.
Terman, L. M. (1911). The Binet-Simon scale for measuring intelligence: Impressions gained by its application on four hundred non-selected children. Psychological Clinic, 5(7), 199–206.
Terman, L. M. (1916). The Measurement of Intelligence: An Explanation of and a Complete Guide for the Use of the Stanford Revision and Extension of the Binet-Simon Intelligence Scale. New York: Houghton Mifflin Co.


Terman, L. M. (1919). The Intelligence of School Children: How Children Differ in Ability, the Use of Mental Tests in School Grading and the Proper Education of Exceptional Children. Boston: Houghton Mifflin & Company.
Terman, L. M. (1922a). The great conspiracy or the impulse imperious of intelligence testers, psychoanalyzed and exposed by Mr. Lippmann. New Republic, 33(December 27, 1922), 116–120.
Terman, L. M. (1922b). The psychological determinist; or democracy and the I.Q. The Journal of Educational Research, 6(1), 57–62.
Terman, L. M., & Childs, H. G. (1912). A tentative revision and extension of the Binet-Simon measuring scale of intelligence. Journal of Educational Psychology, 3(2), 61–74. https://doi.org/10.1037/h0075624
Thomson, G. H. (1916). A hierarchy without a general factor. British Journal of Psychology, 8, 271–281.
Thomson, G. H. (1919a). On the cause of hierarchical order among correlation coefficients. Proceedings of the Royal Society, Series A, 95, 400–408.
Thomson, G. H. (1919b). The proof or disproof of the existence of general ability. British Journal of Psychology, 9, 321–336.
Thomson, G. H. (1919c). The hierarchy of abilities. British Journal of Psychology, 9, 337–344.
Thomson, G. H. (1920a). The general factor fallacy in psychology. British Journal of Psychology, 10, 319–326.
Thomson, G. H. (1920b). General versus group factors in mental activities. Psychological Review, 27, 173–190.
Thomson, G. H. (1924). The nature of general intelligence and ability (I). British Journal of Psychology, 14, 229–235.
Thomson, G. H. (1927). The tetrad difference criterion. British Journal of Psychology, 17, 235–255.
Thomson, G. H. (1934). On measuring g and s by tests which break the g-hierarchy. British Journal of Psychology, 25, 204–210.
Thomson, G. H. (1935). The definition and measurement of g (General Intelligence). Journal of Educational Psychology, 26, 241–262.
Thomson, G. H. (1947). Charles Spearman, 1863–1945. Obituary Notices of Fellows of the Royal Society, 5(15), 373–385. https://doi.org/10.1098/rsbm.1947.0006
Thomson, G. H. (1951). The Factorial Analysis of Human Ability (5th ed.). London: University of London Press.
Thomson, W. [Lord Kelvin] (1889). Popular Lectures and Addresses. London: Macmillan.
Thorndike, E. L. (1910). The measurement of the quality of handwriting. In E. L. Thorndike, Handwriting. Teachers College Record, 11, 86–151. Retrieved from https://brocku.ca/MeadProject/Thorndike/1910/Thorndike_1910_1.html
Thorndike, E. L. (1911). A scale for measuring the merit of English writing. Science, 33(859), 935–938.
Thorndike, E. L. (1912). The measurement of educational products. The School Review, 20(5), 289–299.
Thorndike, E. L. (1913). The Measurement of Achievement in Drawing. New York: Teachers College, Columbia University.
Thorndike, E. L. (1914). The significance of the Binet mental ages. The Psychological Clinic, 8(7), 185–189.
Thorndike, E. L. (1916). The significance of the Binet-Simon tests. The Psychological Clinic, 10(5), 121–123.


Thorndike, E. L. (1918). The nature, purposes, and general methods of measurements of educational products. In G. M. Whipple (Ed.), The Seventeenth Yearbook of the National Society for the Study of Education, Part II: The Measurement of Educational Products. Bloomington, IL: Public School Publishing Co.
Thorndike, E. L. (1921). On the organization of intellect. Psychological Review, 28(2), 141–151. https://doi.org/10.1037/h0070821
Thorndike, E. L. (1924). A challenge still open: Reply. Journal of Educational Psychology, 15, 394–395. https://doi.org/10.1037/h0064214
Thorndike, E. L. (1945). Charles Edward Spearman: 1863–1945. The American Journal of Psychology, 58(4), 558–560.
Thorndike, E. L., Woodyard, E., Cobb, M., & Bregman, E. O. (1927). The Measurement of Intelligence. New York: Bureau of Publications, Teachers College, Columbia University.
Thurstone, L. L. (1925). A method of scaling psychological and educational tests. Journal of Educational Psychology, 16, 433–451.
Thurstone, L. L. (1926). The scoring of individual performance. Journal of Educational Psychology, 17(7), 446–457. https://doi.org/10.1037/h0075125
Thurstone, L. L. (1927a). Psychophysical analysis. American Journal of Psychology, 38, 368–389.
Thurstone, L. L. (1927b). A law of comparative judgment. Psychological Review, 34, 273–286.
Thurstone, L. L. (1927c). A mental unit of measurement. Psychological Review, 34, 415–423.
Thurstone, L. L. (1927d). Equally often noticed differences. Journal of Educational Psychology, 18, 289–293.
Thurstone, L. L. (1927e). Three psychophysical laws. Psychological Review, 34, 424–432.
Thurstone, L. L. (1927f). The method of paired comparisons for social values. The Journal of Abnormal and Social Psychology, 21, 384–400.
Thurstone, L. L. (1928a). Attitudes can be measured. American Journal of Sociology, 33(4), 529–554. https://doi.org/10.1086/214483
Thurstone, L. L. (1928b). The Phi-Gamma Hypothesis. Journal of Experimental Psychology, 11, 293–305.
Thurstone, L. L. (1929). Fechner's law and the method of equal-appearing intervals. Journal of Experimental Psychology, 12, 214–224.
Thurstone, L. L. (1931a). The Reliability and Validity of Tests. Chicago: The University of Chicago.
Thurstone, L. L. (1931b). Influence of motion pictures on children's attitudes. Journal of Social Psychology, 3, 291–305.
Thurstone, L. L. (1931c). Rank order as a psychophysical method. Journal of Experimental Psychology, 14, 187–201.
Thurstone, L. L. (1931d). The indifference function. The Journal of Social Psychology, 2(2), 139–167. https://doi.org/10.1080/00224545.1931.9918964
Thurstone, L. L. (1932). Stimulus dispersions in the method of constant stimuli. Journal of Experimental Psychology, 15, 284–297.
Thurstone, L. L. (1934). The vectors of mind. Psychological Review, 41, 1–32.
Thurstone, L. L. (1935). The Vectors of Mind. Chicago: University of Chicago Press.
Thurstone, L. L. (1938). Primary Mental Abilities. Chicago: University of Chicago Press.
Thurstone, L. L. (1940). Current issues in factor analysis. Psychological Bulletin, 37, 189–236.
Thurstone, L. L. (1945). The prediction of choice. Psychometrika, 10, 237–253.
Thurstone, L. L. (1946). Theories of intelligence. The Scientific Monthly, 62(2), 101–112.


Thurstone, L. L. (1947). Multiple-Factor Analysis: A Development and Expansion of The Vectors of Mind. Chicago: University of Chicago Press.
Thurstone, L. L. (1952). L. L. Thurstone. In G. Lindzey (Ed.), A History of Psychology in Autobiography (Vol. VI). Englewood Cliffs, NJ: Prentice Hall, pp. 294–321.
Thurstone, L. L. (1959). The Measurement of Values. Chicago: University of Chicago Press.
Titchener, E. B. (1905). Experimental Psychology: A Manual of Laboratory Practice. London: MacMillan.
Torgerson, W. S. (1958). Theory and Methods of Scaling. New York: Wiley.
Treisman, M. (1964). Sensory scaling and the psychophysical law. Quarterly Journal of Experimental Psychology, 16(1), 11–22. https://doi.org/10.1080/17470216408416341
Tryon, R. C. (1957). Reliability and behavior domain validity: Reformulation and historical critique. Psychological Bulletin, 54(3), 229–249.
Velleman, P. F., & Wilkinson, L. (1993). Nominal, ordinal, interval, and ratio typologies are misleading. The American Statistician, 47(1), 65–72. https://doi.org/10.1080/00031305.1993.10475938
Ward, L. M. (2017). S. S. Stevens's invariant legacy: Scale types and the power law. The American Journal of Psychology, 130(4), 401–412. https://doi.org/10.5406/amerjpsyc.130.4.0401
Warren, R. M., & Warren, R. P. (1963). A critique of S. S. Stevens' "New Psychophysics". Perceptual and Motor Skills, 16(3), 797–810. https://doi.org/10.2466/pms.1963.16.3.797
Wheeler, R. (1924). Book review: The nature of "intelligence" and the principles of cognition. The Journal of Philosophy, 21(11), 294–301. https://doi.org/10.2307/2014797
Wijsen, L. D., Borsboom, D., Cabaço, T., & Heiser, W. J. (2019). An academic genealogy of psychometric society presidents. Psychometrika, 84(2), 562–588. https://doi.org/10.1007/s11336-018-09651-4
Williams, V. S., Pommerich, M., & Thissen, D. (1998). A comparison of developmental scales based on Thurstone methods and item response theory. Journal of Educational Measurement, 35, 93–107.
Wilson, E. B. (1928a). Review of The Abilities of Man by C. Spearman. Science, 67, 244–248.
Wilson, E. B. (1928b). On hierarchical correlation systems. Proceedings of the National Academy of Sciences, 14, 283–291.
Wilson, E. B. (1929). Comment on Professor Spearman's note. Journal of Educational Psychology, 20(3), 217–223. https://doi.org/10.1037/h0071925
Wilson, E. B. (1933a). On the invariance of general intelligence. Proceedings of the National Academy of Sciences, 19, 768–772.
Wilson, E. B. (1933b). Transformations preserving the tetrad equations. Proceedings of the National Academy of Sciences, 19, 882–884.
Wilson, E. B. (1933c). On overlap. Proceedings of the National Academy of Sciences, 19, 1039–1044.
Wissler, C. (1901). The Correlation of Mental and Physical Tests. Doctoral dissertation, Columbia University.
Wolf, T. H. (1973). Alfred Binet. Chicago: University of Chicago Press.
Wolfle, D. (1956). Louis Leon Thurstone: 1887–1955. The American Journal of Psychology, 69(1), 131–134.
Wright, B. D. (1997). A history of social science measurement. Educational Measurement: Issues and Practice, 45(4), 51–71.
Yen, W. M. (1986). The choice of scale for educational measurement: An IRT perspective. Journal of Educational Measurement, 23, 299–325.


Yerkes, R. M. (1917). The Binet versus the point scale method of measuring intelligence. Journal of Applied Psychology, 1, 111–122.
Yerkes, R. M. (1919). Report of the psychology committee of the National Research Council. The Psychological Review, 26(2), 83–149.
Yerkes, R. M. (1921). Psychological examining in the United States Army. Memoirs of the National Academy of Sciences, 15, 890.
Yule, G. U. (1896). On the correlation of total pauperism with proportion of out-relief. The Economic Journal (London), 5(20), 603–611. https://doi.org/10.2307/2957204
Yung, Y. F., Thissen, D., & McLeod, L. D. (1999). On the relationship between the higher-order factor model and the hierarchical factor model. Psychometrika, 64, 113–128.
Zand Scholten, A., & Borsboom, D. (2009). A reanalysis of Lord's statistical treatment of football numbers. Journal of Mathematical Psychology, 53(2), 69–75. https://doi.org/10.1016/J.JMP.2009.01.002
Zenderland, L. (1987). The debate over diagnosis: Henry Herbert Goddard and the medical acceptance of intelligence testing. In M. Sokal (Ed.), Psychological Testing and American Society, 1890–1930. New Brunswick: Rutgers University Press.

INDEX

Note: Page numbers in italic indicate a figure and page numbers in bold indicate a table on the corresponding page.

1905 Binet-Simon intelligence scale 148–155, 150, 154
1908 Binet-Simon revised intelligence scale 155–162, 157, 160
1911 Binet-Simon revised intelligence scale 155–162, 157, 158–159

absolute measurement 92–93
absolute scaling 263–264
additivity 6–7, 51, 298–299, 309
angle judgment test 104, 105–106
anthropometric laboratory 65, 87, 90, 95–98, 100–106, 102
anthropometry, definition 100–101
apprehension of experience 213
Army Alpha 136, 137, 171, 218
Army Beta 136, 137, 171, 218
associationism 72, 181
associative processes 110–111
attenuation formula 187, 189, 191
attitude measurement 280–283, 286–288
attributes 53; definition 5; extensive 6–7; intensive 7; psychological 35–37, 39–50, 93–94

Binet, Alfred 17–18, 135–176, 292; 1905 Binet-Simon intelligence scale 148–155, 150, 154; 1908 Binet-Simon revised intelligence scale 155–162, 157, 160; 1911 Binet-Simon revised intelligence scale 155–162, 157, 158–159; background 139, 139–144; Binet-Simon measuring scale 139, 144–162, 154, 155, 157, 158–159, 168–170; conceptualization of measurement 164–167; criticisms of 168–170; education's role in intelligence 162–164; interest in child development 143–144; legacy 170–173
Binet-Simon intelligence scale 139, 144–162, 154, 155; 1908 revision 155–162, 157, 160; 1911 revision 155–162, 157, 158–159; criticisms of 168–170
binomial probability distribution 21–23, 22, 75–76
Birkhoff, G. D. 307
bivariate normal distribution 120
Boring, E. G. 297
Brown, William 194–199, 244, 250
Buck, Carrie 124–126
Buck v. Bell 125–126
Burt, Cyril 59, 250, 257n22

Campbell, Norman 295, 309; representational approach to measurement 298–304, 305
category scale 319, 319–320


Cattell, James McKeen 16, 58, 65, 100, 132n3, 135, 144–145, 164, 173, 181, 182
causal theory 53
central limit theorem 23–24, 87
Charcot, Jean-Martin 142–143
classical test theory 199–201
cognition, Spearman's model of human 211–216, 213
color sense test 104, 105
comparatal dispersion 271
comparative judgment: Thurstone's law of comparative judgment 270–278, 290n2
Constant Method 41–50, 43, 44
construct 12, 25n5, 252
correlation: discovery of 111–112; measurement and 205–226
correlation coefficients 182–187, 184, 185
correlation matrix 209–211, 210
cross-modality matching 317–319, 325

Darwin, Charles 68; impact on Francis Galton 71–73
Darwin, Erasmus 67–68, 128
Darwin, Robert 68
Delboeuf, Joseph 142–143
derived measurement 298–303
diagnostic classification 138
Dickson, J. Hamilton 121
direct magnitude estimation 294, 295
disattenuation 182–187, 193–194
discriminability scale 320
discriminal dispersion 273, 278, 284–285
discriminal processes (Thurstone's) 265–270, 267

Edgeworth, Francis 176n15, 203n15
educational measurement 94, 136, 137, 259; definition 12–14
education's role in intelligence 162–164
error curves 24
eugenics 17, 65, 78–79, 124–130, 138, 220; positive 127
experimentation 10
extensive attributes 6–7

factor analysis 262
Fawcett, Philippa 78
Fechner, Gustav 17, 27, 33, 63, 266, 292, 303; background 32–35; conceptualization of measurement 35–37; legacy of 56–59; on meaning of measurement 32; measurement formula 39–41, 52, 55–56

Fechner's Law 39–41, 50, 316–317; criticisms of 50–56; quantity objection to 51–55
Féré, Charles 142
Ferguson Committee 295, 303–304
fractionation 313
frequency distribution 21
fundamental measurement 298–303, 309

g 293 see also mental energy; indeterminacy of 235–239, 236, 239, 245–247; interpretation by Spearman 216–218, 217, 220–223, 248
Galton, Francis 17, 23, 67, 135, 292; anthropometric laboratories 100–106, 102; associative processes 110–111; background 65–73; conceptualization of measurement 89–94; conceptualization of relative measurement 78–98; correlation 121–124; described as polymath 65–66; discovery of regression and correlation 111–124; eugenics 126–130; familial height data 118, 117–121, 119; height tabulation by forearm length 121–123, 122; human heredity and 71–73; human intelligence measurement 107–111; influenced by Adolphe Quetelet 73–74; influenced by Charles Darwin 71–73; instrumental innovations 100–134; law of regression 112, 114–117, 116; legacy 130–131; measurement of individual differences and 63–94; mental imagery and visualization 108–110, 109; quincunx 74, 74–76, 88; regression to the mean 117–121, 119, 120; statistical method for heredity 113–117; statistical scale for intercomparisons 85–91, 95–98; test of angle judgment 104, 105–106; test of color sense 104, 105; test of length judgment 104, 105; test of squareness 105–106; test of visual acuity 104, 105; Tripos and 76–83, 83
Galton, Samuel 67
Galton Board 74–76
Garner, W. R. 321
Garnett, Maxwell 244
Gauss, Karl Friedrich 23–24, 26n8
Gaussian distribution 23
Goddard, Henry 146, 170–171
group factors 230, 241
Guilford, J. P. 239, 260
Gulliksen, Harold 178, 271, 288


Hart, Bernard 230
height: heritability of 87–89; tabulation by forearm length 121–123, 122
Henri, Victor 144–145
hereditary genius 78–85, 83
heredity 71–73, 111–112
Holzinger, Karl 211, 244
homogeneity 9–10, 51–52
Hopkins, William 78
human intelligence measurement 107–111

individual differences 85–89; in intelligence 144–162; measurement of 63–94
intelligence: Binet's conceptualization of measurement 164–167; Binet-Simon intelligence scales 139, 144–162, 154, 155, 157, 158–159, 160; Edwin Wilson's indeterminacy of g 235–239, 236, 239, 245–247; Godfrey Thomson's sampling theory of ability 229–235, 231–233, 254–255, 255; Louis Thurstone's multiple-factor method 236–243; role of education 162–164; theory vs. method in measuring 227–253
intelligence quotient see IQ
intelligence testing 17–18, 136–138; scaling 263–264
intensive attributes 7
intercomparisons 85–91, 95–98
intergenerational associations 79
intergenerational variability 113
International System of Units (SI) 29, 30, 31
International Vocabulary of Measurement 4
interval scale 57, 306, 307, 308, 309–310, 313, 314, 317, 330, 333n9
invariance 10, 242, 243, 257, 285–286, 288, 295, 306, 318, 326, 351
IQ 138, 223

James, William 51
jnds see just noticeable differences
judgment: Thurstone's law of comparative judgment 270–278, 290n2
just noticeable differences 38–39, 41, 44, 47–51, 55–56, 103, 266, 313

Kelley, Truman 201, 223, 258n24

Laplace, Pierre 23, 24
latent attribute 25n5


latent trait 12
latent variable 207
law of comparative judgment 270–273, 290n2
law of conation 215
law of deviations 23
law of errors 23, 43–50, 74
law of fatigue 215
law of mental energy 215
law of primordial potencies 215
law of retentiveness 215
length judgment test 104, 105
Likert, Rensis 286–288; Likert scales 287–288
Locke, John 72
logarithmic interval scale 317
Lord Kelvin see Thomson, William

Mach, Ernst 54–55
magnitude, definition 5–6
magnitude estimation 313–320, 315, 322–324
manifest variable 207
Mathematics Tripos see Tripos
Maxwell, James Clerk 30, 31, 32, 37
measurability hypothesis 14
measurement: absolute 92–93; attitude 286–288; Binet's conceptualization of 164–167; Campbell's representational approach 298–305; conceptualization by Galton 89–94; conceptualization by Gustav Fechner 35–37; definition 3–5, 12–13, 304–305, 329–330; educational 94, 136, 137, 259; fundamental 299–302, 309; human intelligence 107–111; meaning of 32; process of operational 313–320, 314; relative 17, 78–93; of sensation 303–304; Spearman's conceptualization of 221–225; Stevens' conceptualization of 304–312; Stevens' theory of scales of 294–295, 304–312; subjective unit of 274; terminology 5–10; Thorndike's credo on 2–3, 14; through correlation 205–226; Thurstone's conception of 283–286
measurement error 177–204
measurement formula 39–41, 52, 55–56
measurement of individual differences 63–94
measurement scales 135–176; Stevens' taxonomy 306, 306–310, 308, 321, 331
measurement units 28–30; subjective 283–285


mental age scale 263
mental disability, diagnosing 146–147
mental energy 216–218, 217 see also g
mental imagery measured by Francis Galton 108–110, 109
mental orthopedics 163
mental testing 219–220
mental tests 135–176
Mersenne, Marin 28
Method of Constant Stimulus see Constant Method
method of equal appearing intervals 273–274, 279–283, 282
method of magnitude estimation 313–320, 315, 322–324
method of right and wrong cases 41–50
metric system 29
metrology 28; definition 4, 8–9; evolution of 29
Michell, Joel 3, 4, 57, 130–131, 285, 295, 303–304, 326–329
Mill, John Stuart 72
Modern Ideas About Children (Binet) 162–165, 174
Mohs scale 8–9
Moray House Tests 229
multiple-factor theory of ability (Thurstone's) 239–243, 240, 247–248
multitude 5

National Intelligence Tests 137
Natural Inheritance (Galton) 64, 74, 85, 87–88, 92, 106
natural selection 71–73
noegenetic human cognition model 211–216, 213
nominal scale 306, 306–307, 308
normal distribution 23, 24, 58, 64, 73–74, 81, 85, 106; in hereditary genius 78–85, 83
numerals 4–5, 19, 292, 294, 299–303, 305

objective criteria of irrelevance 281
ogive 85, 96, 97
On the Origin of Species (Darwin) 71–72
operationalism 19, 218, 310–312, 327
operational measurement process (Stevens) 313–320, 314
ordinal scale 306, 306–307, 308
Oresme, Nicole 27

Pearson, Karl 82, 193, 259
percentile 85

permissible statistics 19
physical magnitude 45, 45–46
Plateau-Delboeuf method of bisection 57
point scale method 263
positive eugenics 127
power law (Stevens') 316–317, 325
probable error 24–25, 26n10, 123
psychological attributes: heritability of 77–85; measurement of 35–37, 39–50, 93–94, 292–294
psychological scaling 110
psychometricians 12, 19
psychometrics 15, 19, 58, 70, 131, 198; seeds of 259–289
psychometry 64–65
psychophysics 17, 103, 265–266, 303–304, 313; definition 37; Fechner's conceptualization of measurement 35–37; Fechner's Law 39–55; measurement 27–62; measurement formula 39–41, 52, 55–56; method of right and wrong cases 41–50; origins of 32–41; Weber's law 37–39

quantification 2, 4, 167, 172
quantity 5, 8–9
quantity objection 14, 285, 292–293, 326–327; to Fechner's Law 51–55
Quetelet, Adolphe 73–74, 81
quincunx 74, 74–76, 88, 113, 113–117, 116

ratio scale 306, 307, 308, 309, 314, 320, 321
realism 327
reference, role of 8–9
regression, discovery of 111–112
regression coefficient 113, 115
regression effect 121
regression to the mean 64, 117–121, 119, 120
relative measurement 17, 78–98
reliability 18; reliability coefficient 178, 196–199, 197, 209
replication 187–190
representationalism 19
reversion 112, 114–117, 116
Routh, Edward John 78

sampling theory of ability (Thomson's) 229–235, 231–233, 244–245, 254–255, 255
scaling 110, 263–264; absolute 263–264; psychological 110; Thurstonian 263


Scott, Charlotte 78
sensation, measurement of 303–304
sensation intensity 45, 45–50, 46
SI see International System of Units (SI)
Simon, Théodore 138, 148
single judgment experiment 42
social physics 73–74
sone scale for loudness 298, 315–316, 321–322
Spearman, Charles 18, 169, 178–179, 292; background 179, 179–182; band-shooting example 191, 191–193; challenges to theory of two factors 227–253, 228; conceptualization of measurement 221–225; correlation matrix 209–211, 210; correlations between discrimination and intelligence 183, 183–187, 184, 185; depiction of mental energy 216–218, 217; development of classical test theory 199–201; disattenuation 182–187, 193–194, 200; interpretation of g 216–218, 217, 220; legacy 248–252; measurement error and 187–193; model of noegenetic human cognition 211–216, 213; replications 187–189; responses to Edwin Wilson 245–247; responses to Godfrey Thomson 244–245; responses to Louis Thurstone 247–248; Spearman-Brown prophecy formula 196–199, 197; theory of two factors 205–226, 206, 293; Thomson's sampling theory of ability and 229–235; Thurstone's multiple-factor method and 239–243; Wilson's indeterminacy of g and 235–239, 236
Spearman-Brown prophecy formula 196–199, 197
Spearman's theory of two factors 205–226, 206, 293; challenges to 227–253, 228; corroboration of 208–211; formalization of 205–208; utility of 218–221
specific factors 207–208, 216, 219, 226–232, 231
standard deviation 24
standardization 13, 226n13
standard units of measurement 28
Stanford-Binet test 138
statistical scale for intercomparisons 85–91, 95–98, 108–109, 109
sterilization legislation 124–127, 125
Stevens, Stanley S. 19; background 295–298, 296; conceptualization of measurement 304–312; criticisms of 321–329; cross-modality matching 317–319, 325; legacy 329–332; method of magnitude estimation 313–320, 315, 322–324; on operationalism 310–312; power law 316–317, 325; pragmatism of 319–320; process of operational measurement 313–320, 314; role of argument 319–320; taxonomy of 306, 306–310, 308, 321, 331; theory of scales of measurement 294–295, 304–312
Street of Chance experiment 273–278, 275–276, 278, 279
subjective measurement units 174, 283–285
sweet pea experiments 114–117

Tannery, Jules 51, 129
taxonomy for measurement scales, Stevens' 306, 306–310, 308, 321, 331
Terman, Lewis 137, 162, 171–172
test of angle judgment 104, 105–106
test of length judgment 104, 105
test of squareness 105–106
test validity 25n6
theory of associationism 72
theory of two factors see Spearman's theory of two factors
thermometry 31, 91–92, 108
Thomson, Godfrey 229; sampling theory of ability 229–235, 231–233, 244–245, 254–255, 255; Spearman responding to 244–245
Thomson, William 30, 31
Thorndike, Edward 168–170, 181, 224; credo on measurement 2–3, 14
Thorndike's credo 2–3, 14
Thurstone, Louis 18–19, 170, 229, 259, 293; background 260–265, 261; conception of measurement 283–286; constructing psychological continuum 273–283; discriminal processes 265–270, 267; law of comparative judgment 270–278, 290n2; legacy 288–289; Likert scales 286–288; method of equal appearing intervals 273–274, 279–283, 282; multiple-factor method 239–243, 240, 247–248; role of invariance 285–286; Spearman responding to 247–248; Street of Chance experiment 273–278, 275–276, 278, 279
Thurstone, Thelma Gwinn 264–265
Thurstonian scaling 263


time, measurement of 28–29
Titchener, Edward 58
Tripos 69, 80, 76–83, 83, 92, 93, 107

uncertainty 10
units 28–29, 30, 31, 259–261, 274, 283–285

Virginia Sterilization Act 124–127, 125
visual acuity tests 104, 105
visualization measured by Francis Galton 108–110, 109

Weber, Ernst 34, 37–39
Weber's law 37–39, 49–50, 55
Wilson, Edwin 229; indeterminacy of g 235–239, 236, 239, 245–247; Spearman responding to 245–247
Wissler, Clark 183
Wrangler 76–81
Wundt, Wilhelm 58

Yerkes, Robert 136–137
Yule, Udny 124, 187–189