METHODOLOGY

PSYCHOLOGICAL STUDIES

Editorial Committee

JOHAN T. BARENDREGT / HANS C. BRENGELMANN / GÖSTA EKMAN
SIPKE D. FOKKEMA / NICO H. FRIJDA / JOHN P. VAN DE GEER
ADRIAAN D. DE GROOT / MAUK MULDER / RONALD WILLIAM RAMSAY
SIES WIEGERSMA

MOUTON - THE HAGUE - PARIS
METHODOLOGY
Foundations of inference and research in the behavioral sciences

by

ADRIAAN D. DE GROOT
Professor of Applied Psychology and Methodology
University of Amsterdam

MOUTON - THE HAGUE - PARIS

Translated from the Dutch by J.A.A. Spiekerman
Original title: Methodologie, 1968 (4th edition)
Library of Congress Catalog Card Number: 72-99016
© MOUTON & CO, 1969
No part of this book may be translated or reproduced in any form by print, photoprint, microfilm, or any other means, without written permission from the publisher.
PRINTED IN BELGIUM
PREFACE
This book is a revised translation of a Dutch text which was first published in 1961 and is now in its fourth edition. In the Netherlands and the Flemish part of Belgium it has been used by students as a textbook, and by researchers as a reference book and source of ideas. Psychologists, psychiatrists, sociologists, educational psychologists and political scientists have been its main users. The present, revised, English translation was prepared in the years 1966-1968, under the supervision of the author. It is primarily meant to serve the same audience, that is, students and researchers in the broad area of the behavioral and social sciences — from biology and physiology to, and including, history.

In what respects does this Methodology differ from the many others that have been published?

First, the main methodological issues are treated in a general, nonparochial, and non-technical way, so as to be useful to a wide range of research workers — and research sponsors. The author's basic conviction is that methodology is a concern common to all scientists and scholars, and must be treated as such. Accordingly, the text discusses problems rather than field-specific solutions, and principles rather than techniques. The contents of the book can be said to border on philosophy of science, if not, in large part, to be just that. In any case — regardless of labels — the reader is primarily invited to think through basic issues with the author — issues which may arise in any empirical science.

Second, in its presentation and selection of examples, the book focuses on applied and border-line problems rather than on pure science. As one reviewer of the Dutch edition (FOKKEMA 1964) remarked, the experimental psychologist, for instance, will find that little space has been devoted to his favorite methodological conceptualizations. Apart from reflecting the author's own prior preoccupations and present interests, the emphasis
on applied problems is thought to have some major advantages. First of all, one should not offer methodology to those who have had too much of it (KOCH 1959), but rather to those who have too little. Moreover, in applied areas the researcher cannot hide easily in technicalities and jargon; his interactions with sponsors and other agents of society tend to promote open discussions and straightforward formulations. The fact that, in applied research, the cost of methodological refinement must somehow be weighed openly against the social importance of expected outcomes is bound to foster methodological soul-searching and self-criticism. Finally, the analysis of problems in applied research often affords an opportunity to present methodological issues naked, as it were: without the impressive clothing of advanced theories and techniques. For these reasons, examples have been taken from such fields as psychosomatic medicine, the evaluation of educational (or therapeutic) effects, psychoanalytic (and historical) interpretation, and the like.

Third, while the author was strongly influenced by, and is heavily indebted to, the huge stream of research and methodology from the United States of America, this remains a European book. For one thing, it originated in a climate which has never been as thoroughly influenced by behaviorism as has the American scene. In the European climate, 'mind' and 'thinking,' for instance, have never been dirty words. European conceptions of scientific method, generally — for all their obvious weaknesses in past research practice — have never suffered from overrestrictiveness. Consequently, it should be easier for a European author to be non-partisan — and more difficult to be strict. This is primarily a non-partisan methodology, it is hoped; its relative lack of strictness is intentional and should be an asset. The reader who wishes to adhere to a stricter canon of research behavior can add his own specific rules to those he finds in the present volume. It is somewhat questionable, it is true, whether all this is rightly called 'European'; a better label might be 'middle-of-the-road' — but the reader had better find this out for himself and see whether or not he likes it.

Fourth, this presentation of scientific methodology is based on a particular conception of the scientific enterprise. Science is conceived not so much as a system of concepts, nor as a system of statements, but rather as a process, or, a system of activities. The structure of the process can be described in general terms and its underlying principles analyzed. The empirical cycle, as followed by a typical hypothesis-testing
investigation, is in fact the basis of the orderly description of scientific activities which the book attempts to offer. The value of these activities — the research goals and methods chosen, as well as the outcomes found — is ultimately determined within the social process of science itself. The final judgment on theories, methods, and techniques evolves from the democratic interaction of scientists among themselves (and with their audience: society); it can be said to rest in the hands of the Forum of first-class scholars in the field in question, now and later — an international Forum, ultimately, 'the Forum of history.'

Fifth, the prescriptive part of the present methodology is based on no other principles than those implied in the above conception of science as a social process. No presupposed epistemological 'theory of science' is needed; all necessary do's and don'ts derive from the consideration that the social process — given its goal — must 'work.' Objectivity, logic, clarity, and intellectual honesty, for instance, are needed in order to prevent misunderstandings in the exchange of information; and the Forum cannot pass judgment on vague, or inconsistent, or otherwise non-falsifiable, assertions. A methodology based solely on this principle is necessarily undogmatic; as in democracy at large, there is much freedom in the scientist's choice of goals and methods. On the other hand, the scientific process does require that the expressed intentions of the investigator be made explicit. If it were necessary to summarize the prescriptive part of the book in a set of rules, rule number one would have to be a rationally conceived γνῶθι σεαυτόν for the scientist: 'continually try to know and specify what you are really after.'

A few words about the structure of the book may be helpful. Sections 1; 1 and 1; 2 present an introduction to the subject matter in general, and to the basic concept of the empirical cycle in particular. The main purpose of this part is to show that the typical cycle of scientific inquiry, observe-guess-predict-check, is a specific instance of the cyclical course of any experiential process; it is not some invention, or arbitrary convention, of scientists among themselves. Though certainly worth writing and, hopefully, worth reading, such a demonstration is not strictly necessary for the further development of the argument.

From section 1; 3 on, Chapters 1 through 5 present a complete first round of the scientific process in a hypothesis testing investigation. If the
book is to be studied in parts, these first five chapters represent Part 1; they belong together in the exposition of the argument. In Chapters 6 and 7 the concept of scientific objectivity is implemented. These Chapters treat fundamental, but largely well-known, issues; apart from a few highly abstract sections, they should read easily. Chapter 8, on the contrary, is a difficult part of the book. Except for psychologists at the graduate and doctoral levels, who are likely to recognize in this treatment of constructs and variables a generalized theory of test construction and test scores, this Chapter should prove both hard and fundamental for social scientists from other fields. If Chapter 8 is thus a relatively independent part of the book, so, for other reasons, is the final Chapter. In loosening the ties with the prototype of hypothesis testing, the author shows in Chapter 9, more explicitly than in any of the preceding parts, where he stands on matters methodological. The treatment of such topics as descriptive investigations, the methodology of interpretation and of simulation, the use of mathematical models, unity versus duality in science, etc., occasions a number of attacks on current misconceptions and one-sided or unnecessarily restrictive fashions. The argument as a whole is in favor of the ideal of unity of science. Chapter 9 should appeal particularly to those who have an interest in philosophy of science — but may heaven prevent instructors from omitting this chapter when they use the book as a course text.

Finally, a few words are in order on the English text as compared to the Dutch. The present text is a revised translation. The majority of the revisions are in the nature of adjustments, improved formulations, and a somewhat diminished redundancy. A few passages which referred specifically to problems or situations in the Netherlands have been either rephrased or omitted. In general, however, the line of the argument and the corresponding structure of the book have been strictly maintained.

References in the text to the literature have not been updated systematically. In many cases, updating would have amounted to a fresh literature search for more recent examples — which are not likely to be better than the older ones used. In general, the principles of methodology are not expected to become antiquated within a few years. Moreover, many references in the text serve the purpose of pointing to the existence of pertinent evidence or literature rather than to induce the reader to rush off to the library and consult the sources. Consequently, a number of
references to publications in Dutch have been retained in spite of the low likelihood that a reader who prefers the English edition over the Dutch will consult those sources. References to Netherlandish scientific literature may also remind some readers of the oft-forgotten fact that such a literature exists. As a result of this policy, the reader will find only a few references to publications after 1961. Exceptions have been made for a few superior textbooks of a later date and for some striking new examples the author happened to have available. As a matter of course, the Bibliography and the Indexes have been adjusted accordingly.

The idea of translating 'Methodologie' was primarily to make the book internationally available, not only to English-speaking social scientists, but to all those who do read English and do not read Dutch. Hopefully, the English version will reach its audience as well as the Dutch edition has.
ACKNOWLEDGMENTS
As the Dutch version of this book was written during the author's Fellowship year (1959-60) at the Center for Advanced Study in the Behavioral Sciences, Stanford, California, the appearance of the present English edition can be seen as the, somewhat belated, fulfillment of an implicit promise. This is, therefore, my first chance, in connection with 'Methodologie,' to express my deep gratitude in printed English to this unique institution. Apart from the opportunities for concentration and fruitful professional contacts offered by the Center, the original edition benefited greatly from other inspirational, critical, and supporting contributions of many people — whom I would now like to thank once more anonymously.

The English edition is, again, a product of cooperation. During the years 1966 and 1967 Mr. Jop Spiekerman, The Hague, now and then consulting with me, produced a complete translation. This text was revised in 1968 by myself, largely at the Oregon Research Institute, Eugene, Oregon. Subsequently, it was re-edited by Dr. Lewis R. Goldberg, again in collaboration with me, and finally flown back to the Netherlands, where Mr. Spiekerman, Miss Merlien Evers, of the University of Amsterdam, and Mr. Arie Bornkamp, of Mouton & Co., The Hague, took care of the finishing touches.

This cooperative effort was made possible by the following institutions, to which I am strongly indebted: the Netherlands Foundation for the Advancement of Scientific Research, Z.W.O., The Hague, for providing a grant for the translation of the text; the University of Amsterdam and the Research Instituut voor de Toegepaste Psychologie, Amsterdam, for granting me a year's leave of absence; the Oregon Research Institute, Eugene, Oregon, U.S.A., and its director, Dr. Paul Hoffman, for extending an invitation to me and offering their cordial hospitality and
cooperation; the National Institute of Mental Health, Washington, D.C., for funding my visiting scholarship at O.R.I. as a part of Dr. Goldberg's project (MH 12972), at the latter's highly appreciated initiative.

Any translation, even if revised, remains a hazardous attempt at transposing ideas from one language culture into another; the result can hardly ever be expected to be perfect. If the present text reads in many parts as if it were written in English, this is largely due to Jop Spiekerman's solid and artistic first draft and to Lew Goldberg's uncanny ability to transform awkward phrasings (abbreviated AWK) into transparent English prose. In addition, the latter suggested, or inspired by his criticism, a number of substantive revisions which without doubt have improved the text. It remains a translation, and there may have been some flaws in the, time-consuming, cooperative process; but for the rest, the reader had better ascribe any obscurities he may still find to obscurities in the author's mind.

Next to translator and editor, thanks are due to all those who provided secretarial help: in typing the many consecutive drafts and corrections, in reading proofs, in preparing the revised Bibliography and Indexes, in maintaining the communication overseas, and the like. Miss Merlien Evers, Amsterdam, carried the heaviest load in most of these indispensable activities. At the Oregon Research Institute, Mrs. Linda Mushkatel's efficient cooperation in the final phase must be particularly mentioned.

Finally, I would like to thank my wife, Elsa, first for her secretarial contributions on two boat trips and in the United States, but much more for her invaluable identification with the whole project, and with the author.

ADRIAAN D. DE GROOT
Eugene, Oregon, 1968
Amsterdam, Netherlands, 1969
TABLE OF CONTENTS
PREFACE
ACKNOWLEDGMENTS

1. THE EMPIRICAL CYCLE IN SCIENCE
1; 1 The acquisition of experience
1; 1; 1 The empirical cycle; without reflection
1; 1; 2 The empirical cycle; as reflected
1; 1; 3 The shift from end to means; problem solving
1; 1; 4 The empirical cycle in thought processes
1; 2 Higher experiential processes: thinking, creation, interpretation
1; 2; 1 The universal cycle of end and means
1; 2; 2 The creative and the hermeneutic cycle
1; 2; 3 Multiplicity of cyclic forms
1; 2; 4 Indispensability of the cycle
1; 2; 5 The empirical cycle; the reporting of experience
1; 3 Aims and standards of empirical science
1; 3; 1 The aims of science
1; 3; 2 Selection of problems; degrees of certainty
1; 3; 3 Standards and techniques; logic and methodology
1; 3; 4 Unwritten rules
1; 3; 5 The forum
1; 4 The cycle of empirical scientific inquiry
1; 4; 1 The empirical cycle; in science
1; 4; 2 Observation
1; 4; 3 Induction
1; 4; 4 Deduction
1; 4; 5 Testing
1; 4; 6 Evaluation

2. DESIGNING THEORIES AND HYPOTHESES
2; 1 Characteristics of hypothesis formation
2; 1; 1 The process of hypothesis formation
2; 1; 2 Freedom of design
2; 1; 3 Freedom of concept formation
2; 1; 4 Factual underpinnings
2; 1; 5 Theoretical framework
2; 1; 6 Interpretation of the facts
2; 2 Means and methods of hypothesis formation
2; 2; 1 Facts and ideas — two approaches
2; 2; 2 Inspiration through the literature
2; 2; 3 Empirical exploration
2; 2; 4 Explorations of sample materials
2; 2; 5 Methods of interpretation: 'Verstehen'; empathy
2; 3 Formalization: problems of choice
2; 3; 1 Language; verbal or mathematical
2; 3; 2 Selection within one language
2; 3; 3 Tentative or definitive
2; 3; 4 General or specific
2; 3; 5 Complex or simple
2; 3; 6 Hypothetical constructs

3. FORMULATION OF THEORIES AND HYPOTHESES: A. THE DEDUCTIVE PROCESS
3; 1 Normative standards for formulation
3; 1; 1 Antecedent formulation
3; 1; 2 Logical consistency
3; 1; 3 Principle of economy
3; 1; 4 Testability
3; 1; 5 Stated empirical reference
3; 2 Deduction and specification
3; 2; 1 From general to particular
3; 2; 2 Theory, hypothesis, prediction: distinctions
3; 2; 3 From hypothesis to prediction
3; 3 Explicitation of a theory or hypothesis
3; 3; 1 Explicitation: ramifications
3; 3; 2 Nomological network
3; 3; 3 Three types of relations
3; 3; 4 Operational definitions of constructs
3; 3; 5 Relation between construct and variable
3; 4 The scientific prediction
3; 4; 1 Function, content, characteristics
3; 4; 2 Verifiability conditions and verification criteria
3; 4; 3 Lack of falsifiability and other shortcomings

4. FORMULATION OF THEORIES AND HYPOTHESES: B. CONFIRMATION
4; 1 Confirmation of hypotheses
4; 1; 1 Deterministic hypotheses
4; 1; 2 Probabilistic confirmation and probabilistic hypotheses
4; 1; 3 Relevance of predictions
4; 2 Acceptance and rejection of theories
4; 2; 1 Refutation of theories
4; 2; 2 Relative rejection and acceptance of theories
4; 2; 3 Theory development
4; 2; 4 Development of theoretical constructs
4; 3 Normative standards for the publication of theories and hypotheses
4; 3; 1 'Testability' necessary and sufficient
4; 3; 2 Different forum conventions
4; 3; 3 In quest of minimum requirements
4; 3; 4 Explicitation essential
4; 3; 5 Falsifiability

5. FROM FORMULATION TO TESTING AND EVALUATION
5; 1 Design of hypothesis testing investigations
5; 1; 1 Freedom of choice
5; 1; 2 Considerations pertaining to confirmation
5; 1; 3 Practical considerations
5; 1; 4 The importance of advance analysis
5; 2 From formulation to test: an example
5; 2; 1 Psychosomatic specificity
5; 2; 2 Step by step specification of the problem
5; 2; 3 Empirical specification of concepts
5; 2; 4 Experimental design; further specifications
5; 2; 5 Statistical testing: final decisions
5; 3 Testing and evaluation
5; 3; 1 Execution of the testing procedure
5; 3; 2 Disturbing factors
5; 3; 3 Problems of generalization
5; 3; 4 Cause or effect?

6. OBJECTIVITY: A. THROUGH THE EMPIRICAL CYCLE
6; 1 The principle of objectivity
6; 1; 1 What is objective?
6; 1; 2 Objectivity a basic requirement
6; 1; 3 Objectivity in research design
6; 2 From construct to objective variable
6; 2; 1 Instrumental realization; definitions
6; 2; 2 The evaluation problem as an example; goal, effect, measure
6; 2; 3 'Insight gained': an objective instrument
6; 2; 4 Objectivity and relevance
6; 2; 5 Development of instruments
6; 3 Objective selection of experimental (testing) materials
6; 3; 1 Universe and sample
6; 3; 2 Diversity of universes
6; 3; 3 Objective sample selection
6; 3; 4 Objective elimination

7. OBJECTIVITY: B. DATA COLLECTION AND ANALYSIS
7; 1 Objective questions and answers
7; 1; 1 The art of asking questions: precoding
7; 1; 2 The art of getting answers: coding
7; 1; 3 Ad hoc coding
7; 2 Question form and processing techniques
7; 2; 1 Relationships between collection and processing
7; 2; 2 Measurement and measurement scales
7; 2; 3 Scale construction and measurement as analogue representation
7; 2; 4 Problems of isomorphism
7; 3 Judgmental procedures: intersubjectivity
7; 3; 1 Judges as measuring instruments
7; 3; 2 Specific problems in judging
7; 3; 3 Controls and precautions
7; 3; 4 'Disinterested' judges
7; 3; 5 The judge a subject; paired comparisons
7; 3; 6 From expert to formula

8. CRITERIA FOR EMPIRICAL VARIABLES AND INSTRUMENTS
8; 1 Instrumental utility of a variable
8; 1; 1 Relations among basic concepts: a recapitulation
8; 1; 2 Instrumental utility: definition
8; 1; 3 Three construction requirements; three criteria
8; 2 Validity
8; 2; 1 Criterion validity as a simple operational concept
8; 2; 2 Criterion problems
8; 2; 3 Construct validity: measurement versus prediction
8; 2; 4 Contributions to construct validity
8; 2; 5 How to assess construct validity: a theoretical problem
8; 3 Accuracy and stability, reliability
8; 3; 1 Differentiation of the measurement scale
8; 3; 2 True value and chance error
8; 3; 3 Measures for the reliability of an instrument
8; 3; 4 The stability problem
8; 3; 5 Significance and uses of reliability measures
8; 3; 6 From measurement outcome to conclusion
8; 4 Internal efficiency and scoring
8; 4; 1 Internal efficiency
8; 4; 2 Internal consistency
8; 4; 3 Problems of scoring and scale construction

9. DIVERSITY AND UNITY IN SCIENTIFIC RESEARCH
9; 1 Different types of investigations
9; 1; 1 Limitations of this study
9; 1; 2 Five types of investigation: 1. hypothesis testing
9; 1; 3 2. Instrumental-nomological investigations
9; 1; 4 3. Descriptive investigations
9; 1; 5 4. Exploratory investigations
9; 1; 6 5. Interpretative and theoretical studies
9; 2 Methodology of interpretation
9; 2; 1 The interpretation problem: an illustration
9; 2; 2 Interpretation as an extension of explanation
9; 2; 3 Testing through extrapolation
9; 2; 4 Convergence within the universe
9; 2; 5 Testing by partitioning the universe
9; 3 Complex problems and devices
9; 3; 1 Multiplicity of variables
9; 3; 2 Complex procedures
9; 3; 3 Mathematical models
9; 3; 4 Machine models: simulation of behavior
9; 4 Unity of science
9; 4; 1 Idiographic-nomothetic: a difference in method?
9; 4; 2 Misconceptions concerning 'uniqueness'
9; 4; 3 Relative differences
9; 4; 4 Objectivity and other values
9; 4; 5 Unity chosen

BIBLIOGRAPHY
INDEX OF NAMES
INDEX OF SUBJECTS
CHAPTER 1

THE EMPIRICAL CYCLE IN SCIENCE
1; 1 THE ACQUISITION OF EXPERIENCE

1; 1; 1 The empirical cycle; without reflection
Empirical science seeks to gain knowledge of the world, that is, of the reality in which we live. Each particular scientific discipline endeavors to cover a certain, more or less well-defined sector of this world. To this end experiences (observations, results of experiments) pertaining to this sector are systematized in a manner which it will be our object to study. Basically, the activities of the man of science can be regarded as a specific instance of the various ways in which the human organism explores reality and adapts itself to it, or in other words, learns to control and manipulate its peculiar characteristics. The scientific investigator who seeks to condense his factual experiences and data into knowledge of the world is no more than a special case of the organism transforming its experiences and observations into experience about the world — which will enable it to function more efficiently than it did before, in a less experienced state. Both the most sophisticated scientific discipline and the most primitive form of gaining experience — that is, one totally unaccompanied by awareness or reflection — fall within the concept of 'learning,' in the broad sense of the word that has become current in psychological writings (cp. e.g. HILGARD 1958, pp. 2-6).

Let us first consider the process of gaining experience (learning) without reflection. The term is used here to indicate learning processes whose effect is determined exclusively from the better — i.e., faster, more precise, or otherwise more effective — performance of apparently goal-directed behavior as a result of previous experiences in similar situations. In learning processes 'without reflection' we have no reason to assume that
the learning organism is aware of the process or of its result: the experience gained. By contrast, knowledge can be defined as experience (of the external world) which the subject is aware of and can express in language, in the form of statements. Obviously, gaining experience without reflection is known primarily from observations and experiments involving animals and young children who have not yet the use of language. But in adults, also, similar non-conscious or barely conscious processes are frequently met with, for instance, in automatic or unconscious learning (see e.g., VAN PARREREN (1960) 1966, Ch. 4).

By way of preamble to our proposed analysis and structural description of scientific thought and procedure, it is useful — though not strictly necessary — first to consider the basic principles involved in any process whereby experience is gained without reflection. We can discern, then, a cycle of activities which apparently is repeated continually. It occurs both on a large and on a small scale, either in a more or less pure form or in a complex system of cycles interacting with other processes and with reactions of 'the world.' For an organism, O, in a situation, S1, the empirical cycle can be schematically represented as follows:
S1 --(1)--> O --(2)--> R --(3)--> S'1 = S1 + ΔS --(4)--> O'

in which:
S1 = the situation as it presents itself to the organism;
O = the organism;
R = a reaction of O;
ΔS = the effect of R on S;
S'1 = the modified situation (as it presents itself to O) after O's reaction R;
O' = the organism in its modified state after the experience has taken effect.

The question that will interest us most is what processes (influences, activities) are represented by the arrows.

(1) stands for the process of perception: S1 affects O; O perceives certain (selected) aspects of S1, 'cues' to which he reacts through R. If, as in cybernetics, the organism is viewed as a machine-like system, these cues constitute the input.

(2) represents O's reaction (action) to, or within, the situation. In terms of a process aimed at gaining experience, this reaction can be regarded as an attempt whereby one out of a number of more or less clearly definable possibilities is tried. In machine terms: (2) is a transition to a different state of O, R being its output.

(3) stands for the process in 'the world' which causes O's reaction, R, to produce a result, namely, some change (ΔS) in the situation, S, through which it becomes S'. In other words, (3) is the process by which the world reacts to R through ΔS.[1]

(4), finally, again represents a process of perception or, at any rate, one in which information is taken in by O: his evaluation of S'. This evaluation is twofold: it must be determined, first, whether the change ΔS has turned out for good or ill (for O), and second, what this means to O — what O 'has learned from it' — namely the potential influence of the experience on future S-O-R behavior. This last is the actual 'learning effect,' which — through 'feedback' — changes O to O'.

This schematic representation is, of course, not the only one possible. It is, however, clear enough to enable us now to summarize the cycle of activities. Bearing in mind that (3) represents an activity of the external world, not of O, we can sum up the cycle of O's activities as follows:

'perceive' - 'try' - (result) - 'evaluate'
[1] The general validity of this scheme can be maintained only if no reaction (R = 0) is also regarded as a form of reaction. Correspondingly, the possibility that ΔS = 0 must be admitted as well. The organism may gain experience without exhibiting a manifest response; and it may learn from cases in which S is not observably changed by R.
These terms have been put in quotation marks to indicate that they are used in a sense which differs from their ordinary meaning. When a human subject is said to observe, try, or evaluate something, it is implied that he does so consciously and deliberately. In the description of an experiential process unaccompanied by reflection, however, this implication is obviously unwarranted. Moreover, the concept 'try' is in another way more comprehensive here than when we speak of 'trying out something,' or 'proceeding by trial and error.' By definition, the organism is here supposed to be engaged in a process of 'trying' whenever there are valid grounds for proceeding on the assumption that it could have acted, or can learn to react, differently; at this point there is no need to specify what these grounds might be.

Regardless of how objectionable the above conceptualizations (by means of quotation marks) may be, a number of notions of some such general nature are needed if we are at all to understand the interactions of organism and external world by which experience is acquired. We cannot do without the assumption of some process of 'observation,' followed by one of 'trying' in the sense indicated above, and one in which the information is digested — which must be labeled by some such term as 'evaluation.' If learning-from-experience is considered to be possible at all, this or a similar mechanism is axiomatic.[2] The processes encountered in empirical investigations of learning processes can generally be expected to be more complex; they may comprise overlapping and combined cycles. Nevertheless, the functioning of 'observe'-'try'-'evaluate' cycles as basic units must be assumed.

[2] In various fields variants of this cycle are in use. The 'reinforcement of a reflex' encountered in the psychology of learning, for instance, can be accommodated in this same generalized scheme as a special case. If we define 'the external world' as everything providing relevant information (including sense perception) for the regulation of the activities of the organism, the cycle is to all intents and purposes identical with the feedback loop of cybernetics. There is an evident analogy with the TOTE-unit of MILLER, GALANTER and PRIBRAM (1960). Perhaps this is even more than an analogy, but it would lead us too far to go into this now.

Now, what will happen if, after his experience in S1, O finds himself in an analogous situation, S2? The cycle will repeat itself, in a slightly modified schematic representation, as follows:
(recurrent analogous situations)

S2 --(perception)--> O' --(reaction: 'trying')--> S'2 = S2 + ΔS --(evaluation)--> O''
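The comparison with the feedback loop of cybernetics, drawn in the footnote above, suggests a compact way of making the scheme concrete. The following sketch is an editorial illustration rather than part of the original text: the repertoire of reactions, the payoffs of 'the world,' and the running-mean update rule are all invented for the purpose. It shows an 'organism' that repeatedly 'perceives' a recurrent situation, 'tries' one of a fixed repertoire of reactions, and 'evaluates' the resulting change ΔS, the evaluation feeding back into its later behavior:

    import random

    # Editorial sketch (all names and numbers invented): one 'organism' O
    # running the 'perceive' - 'try' - 'evaluate' cycle over a repertoire
    # of three possible reactions in recurrent analogous situations.

    REACTIONS = ["R1", "R2", "R3"]
    value = {r: 0.0 for r in REACTIONS}   # O's accumulated experience per reaction
    tried = {r: 0 for r in REACTIONS}

    def world(reaction):
        # Process (3): the external world answers R with some change delta_S,
        # here a number; positive counts 'for good', negative 'for ill'.
        payoff = {"R1": 0.2, "R2": 1.0, "R3": -0.5}
        return payoff[reaction] + random.gauss(0.0, 0.3)

    for situation in range(100):          # S1, S2, ...: recurrent situations
        # (1) 'perceive': the situation presents itself (kept trivial here).
        # (2) 'try': one reaction is tried; occasional random choice stands in
        #     for the assumption that O could have acted differently.
        if random.random() < 0.1:
            r = random.choice(REACTIONS)
        else:
            r = max(REACTIONS, key=lambda x: value[x])
        delta_s = world(r)                # (3) the world reacts with delta_S
        # (4) 'evaluate': the learning effect feeds back, changing O into O'.
        tried[r] += 1
        value[r] += (delta_s - value[r]) / tried[r]   # running mean of outcomes

    print(value)

After enough cycles the accumulated 'experience' favors the reaction whose results have, on balance, turned out for good; this change in O's future behavior is just the learning effect the scheme describes.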
2. DESIGNING THEORIES AND HYPOTHESES

2; 1; 3 Freedom of concept formation
which itself can be neither observed, nor readily derived from observational data, nor yet easily reduced to empirical terms. It is contended by some that hypothetical constructs ought to be banned altogether or to be made subject to formal restrictions. We shall revert to this question in greater detail in 2; 3, where examples are discussed. At this point only one general remark is in order.

The movement seeking to impose restrictions on hypothesis formation originated in (neo-)positivist thinking in philosophy. Along with its attacks on metaphysics, it sought to rule out, or at least to curb, those broad, sweeping generalizations and 'explanations' in the sciences and the humanities which defy empirical testing. While this aim is of unquestionable importance, the outlawing of all or most hypothetical constructs is far too restrictive a means of redress. It can be shown to be both untenable and superfluous — largely on the same grounds as were stated above (2; 1; 2). First, any 'ruling' of this kind would be gratuitous, since what has been said above about the untenable nature of any 'principle of induction' applies with equal force here. Moreover, had such restrictions ever been enforced in the past, they would have seriously impeded the advance of theory. No doubt one might point to a number of ill-chosen, vague notions such as 'phlogiston' or 'the ether,' which must have hampered scientific progress. These failures, however, are outbalanced by other, more felicitous concepts (e.g. the 'atom'), which have proved extremely fruitful, although when they first came to be formed, they were just as ill-defined and 'wild.' Finally, such restrictions are superfluous since, as we shall see, the process of concept formation can very well be controlled retrospectively, that is, subsequent to the explicit formulation of the concept. The theoretical system as a whole, and the deduction of hypotheses from it, can be subjected to logical criticism, while the results of the testing of these hypotheses provide excellent materials for a critical evaluation on an empirical basis. In such an evaluation, then, the hypothetical constructs employed (in the theory) are a frequently chosen and quite legitimate target.

So we range ourselves on the side of freedom in the formation of concepts, on the express understanding that 'unrestricted' freedom prevails only up to the point at which they are given their definitive formulation, i.e., where theory, hypotheses (and evidence) are ready for a first publication. This is the point where the investigator enters phase 2
(induction: formulation of hypotheses). Emphatically, requirements must indeed be imposed on the actual formulation. These requirements will be set forth in Chapters 3 and 4 — where, for that matter, the problem of how to avoid the harms of a dogmatic excess of requirements is taken up anew (cp. particularly 4; 2; 4).

2; 1; 4 Factual underpinnings
Freedom in matters of design also extends to the degree and manner in which the investigator chooses to base himself on factual data available through earlier investigations. In principle, he is free to ignore them to a greater or lesser extent, or even not to know them. The risks involved will be obvious. First, he stands a fair chance that his co-scientists, who may be given to reckoning with the facts, will not take him seriously — whether rightly or wrongly is another matter. Secondly, those neglected facts may themselves contain materials for a thorough refutation of the new theory or hypothesis, without even requiring new experiments or observations; the hypothesis is then stillborn. A third possibility is that the new idea may indeed be good, but that it could have been worked out more adequately if the investigator had paid more attention to the known facts.

In general, the odds against successful hypothesis construction not based on systematic spade-work will be heavier, the more thoroughly the field in which the investigator is going to work has already been explored. If, in his particular domain, there have been many predecessors, he will be wise to refer to what these others have done, either by taking up more or less where they left off, or by rejecting their work on valid grounds. Alternatively, a new approach may enable him to show their results in a different light. In any case, he will have to know their work at least in broad outline, and find some way of bringing his own into line with theirs. This, however, is only a recommendation, not a strict requirement. There can be no doubt that in many disciplines, and certainly in the behavioral sciences, freedom of design, also concerning the factual substratum, still has a real meaning. In many areas there is still room for unorthodox, revolutionary, theoretical ideas and revisions. Invariably, there is, of course, some sort of factual or rather observational substratum — empirical hypotheses do not appear out of the blue. Even where there are no factual underpinnings derived from
scientific investigation, there are, minimally, the investigator's own perceptual experiences and observations.

2; 1; 5 Theoretical framework
Scientific hypotheses seldom, if ever, stand alone; they mostly derive from, and fit in with, a framework of theories covering a whole range of phenomena. Literally, theory means a 'beholding,' a 'view.' Here we understand by it a system of logically interrelated, specifically non-contradictory, statements, ideas, and concepts relating to an area of reality, formulated in such a way that testable hypotheses can be derived from them. Variations are possible in both the degree of logical tightness of the system and the strictness of its testable derivations (in the ideal case: an axiomatic system with purely logically derived hypotheses and indisputable operationalizations). A system of statements which, because of the terminology employed, or for other reasons, does not lend itself at all to deductive derivation of testable hypotheses is not a theory in the sense understood here.[1]
A theory may be regarded as a system of propositions by which constructs are related to each other — i.e., it may be regarded as completely divorced from reality. In empirical sciences, however, such a system functions as a model of the area of reality covered by the theory.[2]
It is linked up with the observables of empirical reality through the hypotheses which must be derivable from the theoretical statements by means of deduction and specification.

[1] Apparently, this definition does not impose very rigorous standards of explicitness on a theory. The author thus expressly departs from some current (American) views. According to these, only rigidly deductive systems — with primitive and defined terms, axioms, deduction principles and strictly operational definitions — are to be termed theories. On this view, most attempts at theory construction in the social and behavioral sciences would be, at best, no more than 'prototheories.' Such definitions are at variance with established usage in these sciences, since they would virtually deplete the class of 'theories.' Worse, they introduce an 'ideal' which, in most cases, is unattainable either for the time being or on principle. Perhaps their most damaging effect is that they tend to set up a false hierarchy of 'status.' As a result, they militate against the formation of good 'prototheories' — which are badly needed — and promote the premature use of mathematical models. Compare also 2; 3; 1, 2; 3; 6, 3; 3, 4; 3 and especially 9; 1.

[2] Like theory, 'model' is not defined here in any strict sense. The connotations in which mathematicians, physicists, psychologists, sociologists, etc. use the term differ considerably, but they have this common core of meaning (cp. SUPPES 1960): a model defines formally, preferably in terms of set theory, what is needed by way of (notions or variables for) objects, relations and valid operations, to enable deductions to be made.

Viewed in this (deductive) light, theory construction is of great importance as a method of producing judicious, hierarchically related, hypotheses which, when tested, can help to increase our knowledge in a systematic way. When an investigator surmises significant relations of dependence, he will, as a rule, endeavor to relate these to a more general, theoretical framework. Once such a logically valid framework has been (tentatively) constructed, it offers numerous possibilities in the way of elaboration, specification, and testing. A theory may thus be 'fruitful' and, at the same time, promote the logical and systematic integration of investigations in a given field. For the development of an empirical science, there is 'nothing more practical than a good theory' (LEWIN 1951, p. 169). At the same time, theory construction is more than just a means to an end: 'theory is both a tool and an objective' (MARX 1956, p. 6). The scientist's striving for knowledge concerning reality culminates in his attempts to construct comprehensive systems of pervasive functional relationships, that is, systems encompassing entire areas of reality and covering the phenomena encountered within these with an optimum degree of adequacy. Conversely, the construction of such systems — theories, that is — requires hypothesis construction and hypothesis testing. In fact, there is a constant interaction — as was to be expected (cp. 1; 1 and 1; 2).

Just as in empirical science there are always factual underpinnings, so there is invariably a theoretical framework in hypothesis formation. At least, there is always a basic idea of more general import, as well as, minimally, an effort to develop this into a theory. It would not be true to say that there is invariably a theory. Some investigators deliberately keep aloof from theory construction. They are indeed pursuing a plan (MACCORQUODALE and MEEHL (1948) 1956, p. 107), but they are unwilling (as yet) to mold this into a definite structure of constructs and statements. They choose to build the theory step by step from empirical findings and empirical laws of a lower order (e.g. HULL 1943; WOODROW 1942). Furthermore, in applied areas there are sometimes just a few more or less isolated hypotheses. This will particularly be the case in early stages of development, when social and scientific
interest does not yet go beyond seeking an answer to one, or perhaps a few, analogous questions. As soon as the field of inquiry is broadened somewhat, there will be a call for theory formation. This is exemplified by the recent development of theories concerning public opinion (ALLPORT 1937; HYMAN 1957), and by test theory and selection theories (e.g., GULLIKSEN 1950; CRONBACH and GLESER (1957) 1965). Another possibility is that theory construction is indeed actively pursued, but that all attempts break down because reality proves utterly intractable. A characteristic example is the investigation of parapsychological phenomena (telepathy and clairvoyance; compare e.g. EYSENCK 1957a, Ch. 3, pp. 140-141). In such cases the theoretical framework is still too implicit and/or too primitive to be worthy of the name of theory.

Again, a purely practical restriction must be imposed on the freedom of design in theory and hypothesis construction. Even more than in the case of factual underpinnings, the investigator will be well-advised to acquaint himself thoroughly with the theories of other investigators in the same or in analogous fields. For one thing, there is the possibility that an independently conceived new line of thought, which the investigator wants to develop into a hypothesis or a theory, will fit in with an existing theoretical framework, or can be brought into line with it. Another frequently employed procedure is that the investigator uses an existing, older theory as a 'booster' to get the new theory off the ground. When properly used, this method can be very fruitful, in particular if attention is focused on those points on which alternative theories give rise to conflicting hypotheses and/or opposed predictions, which can be decisively tested in crucial experiments.

2; 1; 6 Interpretation of the facts
We have seen that hypothesis construction presupposes factual or observational underpinnings; there is always a substratum of experience on which the investigator can base himself. In certain cases this may be entirely implicit, that is, unsystematized and unrecorded. On the other hand, it may well be a substantial collection of systematically recorded phenomena, observations, and 'measured' results from the investigator's own explorations and/or from the factual data (and theories) produced by investigations of others. Whatever the composition of these empirical materials, a new hypothesis formed on the strength of them will invariably depend on a certain
interpretation of the materials at hand. Such an interpretation of the available facts is often — though not always — a clearly discernible step in the process of hypothesis formation. To support this statement, we shall have to define the notion of 'interpretation' or, at any rate, to delimit it by contrasting it with such notions as hypothesis and explanation. It is hoped that the following correspondences and differences will serve to bring out the distinctions sought.

1. One 'interprets' or 'explains' something, a set of data, that is, a specifiable collection of phenomena or facts. The collection may consist of no more than one phenomenon which one seeks to explain or interpret. It may equally well be a vast complex of facts, for instance, in the case of a historian giving an interpretation of the French revolution.

2. In any case, the collection is a closed set, in the sense that in the course of the interpretation (or explanation) no reference is made to materials outside the collection. No efforts are made to add to the data by new observations. Nor is it implied that new observations are or must be possible; they may or may not be, but in either case the interpretation (or explanation) does not relate to them.

3. In an interpretation or explanation it is assumed that the phenomena within the collection which are to be interpreted (or explained) can be attributed to the functioning of a regularity or law of a more general nature, which, therefore, is supposed to manifest itself, in some form or other, also outside the closed set.

4. In the case of an explanation the more general regularity or law is accepted — perhaps only provisionally — both in its more general validity and in its applicability to the given closed set. In the case of an interpretation, however, there is room for doubt as to this validity or applicability, or both.

5. A hypothesis is an 'open,' presumed, more general regularity or law; new evidence, new observations are considered possible; they are in fact referred to, notably in the sense that the hypothesis can be tested by means of such new observations.[1]

[1] A variant of the latter is the ad hoc hypothesis. This is a hypothesis which is constructed 'ad hoc' (for this case alone) for interpretational use. The term is applied mainly to hypotheses which must serve to dispose, by way of interpretation, of certain unexpected or unwelcome empirical findings. If such is the case, its origins are not apt to inspire confidence, while there is also reason to doubt the sincerity of the reference to testing by obtainable new observations (5). According to our definition, an ad hoc hypothesis is indeed a hypothesis as far as its form is concerned. Unlike Eysenck's use of the term, it does not include pseudo-hypotheses, which are basically non-testable (EYSENCK 1952b, pp. 12-13).
In a more statistical terminology, the essentials of points 1 through 5 can be summarized as follows: A hypothesis is a supposed law (i.e., an expression of significant relations of dependence) in a well-defined universe. An explanation and an interpretation both attribute phenomena in a given sample to the functioning of a law in a universe — which may or may not be precisely defined — under which the sample is subsumed (or: from which the sample is supposed to be drawn). In an explanation both the functioning of the law in the universe and the legitimacy of the subsumption are accepted; in an interpretation, on the other hand, at least one of these points is dubious.

Now, if, as we have seen, a hypothesis is always formed on the basis of experiential data (materials) — the factual or observational substratum — it is obvious that hypothesis formation must be preceded by an interpretation of this 'closed' set of data. For hypothesis formation (first phase) some form of interpretation is indispensable. In 1; 4 we have seen that interpretation is also an indispensable factor in every evaluation (fifth phase), and, likewise, that, in the light of the overall cyclic progression of scientific inquiry, the (old) fifth phase merges into the (new) first phase. This also applies to the function of interpretation in the two phases. The only difference is that in the one case a process of interpretation opens a cycle of inquiry, for the direct purpose of hypothesis formation, whereas, in the other, it winds up the present investigation. But in the latter case, too, the interpretation prepares the ground for new, related investigations. Viewed in the light of the entire scientific process, interpretations serve the purpose of hypothesis formation. In other words, if interpreting means subsuming a given set of data under a universe in which a general law or regularity not yet formulated as a hypothesis is surmised, then a given interpretation naturally raises two questions. The first is: how to arrive at an explicit formulation of the supposed regularity, i.e., as a genuine hypothesis (containing a specification of the universe); the second is: how, then, to test the resulting hypothesis.[2]

[2] 'To ask for the cause of an event is always to ask for a general law which applies to the event' (BRAITHWAITE 1955, p. 2). The author uses this argument to justify his definition of the goal of the (natural) sciences as the formulation of general laws.
The fundamental, functional role of interpretation in the process of hypothesis formation can also be formulated as a requirement to be imposed on a good, that is, a scientifically fruitful, interpretation. This requirement is entirely analogous to the qualification stipulated for a good theory (cp. 2; 1; 5). An interpretation must be capable of transformation into testable hypotheses. An 'interpretation' which, because of the terminology employed, or for other reasons, cannot be broken down in such a way that its implicit hypotheses are specifiable does not lend itself to scientific use. It is not an 'interpretation' in the sense understood here.[3] For a more detailed discussion of methodological guidelines — recommendations again, not strict rules — for judicious interpretation, we must refer to later sections, particularly 3; 3; 6 and 9; 2. The 'methodology of interpretation' is a complex subject, which requires more attention than can be given to it in these introductory paragraphs.

[3] We shall see below that this statement needs a few minor qualifications (cp., e.g., 9; 2).
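The statistical phrasing of the preceding section lends itself to a small worked example. The sketch below is an editorial illustration, not part of the original text; the closed set of scores, the surmised regularity, and the simulated 'universe' are all invented for the purpose. It marks the step from an interpretation of a closed set to a genuine hypothesis: only when the universe is specified can new observations be drawn and the supposed regularity be put to a test:

    import random
    import statistics

    # Editorial sketch (data invented): from interpretation to testable hypothesis.

    closed_set = [12, 15, 14, 16, 13, 15]   # a given, closed collection of data

    # Interpretation: the phenomena in the closed set are attributed to a law
    # ('in the universe from which this sample stems, the mean exceeds 10')
    # whose validity, or applicability to this set, is still open to doubt.
    surmised_floor = 10.0
    print(statistics.mean(closed_set))       # about 14.2: consistent, yet no test

    # Hypothesis: the same regularity, now stated for a specified universe so
    # that new observations can be referred to; the universe is simulated here.
    def new_observations(n):
        # stand-in for drawing fresh cases from the specified universe
        return [random.gauss(14.0, 2.0) for _ in range(n)]

    # Testing: the prediction is confronted with evidence which the closed set
    # itself could never supply; the outcome may confirm or refute the law.
    fresh_sample = new_observations(30)
    print(statistics.mean(fresh_sample) > surmised_floor)

The closed set alone can make the surmised law plausible, but only the specified universe, with its supply of new observations, turns the interpretation into a hypothesis that can fail.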
2; 2 MEANS AND METHODS OF HYPOTHESIS FORMATION
2; 2; 1 Facts and ideas — two approaches
The reader can hardly have failed to observe that there is a certain polarity between the 'factual underpinnings' and the 'theoretical framework.' In common parlance we are apt to contrast facts with ideas. This is a good springboard for the ensuing discussion of methods and 'techniques'[1] of hypothesis formation, for we can in fact discern a dichotomy. Basically, there are two approaches open to the investigator seeking a vantage ground for the construction of sound hypotheses and theories. He can make himself more fully conversant with all the relevant
facts. Alternatively, he can find more fruitful ways of ordering and interrelating such facts or, in other words, produce better ideas.

[1] In thus making a transition from the strict demands of logic and (normative) methodology to a discussion of aids and techniques to be recommended for some problems, we come close to crossing the borderline into the province of what has been termed 'technicology' as opposed to methodology. 'Technicology,' however, is here regarded as being inherently part of (descriptive) methodology. In point of fact, just a few, very general, 'technicological' questions will be touched on.

Of course, there is something artificial in differentiating between these two aspects. They are interdependent, and they join forces in the process of hypothesis formation; through interpretation, as we have seen. In many cases, the attention given to one aspect will have a natural concomitant in the attention bestowed on the other. However, it is by no means inconceivable that in some situation (knowledge of) facts and (receptivity to) ideas should be more or less incompatible psychologically. This is particularly true of the first 'aid' or 'method' — the term is used loosely — which we propose to examine: systematic procedures.

Although the principle of freedom of design implies that throughout hypothesis formation no strict methodical requirements are to be imposed, it will be quite obvious that deliberate systematization can be an important aid. Obvious examples abound, and we shall mention just a few. In astronomy, as in many other sciences, systematically sustained observations and measurements have at all times played a fundamental role (cp. the use Newton made of observations by Kepler, who in turn used Tycho Brahe's). Medical science cannot do without systematic case studies, much as systematic description and classification are the mainstay of a science like zoology (cp. Darwin's theory construction). Again, systematic, descriptive statistics are the cornerstone of demography and sociography. Naturally, the relative importance of such systematically descriptive activities — recording, ordering, grouping, classification — may vary widely for different sciences. There can be no doubt, however, that we are dealing with a scientific, if not exclusively scientific, working method, which may be invaluable in the formation of hypotheses and theories (cp. also 9; 1; 4).

On the other hand, the systematization of facts may be a hindrance to the conception of good ideas. A peculiarity of all groupings and classifications of materials is that they are based partly on obvious external criteria springing from the materials themselves, and partly on presumably significant, more abstract viewpoints. However, both kinds of criteria, particularly the more abstract ones, anticipate hypothesis formation and tend to sway it in a particular, perhaps undesirable, direction. Evidently, some form of abstraction is inherent in any grouping or classification. One orders in terms of certain correspondences and differences — and in doing so one abstracts from other correspondences
and differences. Indeed, even in describing one case, one event, one observation, some form of selection according to certain external or more abstract viewpoints is inevitable. This also applies to a detailed, ostensibly 'exhaustive' description. One cannot do without words, and words of themselves unavoidably accentuate certain nuances and aspects.

Now, the problem of systematic description, that is, of ordering and classification, is that the elements of emphasis it introduces, combined with the abstraction employed, may obscure aspects of the materials which are actually of greater significance for the purpose of hypothesis formation. This problem may also be described as that of finding the most appropriate, theoretically productive distinctions and basic concepts, or variables to be measured. It is met with in all the sciences. Mechanics, for instance, could not make any real headway until the concept 'force' had been defined as being the cause of a change of motion (acceleration), not as the cause of motion itself. In other words, it had to wait upon the advent of a new viewpoint in the classification of phenomena of motion (see e.g. MARCH 1957, pp. 20-22). Even in botany, which is commonly regarded as a chiefly descriptive science, it has often proved very difficult to find those, again descriptive, criteria which were of relevance for hybridization purposes, and thus for theory formation (see e.g. on Indian corn: ANDERSON and BROWN 1952). Additional examples could easily be given, for instance, concerning the transition from (the barren system of) alchemy to chemistry.

The question, by what criteria one should observe, classify and measure, is of particular weight and relevance in those sciences which concern themselves with the study of not very concrete, and indivisibly complex, phenomena: the cultural sciences, the social sciences, and especially psychology. In consequence, it is particularly in areas like these that, as a safeguard against premature, unfruitful, conceptualization and systematization, a diametrically opposed method has been evolved. This method can best be described as one of systematic reflection upon the basic phenomena. True, the devotees of phenomenology disclaim all systematics, but theirs is at any rate a systematic non-systematization, since it is a 'method.' The phenomenologically oriented investigator (systematically) tries to avoid any form of classification and any preconceived theoretical line of thought. That is to say, he seeks, through unbiased, 'open-minded' reflection upon the pre-rational (i.e. human) significance
47
2. D E S I G N I N G T H E O R I E S A N D
HYPOTHESES
and 'value' of phenomena, to penetrate to their core, to their 'essence' (BOCHENSKI 1954, Ch. 2; M E R L E A U - P O N T Y 1945, pp. 1-77; cp. also K O U W E R 1953). Whether, and to what extent, such 'open-mindedness' can be realized is a question on which a good deal more might be said (cp. e.g. MULDER 1954). Nor would we care to subscribe to the phenomenologist's claims concerning the unique, objective value of phenomenology, whether labeled scientific or not. Here, the phenomenological method is regarded solely as another approach to hypothesis formation, our purpose being merely to bring out the signal services it is capable of rendering, especially in counteracting premature acceptance of concepts, variables, constructs and classifications. Accordingly, the phenomenologist's aim to penetrate to the 'essence' of the phenomena is paraphrased here as an endeavor to find new viewpoints which, it is hoped, will prove more fruitful. Between these two extremes there is, of course, a whole range of other possibilities in the way of systematization, comprising both methods of ordering and of reflection, one of these being systematic interpretation (cp. below 2;2;5). Another possible avenue — and this takes us back to practical recommendations (techniques) — is through the systematic perusal of the relevant literature, in quest of either facts or a theoretical framework. Again, empirical explorations may be undertaken to collect new observations; or a given empirical material may be systematically scrutinized from different angles to establish new viewpoints. Each of these techniques will be treated briefly below. 2; 2; 2 Inspiration through the literature
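The mechanics example above may be fixed in symbols. The contrast below is a standard textbook formulation, added here purely for illustration and not part of the original text: the new viewpoint reclassified force from the supposed cause of motion itself to the cause of change of motion.

\[
F \propto v \quad \text{(force as cause of motion, the pre-Newtonian classification)}
\qquad\Longrightarrow\qquad
F = m\,\frac{dv}{dt} = ma \quad \text{(force as cause of acceleration)}
\]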
2; 2; 2 Inspiration through the literature

It should hardly need stressing that a systematic study of the relevant literature may be an important, often indispensable, aid to hypothesis formation. Obviously, the investigator's grasp of both facts and ideas will be strengthened by studying the published work of others. This will acquaint him with the basic concepts and the terminology employed, with the ideas and views others have taken as their starting-points and with the theories and hypotheses they have evolved. To some extent, however, the antithesis between facts and ideas manifests itself also in this area. On the one hand, there is always a danger that the refutation — or confirmation — of a newly conceived hypothesis will be a foregone conclusion, if the investigator is not
thoroughly familiar with all the facts. On the other hand, there is a risk that fruitful hypothesis construction will be inhibited, if the investigator is indissolubly wedded to the facts, to traditional notions, and the commonly accepted problem formulations. Both risks will become the greater as the march of science advances in the field in question, that is, as the number of studies one simply must have read before one can make a contribution of one's own, increases. Again, we shall have to look for diametrically opposed remedies. The obvious way to improve one's grasp of fact is to make a thorough study of the publications bearing directly on the subject. Naturally, these should in the first place be broad surveys and thorough compilations, if these are available in a sufficiently reliable and detailed form. The line of division between what does and what does not have 'a direct bearing on the subject' can often be drawn quite sharply. If, on the other hand, our concern is with ideas, that is, if we are out to find new categories, new conceptualizations or theoretical viewpoints, we shall often fare better if we cast our net wide and seek inspiration farther afield. In substantively widely divergent areas the basic problems will frequently be found to exhibit similar structures, so that a certain amount of analogous borrowing will be warranted (OPPENHEIMER 1956). The natural sciences, in particular, have often served as models for theoretical developments in the behavioral sciences. In fact, a good many modern efforts to employ exact scientific methods, such as axiomatization and mathematical models, are based squarely on arguments from analogy. The reasoning is that methods which have been spectacularly successful in mathematics and the natural sciences should also prove effective in the behavioral sciences. As early as 1860 Fechner, and at a later date authors like Thurstone and Stevens, arguing in favor of the introduction of ratio scales in psychology (cp. 7;2;2), referred in so many words to the natural sciences and their achievements (see e.g. STEVENS 1951). The same applies to the use of the strictly 'hypothetico-deductive' method of theory development (e.g., HULL 1952; EYSENCK 1950, 1952b).
The growing teamwork among the sciences is certainly conducive to the spread of borrowing and to mutual experimentation with theoretical models. In areas such as the study of decision processes and cybernetics, which are meeting places for a number of quite diverse sciences, experimental borrowings are deliberately promoted (cp. e.g., DUNLAP SYMPOSIUM 1955; THRALL, COOMBS, DAVIS 1954; BUSH and ESTES 1959; see also the literature referred to in 9;3;3 and 9;3;4). Of course, there is a risk involved in such borrowings from alien quarters, in that an unsuitable or unfruitful theoretical framework may be forced upon a given field. In the history of a science like psychology, which so to speak was begotten out of borrowings, there have been frequent, and sometimes quite justified, complaints about 'physicalism,' 'atomism,' unduly 'mechanistic' systems, and so forth. There have also been grounds for complaint about 'pathologism' (DE GROOT 1952a, p. 200), that is, the use of the sick-well model in the psychology of the normal (e.g. in typologies and personality theory). However, it is particularly the frequent borrowings from the natural sciences that have repeatedly provoked sharp criticism. A formidable array of objections against this development in sociology and related sciences may be found in Sorokin's angry, one-sided, but nevertheless highly instructive book (SOROKIN 1956). Compare, for instance (p. 187): 'Most of the theories examined above (...); most of the psychological tests analyzed; most of the pseudo-experimental procedures mentioned — all are, to a great degree, manifestations of the same infectious fad of building up the psychosocial sciences into the alter ego of the physical sciences.'1 We shall confine ourselves to a single comment on this issue, which from time to time raises its head also in other sciences besides psychology and sociology. Borrowing is, in itself, a neutral matter. Any resemblance a model in the behavioral or social sciences may bear to a model in the natural sciences makes it neither venerable nor reprehensible. In fact, only two valid arguments can be advanced against an allegedly inadequate theoretical model. The first is that there is no need for the new model, which is tantamount to saying that investigations in the field in question can proceed equally well without the new theoretical framework. Actually, of course, restraint in theory construction is an important, at times sadly neglected, virtue, very much as reserve towards, rather than 'faith' in, established systems is commendable. The second possible argument is provided by the construction of a better model. Both answers, 'there is no call for it' and 'there is a better alternative,' can be substantiated only by continued investigations. From the viewpoint of empirical science, purely speculative critiques, in which 'atomism' or any other allegedly objectionable '-ism'2 in a theoretical system is decried, are not in themselves of great consequence. They can become significant only if they are an integral part of systematic preparations for (new) hypothesis formation and ensuing new empirical research. All the same, there is something to be said for including purely theoretical and speculative writings in any systematic study of the literature for purposes of hypothesis formation. Even if they fail to satisfy the requirements which will be set out in the next two chapters (cp. 3;1;4 and 4;3;4), they may still contain useful ideas and yield good starting-points (much as in daily life fiction and other reading will sometimes spark off ideas). Of course, the reader must take care not to become ensnared in such critico-speculative writings. A die-hard holistic argument advanced against all simplifying 'isms' is, for instance, that they are jejune and one-sided in face of reality, which is infinitely richer and more complex. That is obviously true — although, of course, it is hardly an effective argument against the scientific enterprise, pre-eminently abstractive as it is. If arguments like this one are embedded in well-written lofty disquisitions with a flavor of philosophy, however, there is a risk that they will captivate the reader; or, he may lose his way, that is, get farther away than ever from the formation of simple testable hypotheses.

1 In his address to APA psychologists Oppenheimer has also sounded a warning to this effect (OPPENHEIMER 1956).
2 For a vigorous bout of name-calling — it is hard to find another label for it — we again refer to SOROKIN (1956). A few choice specimens are: 'quantophrenia,' 'testomania,' 'the cult of numerology,' 'sham-scientific language,' 'sham objectivism,' 'senescent empiricism.' This is not to say that Sorokin's book does not contain a number of real arguments; the latter are mainly of the 'no-need' type.

2; 2; 3 Empirical exploration

In addition to systematic description and reflection (2;2;1), and a methodical study of the literature (2;2;2), there are other aids to hypothesis formation. The investigator may, for instance, make fresh observations to 'explore' his subject in search of significant connections. If empirical materials are collected with the express aim of 'wresting ideas' from the factual data, or of finding out whether certain ideas will 'work out,' we designate such operations as empirical explorations, or as exploratory investigations. These explorations are distinguished from regular empirical testing by the fact that they are not conducted to test prestated, precisely formulated hypotheses. This does not necessarily mean that there are no hypotheses or theories involved, and particularly not that the investigator
will not in fact have certain ideas and viewpoints. What it does mean is that data which have been collected in an exploratory fashion are neither intended nor suitable to serve the purpose of strict, scientific, hypothesis testing. Empirical explorations will vary a good deal in the degree to which the empirical data sought are clearly specified. The investigator may want to avoid all bias in surveying his field, that is, he may start his observations without any preconceived notions about the type of data and variables he is going to collect. Armed with no more than a general idea of what he wants to investigate and, naturally, with his scientific acumen, he will first let the materials 'speak for themselves.' That is to say, he will scan them for concrete data that may help him formulate his problem. Naturally, this approach again involves the risk that he will be confused, rather than enlightened, by the multifarious impressions received. This is why empirical explorations are rarely conceived with quite this amount of latitude. At the other extreme — for instance in a series of field experiments or extensive surveys — there will often be stringent advance decisions as to what variables are to be measured and what structural relationships are to be determined. This is where exploratory investigations assume the character of systematic inquiry. However, so long as they are not aimed at testing prestated, precisely formulated hypotheses or theories, they retain their 'exploratory' nature. It is of the utmost importance at all times to maintain a clear distinction between exploration and hypothesis testing. The scientific significance of results will to a large extent depend on the question whether the hypotheses involved had indeed been antecedently formulated, and could therefore be tested against genuinely new materials. Alternatively, they would, entirely or in part, have to be designated as ad hoc hypotheses, which could, emphatically, not yet be tested against 'new' materials. Whenever an investigation is partly designed for hypothesis testing and partly of an exploratory nature — which is a not infrequent occurrence (cp. 4;2;3) — a strict differentiation should be maintained between these two elements. In particular, this applies to the publication of results. It is a serious offense against the social ethics of science to pass off an exploration as a genuine testing procedure. Unfortunately, this can be done quite easily by making it appear as if the hypotheses had already been formulated before the investigation started. Such misleading practices strike at the roots of 'open' communication among scientists.
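The separation of exploration from testing insisted on here can be given an operational form. The sketch below is an editorial illustration in modern terms, not a procedure from the original text; it assumes Python with the numpy and scipy packages, and all names and figures are hypothetical. Half of the material is scanned freely to suggest a hypothesis; the antecedently formulated hypothesis is then confronted with the untouched other half, which serves as genuinely 'new' material.

import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_cases, n_vars = 200, 20
data = rng.standard_normal((n_cases, n_vars))   # candidate variables (pure noise here)
criterion = rng.standard_normal(n_cases)        # the variable to be 'explained'

explore, test = slice(0, 100), slice(100, None)
# Exploration: scan freely for the most promising relationship...
r_explored = [stats.pearsonr(data[explore, j], criterion[explore])[0]
              for j in range(n_vars)]
best = int(np.argmax(np.abs(r_explored)))
# ...then state the hypothesis in advance and test it on untouched material.
r, p = stats.pearsonr(data[test, best], criterion[test])
print(f"exploratory r = {r_explored[best]:+.2f}; "
      f"on new material: r = {r:+.2f}, p = {p:.2f}")

Because the second half played no part in the exploration, the P-value computed on it retains its strict probabilistic interpretation.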
For a further discussion of the characteristics and methods of more or less independent exploratory investigations we must refer to 9;1;5.

2; 2; 4 Explorations of sample materials
Naturally, the processing of data obtained from an exploratory investigation, or from other sources, may itself be of an exploratory character. We shall accordingly designate as systematic exploration of sample materials that form of processing in which a given material is scrutinized methodically from different angles and by a variety of techniques, in quest of significant patterns of dependence which may be formulated as tentative hypotheses. If the materials are to 'speak for themselves,' they must be coaxed to reveal their significant properties by skillful processing. A host of techniques are available for this purpose. If the data are of a qualitative nature, the investigator may, by purposeful variation, suppress certain aspects and highlight others. He may employ coding and scaling to elicit variables that will bring out interpretable patterns of dependence. He may also attempt systematic interpretation, for instance by trying out, in juxtaposition, a number of 'tentative explanations' of the phenomena encountered in the material. In other words, he may line up a number of tentative hypotheses, possibly in the form of general formulas or mathematical models, and seek to eliminate those which are at variance with the facts. For this type of inferential process, John Stuart Mill's time-honored system of 'inductive logic,' for instance, will serve quite well (MILL 1952, bk. 3, Ch. 8; cp. also TOMKINS 1947, Ch. 4). If the material contains quantitative or categorized data, or if these have been obtained by coding and scaling, a large variety of statistical techniques, such as correlation, cluster and factor analysis, are at the investigator's disposal for the exploration of structural relationships. The characteristic feature of this type of processing remains that a number of hypothetical viewpoints are heuristically exploited in scanning the material. For purposes of hypothesis construction, particular attention is then given to those patterns of dependence which have 'worked out' well in the given material. To distinguish between those patterns of dependence that have, and those that have not, 'worked out,' that is, to compare the relative strength or preponderance of the various relationships tried, statistical tests will often be employed. This method has the advantage of providing an objectively comparative criterion for the selection of what might be worthwhile from the viewpoint of hypothesis construction and testing. All the same, it remains an arbitrary criterion, which can at best be used loosely, to support an otherwise totally subjective and consciously interpretative choice. The warning in the preceding section about the necessity of maintaining a strict discrimination between exploration and testing must here be repeated even more forcibly. In explorations of sample materials, statistical tests, that is, appropriate calculations of 'significance,' may indeed be introduced, but they can in no way serve for 'hypothesis testing' in the sense of yielding a strict probabilistic interpretation of the (P-)results obtained. Not only have the relationships which are thus subjected to statistical tests not been antecedently formulated as hypotheses; what is worse, having been produced through search and trying out, they have undergone a selection ad hoc. Whenever systematic exploration is employed to get 'every ounce' of content from the material, one will undoubtedly also get out 'an extra ounce' of any chance contents — and these are often indistinguishable from systematic contents. Accordingly, the risk of error, while inherent in all generalizations from sample findings, will in the case of explorations defy calculation by ordinary tests of significance. It is much greater, but precisely how much greater, it is often difficult to determine. Even highly 'significant' results obtained in explorations cannot be considered outcomes of hypothesis testing in its proper sense. They may result from a snowball effect of accidentals induced by the ad hoc selection of one (or a few) out of an indeterminable number of possible hypotheses which the investigator eliminated, many of them before trying them out — but after 'looking at the data' (cp. e.g. DE GROOT 1956b). The very nature of this kind of explorative analysis precludes elimination of this snowball effect ('capitalizing on error'). Once in a while, an investigator may be given a chance to process data which he has not himself collected. In such a situation, he will usually proceed in an exploratory fashion, that is, he will try a number of approaches to see 'what develops.' On the other hand, he may well pursue an altogether different course. Since he has not himself collected the materials (through exploration), he has the advantage of being 'uncontaminated.' To him the materials are 'new.' Consequently, if they are a truly representative sample of a universe1, and if he has hypotheses readily
available, or can construct hypotheses concerning this universe, there is nothing to prevent him from using the materials to test his hypotheses. Emphatically, again the limiting condition applies that the hypotheses must have been formulated in advance, that is, before he studies the materials, and explicit predictions must have been derived from them.

1 Unfortunately, this condition is seldom fulfilled. Often the data are biased or incomplete owing to inadequate methods of fact finding. If the scientist who processes the materials has the advantage of being uncontaminated, this is largely offset by the drawback that the collector was in the dark about the hypotheses that the former wants to test. So he could not organize his methods to suit the other's hypothesis testing. (A good example of this kind of processing is to be found in FRIJDA 1960.)
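The 'snowball effect' of capitalizing on chance can be made tangible with a small simulation, added here as an editorial illustration (not part of the original text; Python with numpy and scipy is assumed, and the sample sizes are arbitrary). Every variable below is pure noise, yet selecting the strongest of forty explored correlations produces a 'significant' result in the great majority of data sets.

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_subjects, n_vars, n_runs = 50, 40, 1000
false_alarms = 0
for _ in range(n_runs):
    data = rng.standard_normal((n_subjects, n_vars))
    criterion = rng.standard_normal(n_subjects)
    # 'Explore' the material: try every variable and keep the best P-value.
    best_p = min(stats.pearsonr(data[:, j], criterion)[1] for j in range(n_vars))
    if best_p < 0.05:
        false_alarms += 1
# Roughly 1 - 0.95**40, i.e. about 87 percent of purely random materials,
# will deliver at least one 'significant' correlation at the 5 percent level.
print(f"runs with a 'significant' finding: {false_alarms / n_runs:.0%}")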
2; 2; 5 Methods of interpretation: 'Verstehen'; empathy

In the foregoing we have seen that interpretation is instrumental — both indispensable and functional — to the formation of hypotheses and theories (2;1;6). Hence, the important question of how one can, or perhaps should, proceed to interpret a given material 'legitimately' or 'fruitfully,' is highly relevant to the discussion of methods of hypothesis formation. At this point there is no need to examine what, from the viewpoint of strict methodology, constitutes a correct interpretation, and what criteria are to be imposed in this respect. To be sure, in everyday practice, particularly whenever problems of applied science demand a solution, this may be an all-important question. In the context of scientific research, however, the key question mostly concerns, not so much the correctness of specific interpretations, as the validity of the general hypotheses that were, or can be, constructed on the strength of them (cp. 2;1;6, footnote on p. 44). There are indeed exceptions to this rule, which will occur particularly when no generalization of the suppositions on which the interpretation is based can as yet be attempted, or when this is in fact inherently impossible; these exceptional, but important, cases will be discussed separately in 9;2. Thus we are left with the question of what methods may help ensure that an interpretation will be fruitful, that is, that it will lead to hypotheses which stand a fair chance of surviving later critical tests. In the following paragraphs just a few general suggestions concerning available procedures will be made. The first mention must go to systematic comparison, in which several pertinent interpretations are lined up and tried out successively. An analogous approach has been pointed out in the discussion of systematic
explorations of sample materials (2;2;4). Its characteristic feature is again that in the given material the consequences of each of the potential quantitative or qualitative interpretations are determined in this manner: If this particular interpretation were right, then the materials should exhibit such and such effects; whereupon a check is made to see whether these are forthcoming. An example of this approach is found in 'Management and the Worker' (ROETHLISBERGER and DICKSON (1939) 1949, pp. 87-89, pp. 531-537 and ff.), where it is used in interpreting the unforeseen occurrence of a progressively increasing hourly production in the Relay Assembly Test Room. In direct opposition to this method is consistent adherence to one particular interpretational scheme derived from the conception (theory, methods) of one 'school.' Admittedly, this approach is usually met with in applied areas. However, it is also possible to proceed on similar lines to see how far one can take a particular development or, in other words, to attempt the construction of further hypotheses within the framework of the given system. This implies that the system itself is not questioned and no attempts are as yet envisaged to test its validity. It is provisionally accepted as a 'working theory,' that is, as a system of related working hypotheses. An obvious advantage of this method is its consistency; an equally obvious disadvantage its one-sidedness. The disadvantage is a serious one and may well assume such proportions that, scientifically, the method is no longer acceptable. This will be the case when there is nothing but continuous theorizing all along the same lines, and no alternative theoretical models and interpretations are ever tried out. If such is indeed the case, the system is no longer a working theory that can be viewed with detachment, and which can still be tested. It is bound to develop more and more into a dogma of impenetrable complexity. Obvious examples both of acceptable and of disastrous applications of this method are furnished by theory construction in the various schools of depth psychology (for a discussion of psychoanalysis, see e.g. HEIDBREDER 1933; FRENKEL-BRUNSWIK 1954; EYSENCK 1953, Ch. 12, and many others). In conclusion, a few words need to be said about an approach which will here be regarded emphatically as another method of interpretation for the purpose of hypothesis formation, although the claims that have been put forward on its behalf go a good deal farther. This method is 'Verstehen,' which in 1;2;2 we have translated as 'understanding': the
gaining of insight into complex human or cultural phenomena. What is meant is not only sympathetic or 'empathic' understanding of the inner workings of a fellow creature, which is of course of unquestionable importance in social intercourse in general, and in medical, psychological and educational practice in particular (cp. e.g. ROGERS 1951, pp. 28-29). It includes also incisive understanding with regard to more complex relationships such as cultural phenomena, human perspectives, human products, interactions and institutions. In this sense, 'understanding' plays an undeniable role also in the social and cultural sciences. It is difficult to give a detailed description of its methods (the most significant attempts in this direction are DILTHEY 1894; SPRANGER (1914) 1925, Ch. 4; JASPERS (1913) 1959, 2. Teil 1, 5). In any case, Verstehen can to some extent be learned — so that we seem justified in designating it as a method. For our purpose, it is essential that this method should not be regarded as an inherently autonomous, so-called 'geisteswissenschaftlich'1, alternative to scientific methodology, with which this book is concerned, but as being an integral part of it. As such, its rightful place is under the heading 'hypothesis formation.' If a process of Verstehen results in the construction of an imposing edifice of 'plausible,' 'insightful relationships,' these can by no means be equated with precisely formulated, tested and confirmed hypotheses. A plausible, insightful relationship is not the final word in scientific inquiry. If it is worthwhile, it can be a starting-point for further inquiry through a reformulation in terms of testable hypotheses and the testing of these hypotheses. The importance of empathic understanding in sciences like psychology and sociology is no small one, but its rightful place is in the first phase of the cycle (cp. ABEL (1948) 1960). One of the implications of assigning it this place is that psychoanalysis must be regarded as a system that is still, to a large extent, unscientifically formulated and untested. For, Freud's system, though inferentially of a high level of complexity, was built entirely on the strength of his empathic understanding of clinical cases and other materials (among which his own dreams). These materials were no doubt carefully processed, compared, and integrated. All the same, the fact remains that they have not been taken beyond the stage of interpretation by virtue of empathic understanding. His researches should therefore be classified as 'exploratory investigations' and 'explorations of sample materials.' Precise formulations and critical hypothesis testing have never been performed by Freud and his followers. Moreover, psychoanalysis has suffered so often from the disability noted above — the one-sided use of an interpretational scheme — that large tracts of his system are not 'theory' in our sense (cp. 2;1;5), but dogma which has become impervious to empirical scientific inquiry. These remarks are in no way intended to belittle the unquestionable importance of Freud's work. Our intention is solely to make clear what, according to the modern conception of empirical science, is the place to be assigned to it. At present, a great many projects are on foot, and a great deal of work has actually been done, to test various aspects of psychoanalytic theory (e.g., SEARS 1943; HILGARD, KUBIE, LAWRENCE, PUMPIAN-MINDLIN 1952; WHITING and CHILD 1953; BLUM 1953; JANIS 1958; SARNOFF 1962). To us, this means that Verstehen has been fruitful, and has demonstrated how it can function as an integral part of the scientific method.

1 That is, relating to 'scholarship' (in the humanities, the disciplines of the mind (Geist), psychology included) rather than to exact 'science.'
2; 3 FORMALIZATION: PROBLEMS OF CHOICE

2; 3; 1 Language: verbal or mathematical
Freedom of design (2;1;2) also implies that the investigator is to some extent free to choose the form in which his hypothesis or theory is to be articulated. Of course, he must take care not to infringe the requirements for formulation which will be set out in the next two chapters (see particularly 4;3;4 and 4;3;5). Even so, he has some scope for individual choice. Its extent will vary according to the character of the factual underpinnings and of the theoretical framework he has adopted for his inquiry. If his work is directly in line with theories and investigations of others, the margin of choice will be narrow. Similarly, if his work can readily be molded in an established terminology or in an existing mathematical formulation, his model is virtually ready-made. If, on the other hand, he is concerned with original theory construction in a relatively unexplored area, or if he endeavors to design a new or modified model for investigating an old problem, he has a larger measure of freedom. We shall consider some aspects of this freedom in greater detail, starting out with external formalization: the choice of language, and of symbols and terminology.
As regards language, the major alternatives are verbal and abstract-symbolic (or mathematical) models. If the investigator is dealing with precisely stated functional relations between quantitative variables, a mathematical formulation is his obvious choice. If, on the other hand, the model is to serve for a general heuristic exploitation of a complex of functional relationships that cannot be satisfactorily stated in terms of relations between measurable variables, a carefully designed verbal form will be most adequate. Intervening are many cases in which the investigator can to a certain degree select a form to suit his own taste. Usually, verbal models for theories and hypotheses are associated with a primitive stage in the scientific attack upon an area of inquiry. Particularly in sciences like psychology and sociology, many investigators find mathematical formulations very attractive, and tend to invest them with higher status. This view frequently finds expression in the assertion that measurement and quantification are the only suitable tools for a science worthy of the name. Thus, to cite just one example, Cattell heads the first section of the book dealing with his extensive personality investigations: 'Realistic Theory Rests on Measurement' (CATTELL 1957). So long as there are no formulas, it is felt, there can be no real science. Undoubtedly many examples could be listed to show that in the history of science the initial stages of theory development were verbal, and that the real breakthrough did not come about until quantification was introduced. But we must beware of hasty generalization. For one thing, this is palpably untrue of cultural sciences like history and philology, which also employ theories and hypotheses, although their role is less prominent than in the natural sciences. Furthermore, as far as the more exact sciences are concerned, discussions on this score mostly take into account only the successful examples — Galilei, Kepler, Newton, Faraday, etc. The no doubt numerous, mathematically formulated models that have proved erroneous or irrelevant, and which are now buried under the dust of ages, are conveniently ignored.
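The kind of mathematical form at issue may be recalled concretely. The two formulas below are the standard statements of Fechner's logarithmic law and Stevens' power law, added here for illustration rather than quoted from the original text; both cast the relation between stimulus intensity $I$ and sensation magnitude $S$ in the functional mold of the natural sciences:

\[
S = k \log \frac{I}{I_0} \quad\text{(Fechner)}, \qquad S = k\,I^{\,n} \quad\text{(Stevens)},
\]

with $I_0$ the threshold intensity and $k$, $n$ empirical constants. Whether such a formula is valuable depends, as argued below, entirely on whether what is measured gives a theoretically productive grasp on reality.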
Quantification and mathematical expression of theories and hypotheses are not in themselves valuable. They can be so only if what is measured and expressed in formulas enables the scientist to get a firm, theoretically relevant and productive, grasp on reality. Now, in sciences like psychology and sociology it is often so difficult to produce relevant quantifications that in many cases it is better to be content with a verbal theoretical model, rather than to run the risk of being led astray by the no doubt brilliant example of physics (compare again Robert Oppenheimer's address to APA psychologists, OPPENHEIMER 1956, pp. 127-135). Nor should the possibilities of achieving precision in a carefully worded, well-integrated verbal model be underestimated. After all, it was to express the precise and inexorable character of causal relations found in nature that man came to adopt the term 'laws of nature' — on the analogy of law in the legal sense, which is expressed in ordinary words. One famous example may suffice: Charles Darwin's non-formalized, verbal theory of the origin of species (DARWIN (1859) 1929). An argument that may be adduced against a verbal system such as Otto Selz's theory of thinking is that a large proportion of the statements serve mainly to set up, through definition and codification, a more or less fixed abstractive and descriptive terminology, while relatively few statements lend themselves to empirical testing (cp. VAN PARREREN 1953, p. 433). In other words, the verbal theory is for the most part a descriptive and definitional system and frame of reference, and but to a very minor extent a theory in its proper sense, from which testable hypotheses can be derived (cp. e.g. ZETTERBERG 1954, p. 10). On the other hand, this may in fact be necessary for a systematic exploitation of the subject matter under inquiry. Moreover, such opportunities for testing as the theory does afford may well be of fundamental importance (cp. DE GROOT 1954b, pp. 118-119). Sometimes, indeed, the same phenomenon — an extensive logical model having but few links with overt empirical procedures — may be found also in mathematically formulated theories of comprehensive scope and import, such as the general theory of relativity. Also, it is true, in general, that (STEVENS (1939) 1956, pp. 44-45): 'an astonishing number of the scientist's sentences are syntactical in this sense,' namely, in the sense of providing stipulative definitions that are instrumental in establishing scientific usage. Naturally, the difficulty of keeping verbal models both logically integrated and sufficiently precise is greater than in the case of abstract-logical or mathematical ones. Frequently a helpful expedient will be to 'translate' the essence of a verbal model or to construct a precise symbolic-logical model for a theory (see e.g. WOODGER 1937). A particularly instructive discovery to be made in such efforts is, for that matter, that difficulties in the way of translation result partly from loose phrasing and redundancies, verbal 'frills,' in the original formulation, but partly also from
the efficiency of ordinary turns of speech, which can be transformed only through elaborate formalizations. It may be noted in passing that the fact that such translations can actually be made in some cases brings out again the investigator's freedom of choice.

2; 3; 2 Selection within one language
It should be fairly obvious that even within one form of language there is often room for a number of equivalent presentations. If an axiomatic method is employed (BOCHENSKI 1954, Ch. 4; (TARSKI 1941) 1949; an instance from social science in ZETTERBERG 1954), it is generally possible to choose which statements will be regarded as fundamental postulates, and which as derived theorems. Diverse mathematical models may turn out to be equivalent through far-reaching formal analogies — the classic example in physics being the Schrödinger wave equation and the Heisenberg 'picture' in quantum mechanics (see e.g. REICHENBACH 1951). In the verbal type of theory or hypothesis, the possibility of many equivalent presentations of the same model is evident.
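A toy example of this freedom of axiomatic presentation, added editorially (elementary order theory, not from the original text): for a transitive strict ordering $<$, one may postulate irreflexivity and derive asymmetry as a theorem, or postulate asymmetry and derive irreflexivity; the two presentations yield equivalent theories.

\[
\forall x\,\neg(x<x)\ \wedge\ \text{transitivity} \;\Longrightarrow\; \forall x, y\,\neg(x<y \wedge y<x),
\qquad
\forall x, y\,\neg(x<y \wedge y<x) \;\Longrightarrow\; \forall x\,\neg(x<x).
\]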
A problem by itself may be the selection of terms or symbols to denote newly introduced constructs or variables. Here, the investigator has a great deal of freedom indeed. The days are definitely past when scientists thought that notions derived from everyday speech have a precise, basic meaning, which must be discovered, for instance through phenomenological analysis, before they were 'permissible' in scientific usage.1 Whenever the investigator believes that the 'core domain' (VON MISES 1939) of a common speech notion is scientifically useful, and its 'area of indeterminacy' is not unduly large, he is at liberty to use it; on the express understanding that he will define it more precisely as soon as this is necessary. This may be done by verbal delimitations from other notions; by stipulating 'structural' relations to other concepts (EINSTEIN 1944); by specifying empirical criteria for its applicability (operational definition, see 3;3;4); by stipulating postulates — or by a combination of these methods. On the other hand, the investigator may use neologisms or abstract symbols, which are often said to have the advantage of being unburdened by old meanings or connotations, so that there is less risk of ambiguity. But again, this is a matter of choice, which will be determined also by the investigator's individual preferences, and by his intentions and pretensions.

1 Still, it seems but a few short years ago that, for instance in psychology, discussions were regularly conducted on this basis. Compare the relevant passages on such notions as 'language,' 'tool,' 'enjoyment,' 'insight' in DE GROOT 1944.

2; 3; 3 Tentative or definitive

The intentions and pretensions with which the investigator introduces a theoretical model may vary widely. To some extent, this is a matter of what attitude the investigator adopts towards his own theory or hypothesis. Individual attitudes may range all the way from lifelong dedication to a particular theory, combined with a passionate belief in its truth, to detached, almost sporting, experimentation with what is at most held to be a provisional theoretical or hypothetical solution. Neither of these extreme attitudes is objectionable, and the same applies to all intermediate forms, provided the investigator complies with the rules of empirical scientific inquiry. Both 'toying' with models (cp. 9;3;3 and 9;3;4) and a touch of monomania, or at least unswerving perseverance in exploiting one idea, can be valuable. Examples of the latter, 'heroic,' type of investigator are so well-known — Darwin is a classic case — that further illustration seems hardly necessary. Suffice it to note that we have thus glanced briefly at another instance of the investigator's freedom of choice — within the limitations imposed by his dependence on facts and the existing theoretical framework in the area under inquiry. An intermediate form that deserves special mention is one in which a model is provisionally accepted as a working theory or working hypothesis. Here, the investigator, while actually maintaining his reserve, nevertheless adheres consistently to a given model throughout a series of investigations. This, too, is unquestionably an acceptable attitude, and it may in fact be extremely productive. Needless to say, the condition applies that the investigator must not pay mere lip service to the provisional character of the model as a working tool, but must actually subject it to critical tests. In other words, if the investigations serve exclusively to elaborate the model, and the model itself is not exposed to such forms of critical testing as will render it falsifiable (cp. 4;3 and the observations on psychoanalysis made above in 2;2;5), then it is no longer a working theory but unacceptable dogma.
2; 3; 4 General or specific
Still another aspect concerns the general character of the theory or hypothesis. We say that theory (hypothesis) A is 'more general' than theory (hypothesis) B, if B can be designated as a special case, a sub-theory (or sub-hypothesis) of A. For instance, in physics: the kinetic theory of gases (B) and atomic theory (A), or, the theory of the refraction of light (B) and optics (A). Likewise, in the social sciences we can, for instance, distinguish between 'miniature' theories and 'comprehensive' theories (ZETTERBERG 1954; MERTON (1949) 1957, pp. 5-10), or between general hypotheses and specific ones, which may be held to relate solely to, for instance, a particular form of society, a particular community or institution, or to a particular subgroup of individuals. Miniature theories or specific hypotheses may deal with such specific topics as — in the psychology of perception — the perception of line configurations, binocular perception, constancy phenomena, etc. (cp. WOODWORTH and SCHLOSBERG 1955); or, problems of locally restricted range and import, for instance — in a very different field — the relations between authority and compulsory marriage partnering in some primitive communities (HOMANS and SCHNEIDER 1955); or, the validity of a given test program for particular selection purposes. Theories or hypotheses purporting to be of general validity are, for instance, psychoanalysis, and within it such a hypothetical notion as the Oedipus complex (MULLAHY 1955); general theories of decision making (cp., e.g., EDWARDS 1954); general theories of the business cycle (for a discussion see WITTEVEEN 1956); Toynbee's theory of the rise of civilizations (TOYNBEE 1957). In intuitive judgments, the 'general' nature of a theory is closely tied up with the scope of application claimed for it. In seeking a criterion for determining whether a theory is general, the most pertinent question to ask is: what universe does the theory claim as its area of unrestricted applicability? In principle, model B can then be a sub-theory (sub-hypothesis) of model A in one of two ways. First, by virtue of the fact that the universe of individuals or phenomena to which B applies is a subset of the universe of A. Secondly, because the elements of A are themselves collections, and the universe of B is one of these. In other words, B is either a specification (cp. 3;3) or a systematic — though not (yet) of necessity a deductively derivable — part of A. Admittedly, this is not a watertight criterion, since the answer may depend on the logico-systematic classification adopted; A may be more general than B in one respect, but more specific
in another. At all events, some guidance may be derived from the question as to what universe is stipulated. In this respect, the investigator's pretensions may vary a good deal too, and this will not depend only on his choice of subject matter — in which, of course, he is free as well. Once he has designed his model, he may again claim various ranges of application for it. Unfortunately, the question asked above is all too often left unanswered in the social sciences. Thus, psychologists frequently fail to state, for instance, whether a given hypothesis is intended to apply to all mankind, or only to certain groups or sections such as adults, occidentals, males, or only to certain subgroups, for instance, university students, or only in a particular period of a certain culture. Without prejudging the outcome of the discussions in the next two chapters, it is convenient to state here that the investigator is not free to leave such matters undecided (cp. 3;1;5).
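The criterion stated earlier in this section can be condensed into symbols; the notation is an editorial paraphrase, not a formula from the original text. Writing $U_A$ and $U_B$ for the universes over which theories $A$ and $B$ claim unrestricted applicability, $B$ can be a sub-theory of $A$ if

\[
U_B \subseteq U_A \qquad\text{or}\qquad U_B \in U_A,
\]

the first case when $B$ applies to a subset of $A$'s universe, the second when the elements of $A$'s universe are themselves collections and $U_B$ is one of them.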
2; 3; 5 Complex or simple

A theory, viewed as a system of concepts and definitional relations, may exhibit a greater or less degree of complexity. This point again is intimately bound up with one of the requirements to be imposed on hypothesis formulation, the economy principle (or 'parsimony law,' cp. 3;1;3), which states that, other things being equal, the simplest model is always the best. Among the 'other things' that are not of necessity 'equal,' are again the investigator's intentions and pretensions. Again, he has some freedom of choice. He is free to construct a complex — and hence more pretentious — model, though obviously this will place him under heavier empirical obligations. An important question, which was touched on in our discussion of the choice of language (2;3;1), is that of the extent to which the model is to do duty as a definitional and descriptive frame of reference for purposes of further theory construction. If it is intended as such, the investigator will have to formulate a fairly extensive range of concepts. The more concepts, and the greater the number and/or the complexity of the definitional relations among them, the more pertinent will be the question of what linkages the various parts of the system have with observables, and what the empirical procedures are to arrive at testable statements. A corollary of increasing complexity of the system of concepts and their interrelations is, in general, a growing 'distance' separating the theory from directly observable empirical facts. This, also, is a variable
characteristic of unquestionable importance. If the investigator seeks merely to express a straightforward empirical relationship in words or in a formula, a limited system of concepts will mostly suffice. If, on the other hand, he engages in theorizing about causal relationships, and about structural dependencies among various empirical connections within an extensive area of phenomena, then he will be faced with real theory construction — in terms of constructs exhibiting a higher degree of abstraction. This brings us to our last variable characteristic, the abstractive 'distance' between the constructs in a given theory and the corresponding observables. In 2;3;1 we have stated that the investigator has a certain 'freedom of concept formation,' but this point is so essential and has received so much attention in the literature that a more detailed discussion is called for.

2; 3; 6 Hypothetical constructs
Scientific usage distinguishes several types or categories of concepts which enter into the formulation of hypotheses and theories. An important, frequently employed distinction is that between empirical and hypothetical concepts; the latter are often designated as hypothetical constructs (cp. MACCORQUODALE and MEEHL (1948) 1956, p. 110). The difference is basically one of gradation, that is to say, the distinction is based on the degrees of abstraction employed in deriving empirical concepts and hypothetical constructs from directly observable facts. Another way of saying this is that the two categories can be distinguished by the different number of inferential or elaborative steps that either requires, to establish a clear linkage to empirical fact. Although it is indeed possible to list characteristics of hypothetical constructs not found in empirical concepts and vice versa, these will still not yield any absolute criteria to discriminate between them, but only relative ones (BERGMANN and SPENCE (1941) 1956, p. 59; MARX (1951) 1956, p. 114; FEIGL 1956, pp. 16-18). Whereas empirical concepts serve to 'provide a convenient shorthand summarization of the facts' (HULL, as quoted in MACCORQUODALE and MEEHL (1948) 1956, p. 107), and are fully represented by variables whose value can be measured or computed from empirical observations, hypothetical constructs are farther removed from empirical reality. They usually assume the 'existence' of some substratum, object or agency, or of a process or occurrence which itself cannot be directly observed (MACCORQUODALE and MEEHL (1948) 1956, p. 104; see also HEMPEL
1958)1. Examples are, in physics: the 'resistance' of a wire as against the 'atom' or 'electron'; in psychology: the 'reaction time' of a subject as against his 'inferiority complex'; in sociology: 'population density' as against the 'degree of urbanization'; in economics: the 'national income' of a nation as against its 'prosperity.' Naturally, not all notions are so easy to classify as these extreme examples — a notion like 'intelligence,' for instance, may be construed either way — but the general nature of the distinction will be clear.

1 In terms of Torgerson's distinction between 'systems' and 'attributes,' hypothetical constructs often stand for supposed objects (systems), which may indeed exhibit measurable properties (attributes) — that 'belong' to the system — but which cannot themselves be reduced completely to a specified set of observable (measurable) attributes (cp. TORGERSON 1960, p. 9, and in the present book, e.g., 3;3;5).

'Hypothetical constructs' have become something of a household word in psychology — where they are often compared and contrasted with 'intervening variables' (a term coined by TOLMAN 1936). Perhaps the gist of the distinction is brought out most clearly by these two terms: abstracta and illata (REICHENBACH 1938). Empirical concepts are derived from observational data by direct abstraction; hypothetical constructs, on the other hand, are inferred by reasoning (L. infero, illatum) or 'hypothesized.' While empirical concepts have never met with much serious criticism, the philosophy of science has at times been convulsed by debates on the admissibility of hypothetical constructs. To curb the above-mentioned vague theories, unrestrained conceptualizations, and unverifiable generalizations rife in some sciences (cp. 2;1;3), criteria have been sought that would make it possible to outlaw, if not all (as was at first envisaged), then at least certain kinds of hypothetical constructs, or a particular use of them. These efforts were aimed mainly at constructs with an unverifiable surplus meaning (likewise a term coined by Reichenbach) such as 'libido' and 'super ego' in psychoanalysis, and numerous others. Such constructs imply far more than can be made explicit through their linkages to empirical findings; moreover, this surplus meaning is poorly outlined, while it often plays an important but unverifiable role in the further development and application of the theory. Frequently, the surplus meaning will arise from the metaphorical character of such constructs; there is unavoidable interference from the entire figure of speech, perhaps most when it is least opportune. Although, nowadays, there is a measure of consensus about the danger
to the scientific enterprise resulting from such constructs, there can be no reasonable doubt that all efforts to formulate precise criteria for their proscription have so far failed (FEIGL 1956). This is brought home by recent recommendations. Thus, MARX (op. cit., 1956, p. 118): 'Probably the only real solution is a continuing pressure on the users of constructs and the developers of theory to improve the operational validity of their formulations'; MACCORQUODALE and MEEHL (op. cit., p. 110): 'We would argue that dynamic explanations utilizing hypothetical constructs ought not to be of such a character that they have to remain only metaphors.' Another frequently expressed view is that hypothetical constructs with a surplus meaning (often of a metaphorical character) are permissible, but only in the 'initial stages' of theory development (e.g. MARX (1951) 1956, p. 114 ff.). It will be obvious that these are far from clear-cut criteria. For instance, there is really no way of knowing beforehand whether a given construct will have to remain only a metaphor — unless facts can be adduced that are obviously at variance with its metaphorical content. Nor does the restriction to 'initial stages' have any practical significance. If there is a use for the construct initially, it will probably survive these early stages, and there will be no need to replace it with another concept, since with increasing empirical corroboration the (metaphorical) surplus meaning will naturally wear away. A good example is afforded by the development of the concept 'atom' (originally a concrete little ball), and in a way by the concept of 'intelligence' in psychology (cp. 4;2;4). The conclusion that no clear-cut criterion can be formulated is in accord with the principle of freedom of concept formation1 stated above (2;1;3). Methodological requirements can be imposed only on the manipulation of the concept in the formulation, deductive specification, and testing (phases 2, 3 and 4) of theories and hypotheses.

1 A discussion of the advantages of constructs with a surplus meaning, which is in good agreement with the view put forward here, may be found in ROMMETVEIT 1955, 1957, in opposition to SAUGSTAD 1956, 1957.

The problem of empirical concepts versus hypothetical constructs is here seen mainly as 'the theoretician's dilemma' (HEMPEL 1958), as a problem of choice confronting the investigator. He may choose to stick closely to the 'facts' or, conversely, to attempt more widely inclusive 'ideas.' He may seek to deduce strictly empirical laws, building up a scientific system step by step, starting 'at the bottom.' On the other hand, he may try a theoretical approach of a bolder, more comprehensive
sweep, and thus attempt to get a grip on causal relationships in reality. Both methods are legitimate and of vital importance for the progress of science. Contingent on this choice will be the investigator's need of a greater or smaller number of hypothetical constructs. The choice itself is free, and it is senseless to reproach others for choosing differently. The question of what requirements are to be imposed on the manipulation of a concept or construct within the context of the scientific process will receive ample attention in the chapters that follow.
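The contrast between abstracta and illata drawn in this section can be mimicked in a few lines of code, offered purely as an editorial illustration (Python with numpy; the retention figures and the forgetting model are hypothetical). An empirical concept is computed directly from the observations; a hypothetical construct is a latent parameter that exists only inside an assumed model and must be inferred.

import numpy as np

# Hypothetical retention data: proportion recalled after increasing delays.
delays = np.array([1.0, 2.0, 4.0, 8.0, 16.0])
recall = np.array([0.82, 0.65, 0.47, 0.28, 0.14])

# Empirical concept (abstractum): a direct summary of the observations.
mean_recall = recall.mean()

# Hypothetical construct (illatum): a latent 'trace strength' s, posited by
# an assumed forgetting model recall = exp(-delay / s); s is not observed
# but inferred by fitting the model (a linear fit on the log scale).
slope, _ = np.polyfit(delays, np.log(recall), 1)
s = -1.0 / slope
print(f"mean recall (measured) = {mean_recall:.2f}; "
      f"trace strength s (inferred) = {s:.1f}")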
CHAPTER 3

FORMULATION OF THEORIES AND HYPOTHESES

A. THE DEDUCTIVE PROCESS

3; 1 NORMATIVE STANDARDS FOR FORMULATION

3; 1; 1 Antecedent formulation
The preceding chapter has argued that, while the manner in which the investigator arrives at a theory or hypothesis can indeed be fruitfully discussed in terms of description and recommendation, it cannot be made subject to any strict rules. Now the question arises as to what requirements must be imposed on the formulation in the phase of induction — i.e., on the end result of the inductive process — no matter how this has been obtained, and irrespective of whether it is an extensive theory or an isolated hypothesis. Naturally, these requirements will be of immediate relevance to the processes of deduction, testing, and evaluation, which by rights must follow formulation in the cycle. If an investigation into certain consequences of a theory or hypothesis is to be designed as a genuine testing procedure (and not for exploration), a precise antecedent formulation must be available, which permits testable consequences to be deduced. The following four principles, for the most part familiar from the literature, stipulate certain conditions to which such formulations must conform. Their import is so obvious that at this stage they require little comment.
3; 1; 2 Logical consistency
A theory presents a logico-conceptual (possibly mathematical) model for the structural regularities or laws governing the phenomena in an area of reality; and this model must be non-contradictory (cp. 2;1;5), i.e., free from inconsistencies. It must be impossible for two or more consequences logically
deduced from the same theoretical system to be mutually contradictory. As a basic principle, this requirement is self-evident. Its practical application, however, may be fraught with uncertainties, particularly in the case of verbal, non-mathematical models involving relatively vague concepts and rules of inference. We are here confronted with a consequence of our comparatively broad and tolerant description of what constitutes a theory (2; 1; 5), which did not ban indefiniteness altogether, or at least did not demand ideal explicitness. When we come to discuss the principle of testability, we shall see that, as a matter of fact, certain minimum conditions of explicitness must be met (3; 1; 4 and 4; 3; 1), but that these do not rule out partial indeterminacies. As a result, it is often no easy matter to determine whether or not a statement is really a logically derivable consequence of the theory; nor, accordingly, whether the fact that two such statements are mutually contradictory discredits the theory. Nevertheless, this principle has a certain practical, in part preventive, in part corrective significance also for non-formalized theories. As soon as it is possible to deduce conflicting statements, even though this should be through an interpretation not intended by its designer, it is evident that the formulation is deficient. Whether or not a genuine inconsistency is involved can only be decided by attempting a more precise and specific formulation. If the inconsistency can thereby be removed without impairing the theory in other respects, so much the better for the theory. All the same, the criticism has been warranted — and it has achieved a positive result in the form of an improved formulation of greater explicitness. Searching for inconsistencies in theories is accordingly to be classed as constructive criticism, no matter whether the inconsistencies are real or merely apparent. Fairly frequently, this type of critical discussion will arise from the fact that an identical concept in a theoretical exposition is demonstrably employed in more than one way. By working out the consequences of consistent application of one function (definition) of the concept, the critic may then be able to demonstrate that it is at variance with the other(s). Even in the absence of such deductive proofs, it will be obvious that a demonstrable disparity in the use of an identical term in one and the same theoretical context constitutes a logical deficiency, which may lead to contradictory statements. Examples of this type of discussion are Wijngaarden's comments on the concept of 'the self' in Rogers', avowedly
preliminary, personality theory (ROGERS 1951; WIJNGAARDEN 1958) and De Groot's criticism of Van Parreren's use of such terms as 'logical thinking,' 'rational,' and others (VAN PARREREN 1953; DE GROOT 1954b).
3; 1; 3 Principle of economy
The logical model presented by the theory must be as simple as possible. The purpose of a theory (or hypothesis) is to account for certain phenomena in such a way that they can be accurately predicted. The theoretical formulation which performs this task with the smallest number of basic concepts, and the simplest assumptions, is in general the best. This criterion implies that one must practice economy in introducing concepts — a fortiori hypothetical constructs (cp. 2; 3; 6) — and assumptions. Over the centuries, this principle has been affirmed by different authors under different names — with certain variations in meaning, it is true, but always with the same basic intent. 'Occam's razor' is presumably the oldest known form. In the current literature it is most commonly designated by such terms as 'the principle of economy,' 'the law of parsimony,' and 'simplicity.' It may be applied to a theory, irrespective of its empirical linkages, as a requirement of 'systematic simplicity' (COHEN and NAGEL 1934, pp. 214-215), that is, economy in the number of assumptions relative to the degree of interrelatedness of these assumptions.1 Feigl calls this 'formal simplicity,' as distinct from 'inductive parsimony,' that is, the simplicity of the theory relative to its (inductive) explanatory power (FEIGL 1956, p. 14). It is particularly this last aspect which is of great importance in applying the canon of economy to the behavioral sciences. The principle is not merely concerned with aesthetic refinements such as the mathematical 'elegance' of a model. A theory or hypothesis that contains redundancies is also impractical, particularly if it is to serve as a basis for further exploitation. Also, the principle of economy has a direct bearing on our next point. What is formally superfluous is also unnecessary for the deduction of testable hypotheses and predictions from the model; it lacks empirical linkages, is irrelevant to testing, and as such non-testable (cp. 4; 3; 1).
This consideration enables us to gather formal and inductive simplicity together again in this formulation: a theoretical model should be as simple as possible as regards its empirical pretensions relative to the opportunities for testing which it affords. The principle has often proved its great worth, for instance in psychology as a means of challenging unwarranted anthropomorphism in theory construction concerning animal behavior. Conversely, however, inexpert and unduly rigorous enforcement of the principle has often hindered the development of science.
1 There is a connection between this (relative) 'systematic simplicity' and that absolute ideal of integration which in the logic of the deductive sciences is called 'sufficiency.' Its criterion is that 'each sentence formulated in terms of the theory must be capable of confirmation or refutation'; truly an ideal of interrelatedness (TARSKI (1941) 1949, p. 38).
3; 1; 4 Testability
A theory must afford at least a number of opportunities for testing. That is to say, the relations stated in the model must permit the deduction of hypotheses which can be empirically tested. This means that these hypotheses must in turn allow the deduction of verifiable predictions, the fulfillment or non-fulfillment of which will provide relevant information for judging the validity or acceptability of the hypotheses. This principle seeks to ensure that the theory be based squarely on overt empirical procedures in at least a number of places. There must in each case be a number of points on which precise investigative procedures can be instituted, capable of yielding such results as will permit the theory to be critically evaluated. The tenor of this principle is self-evident: the 'truth' or 'value' of a theory or hypothesis pertaining to reality can be assessed only through empirical testing. If the theoretical formulation is such that no testable consequences can be derived from it — as is the case with 'metaphysical' systems — then, as already stated in 2; 1; 5, it is 'not a theory in the (empirical scientific) sense understood here.' Presented in this form, the principle of testability constitutes an absolute requirement, which merely stipulates a minimum standard. However, like the principle of economy, it may be employed in a relative sense, as a variable index of merit. The more opportunities for testing, particularly of fundamentals, a theory affords, the better, ceteris paribus, it is in comparison with other theories (or hypotheses). We shall see later (cp. 4; 3; 5) that the crux of the matter is the 'risk of refutation' incurred: the more a theory 'risks,' the more it 'stands for,'1 and the greater its value will be — if it proves itself equal to the risk.
1 The more potential results a scientific law rules out — it is not for nothing that we speak of 'laws' of nature — the more it states (POPPER (1934) 1959, p. 41).
Another important aspect, as has been pointed out before, is the relation between testability and economy. A theoretical formulation which on the face of it is deficient in economy may be justified by the possibilities it offers of opening up new areas of testing; the model is then to be regarded as more 'pretentious' (cp. 2; 3; 4 and 2; 3; 5). Conversely, a pretentious, indifferently economical theory, operating with, for instance, many hypothetical constructs, should afford more opportunities for testing than a straightforward empirical hypothesis; the requirements of testability are stepped up.
3; 1; 5 Stated empirical reference
The formulation of a theory or hypothesis must contain a precise indication of the collection(s) of empirical phenomena to which it is supposed to relate. The designer will have to state what his intentions are, and what area of application, what universe, he claims for the theory or hypothesis (cp. 2; 3; 4). In effect, this principle states no more than that the investigator must clearly outline his empirical intentions and pretensions. It is certainly no less self-evident than the other three. Perhaps it is precisely because it is so self-evident that it is often not included among the standard requirements.1 In the behavioral sciences, however, hypotheses are all too often published in general terms, without any reference to the population to which they purport to apply. It seems opportune, therefore, to formulate an express demand for a 'stated empirical reference.' These principles need to be further elaborated. This, however, cannot be done, and their meaning and practical application cannot be clarified, until we have made a closer study of the processes of deduction, testing, and evaluation, which must follow the formulation of the theory or hypothesis.
1 This principle was included as a result of a suggestion (made in 1958) by H.C.J. DUIJKER (1960, XVI, p. 74). It presents what, by reason of our tolerant conception of 'theory,' is in effect a watered-down version of the principle of operationalization, which requires that all primitive terms in a formal system be provided with precise operational definitions (cp. 3; 3; 4).
3; 2 DEDUCTION AND SPECIFICATION
3; 2; 1 From general to particular
The deductive phase is characterized by a deductive mode of reasoning. This means that, in contrast with the inductive type of argument, which seeks to pass from particular statements of fact to more general statements, the reasoning process here consists in deriving either similarly general, or more specific (or more 'particular') statements from one or more given statements. It is possible to distinguish between these two cases: deduction of a similarly general statement (g for general), and deduction of a statement involving a 'particularization,' i.e., a transition from the general to the more particular (p). Furthermore, the derivation of a consequence in the deductive process may either be of a strictly deductive, purely logical character (l), or constitute an empirical specification (s). This last type (s) will occur particularly in unidirectional specifications of the manner in which a concept, a construct, or a variable is to be identified or measured (operationally defined). When these two dichotomies are combined, four cases can be distinguished: gl, gs, pl and ps. All four appear in the following example.
Suppose the hypothesis is: boys are in general more intelligent than girls. 'In general' is intended to denote a statistical relationship between sex and intelligence in the sense indicated — of course, the implication is not that any one boy is more intelligent than any girl. To test such a hypothesis, the investigator will as a rule formulate a null hypothesis, which he will then seek to disprove or, at any rate, to reject on valid grounds. The null hypothesis in this case might run something like this: In the population of 'all' children (which requires further specification) there is no relationship between sex and intelligence. That is to say, the distribution of intelligence among boys is the same as among girls. Now the argument may conceivably proceed like this: If the distributions are identical, then there must be proportionally as many 'intelligent' girls as boys, assuming that we divide all children — according to some prefixed empirical criterion — into two groups: 'intelligent' and 'non-intelligent.' This is a logical deduction (l), but at the same time a particularization (p). The proposition cannot be converted: the identity of the distributions
of the intelligence variable for boys and girls, which is in general conceived as being approximately continuous, does not follow from the identity of the relative frequencies for one dichotomy: intelligent/non-intelligent. It is quite conceivable that the population would in fact exhibit a difference if another dichotomy, for instance 'highly intelligent' children versus the rest, were applied. The inferential step therefore marks a transition from the general to the particular. Type: pl.
If the preceding proposition holds, and if in the population the relative frequencies of boys and girls are p1 and q1 respectively (p1 + q1 = 1), while the relative frequency of 'intelligent' children is p2 and that of 'non-intelligent' children is q2 (p2 + q2 = 1), then the population will show the following relative frequencies:

intelligent boys: p1p2
intelligent girls: q1p2
non-intelligent boys: p1q2
non-intelligent girls: q1q2
This logical inferential step entails no reduction of the general character of the proposition. What we knew or had assumed before is expressed in a different form; but the new form states no less (nor more) than the old one. Type: gl. To subject our hypothesis to empirical testing, we must fix a method to determine sex; for instance, a medical examination, a written statement by the child itself, or by its teacher. The first method is no doubt the most exact, but the other two will in general be considered adequate for the distinction intended. In other words, the inferential step: If the proposition holds for sex in its intended sense, then it must hold also for the sex as actually determined, entails so little loss of generality that it is negligible. Type: gs. If the proposition holds for intelligence in its intended sense, then it must hold also for the intelligence as determined by test X, in which, for instance, an IQ of 101 and upwards counts as 'intelligent,' 100 and below as 'non-intelligent.' Since in general the concept 'intelligence' will be regarded as not entirely equivalent to 'the IQ obtained in test X' (cp. 3; 3; 5 and 8; 2; 3), a transition to a more specific proposition is most certainly involved. Type: ps.
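By way of illustration, and not as part of the original argument, the purely logical gl step above can be written out as a minimal sketch in Python. The population proportions supplied to it are hypothetical values chosen only for the example.

def joint_frequencies(p1, p2):
    """Relative frequencies of the four sex-by-intelligence cells,
    derived purely logically (type gl) from the null hypothesis that
    sex and the intelligent/non-intelligent dichotomy are unrelated."""
    q1, q2 = 1.0 - p1, 1.0 - p2
    return {
        "intelligent boys": p1 * p2,
        "intelligent girls": q1 * p2,
        "non-intelligent boys": p1 * q2,
        "non-intelligent girls": q1 * q2,
    }

# Hypothetical population values, for illustration only;
# the four resulting values sum to 1, as the derivation requires.
print(joint_frequencies(p1=0.51, p2=0.50))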
Unlike case pl, the specific inference in ps is not logically 'necessary.' In our instance it is quite conceivable that a given choice of X will draw forth objections; some tests, for instance, are reputed to 'favor boys (or girls)' above the other sex (cp. ANASTASI 1958, esp. Ch. 14). We shall not now pursue the deductive elaboration of our hypothesis (but cp. 3; 2; 3 and 3; 3; 4, and Ch. 5); our purpose has been only to demonstrate the four types. Whenever in the sequel a distinction to this effect seems called for, we shall designate types gl and pl as logical steps and types gs and ps as specification steps. For the present, our concern is mainly with those steps in the deductive process which involve a transition from the general to the particular, that is, with types pl and ps. In the social sciences, specific inferences, particularly of the latter type (ps), which involve empirical specification as well, will usually be inevitable if verifiable predictions are to be derived from testable hypotheses. As a result, the deductive process in the third phase of the cycle generally is, as a whole, a process of step-by-step specification of the suppositions originally contained in the theory or hypothesis, until finally a concrete prediction concerning the outcome of a testing procedure is produced. The process may comprise a relatively large or small number of steps, varying with the logical 'distance' separating the theory from the prediction. In our terminology we distinguish only this triad of basic concepts:
the theory — a system of concepts and assumptions (cp. 2; 1; 5), from which testable hypotheses can be derived;
the hypothesis — a supposition concerning a regularity in, or an interdependence among, categories of phenomena in reality, from which concrete predictions can be derived;
the prediction — a statement of concrete observational findings or outcomes of data processing.
The number of steps in the process of particularization will mostly exceed two. The distinctions will then embrace, for instance, general and more specific hypotheses, or in some cases theories and sub-theories.
3; 2; 2 Theory, hypothesis, prediction: distinctions1
The definitions of theory and hypothesis given above show that the distinction is not a fundamental one. So long as a hypothesis is capable of, and requires, further elaboration in terms of more specific hypotheses for the purpose of testing, it is still itself of a complex character, and might be called a 'theory.' The difference is one of degree: a hypothesis is relatively simple, a theory more complex. This rule-of-thumb criterion might be given (for what it is worth): in general a hypothesis can be summed up in one sentence, whereas a theory cannot. Dictionary definitions of 'hypothesis' tend to stress its 'provisional' character. That is all very well for everyday usage, but in our conception this is not an essential characteristic and does not distinguish a hypothesis from a theory. True, from the viewpoint of the practical application and further exploitation of a supposed structural relation, a discrimination according to provisional adoption or definite acceptance may be important. But we must not forget that, viewed in the light of the activities of inquiry and thought within the entire scientific process, all empirical knowledge is of a relatively provisional nature. A hypothesis is no whit more 'provisional' than a theory; whether or not confirmed or accepted, both occupy an established, and as such by no means provisional, place in the process of scientific activity.2 It could be maintained that in discussing hypotheses the accent is more often on their conjectural character (with regard to a supposed lawfulness in reality) than is the case where theories are concerned. But this is a matter of emphasis. Moreover, there is a better way of saying this. Hypotheses are seldom articulated without mention of their empirical implications, whereas theories can be viewed as logical systems, in abstraction from their empirical references.3 This is done, for instance, when the logical consistency of a theory is studied (3; 1; 2). Again, however, this is only a difference in degree.
1 The distinctions that follow are not very strict — in consonance with the varied usages in scientific contexts of such concepts as prediction and especially hypothesis and theory. Curiously, this circumstance, unlike other divergent usages, seldom seems to be a barrier to mutual understanding.
2 Whenever hereafter — which will be but rarely — a hypothesis is expressly regarded as 'true,' we shall call it a law. A 'true theory' is then a 'system of laws.'
3 The term theoretical or logical model is often employed with special or exclusive reference to the logical structure of the theory as a deductive system — that is, in abstraction from the empirical content of its concepts and symbols. Its basic elements are a set of 'initial statements' and a 'calculus,' that is, a set of rules for the deduction of 'derived statements.' For an analysis of theories of an axiomatic structure, subtler distinctions and a more precise definition of 'model' are needed (cp. e.g. BRAITHWAITE 1955, esp. Ch. 4). For our purpose, the broad distinction of theories considered with, and without, their empirical references will suffice for the present.
The line of division between hypothesis and prediction is more clearly marked. A prediction refers to the expected outcome(s) of specific, i.e., pre-stated critical procedures, to be carried out on antecedently specified empirical materials. The generalizing, 'open,' character peculiar to a hypothesis (cp. categories of phenomena in the definition set out above) is absent in a concrete prediction. In statistical terms: a hypothesis assumes a law or regularity in the universe, irrespective of whatever sampling techniques are to be employed; by contrast, a prediction, in stating the outcome to be expected, refers to a specified sample or way of sampling. Directly in line with this is its most significant characteristic: a prediction is formulated in such a way that it is strictly verifiable, that is, when put to the test it will prove either true or false. When a prediction is tested, it is also verified; whereas in general a hypothesis can only be confirmed (cp. 3; 4). The distinction is indeed more pronounced than that between theory and hypothesis, but does not altogether eliminate the possibility of differences of opinion or confusion. In the sections that follow, and in Chapter 4, we shall have ample opportunity to tighten up the distinctions thus far outlined.
3; 2; 3 From hypothesis to prediction
It is worth while to dwell a little further on the manner in which predictions are derived from hypotheses. We are already familiar with the various stages in the process of logical deduction and empirical specification involved. Basically, these are the same as those occurring in the deduction of hypotheses from theories. Into the passage from hypothesis to prediction, however, the singular condition enters that the prediction must be strictly verifiable, whereas the hypothesis was not. How is this brought about? In the case of so-called deterministic hypotheses this is hardly a problem. The basic pattern of a universal deterministic hypothesis is: 'All A's are B' (e.g.: all children — or all boys in the so-called western cultures — develop an Oedipus complex during their childhood, cp.
3; 4; 3, example 4). On the strength of this, it can be predicted of any one A that it (he) will be B, and this may prove to be either true or false. Likewise, in the case of deterministic existential hypotheses, any A can in principle be regarded as a test case. Here the basic pattern is: 'There is at least one A that is B' (e.g. precognition does exist, that is, there is actually a person (A) whose paranormal powers enable him to foresee certain aspects of the future (B); or, there is a woman who has actually created great works of art such as symphonies (cp. RÉVÉSZ 1952, IV, 4, where this is in fact denied)). Such statements need only be transformed into: 'It is not true that all A's are non-B' to show that, in principle, any A can in fact serve as a test case. Now the prediction is stated in the form 'This A is non-B,' and a search is instituted for one or more cases in which this is not true. In the case of probabilistic hypotheses, however, a single individual cannot serve as a test case; or rather, this will hardly yield relevant information. The basic pattern of such hypotheses is, for instance: 'There are relatively more A's than non-A's that are B.' For example, there are relatively more intelligent boys than girls (cp. 3; 2; 1); or, 'All A's are B, except for such and such a chance of error'; or, 'A and B (e.g. intelligence and income of men) exhibit such and such a degree of correlation'; or simply: 'The population mean is such and such' — all these specifying the respective values. It will be obvious that critical testing calls for samples of more than one individual or case, but even then a verifiable prediction is still lacking. The devices employed in testing such hypotheses — details of which will be discussed later (5; 2; 5) — have this basic feature in common: they make use of conventional confirmation criteria, i.e., criteria created by deliberate fiat. When a sampling procedure is instituted on the assumption that the hypothesis is correct — or alternatively, that it is false and the null hypothesis is correct (cp. 3; 2; 1) — its outcome will exhibit a greater or lesser degree of 'probability.' By introducing certain assumptions, the investigator can then determine what risk of error is incurred if he decides to reject the (null) hypothesis on the strength of his findings. The usual procedure is to fix, by antecedent convention, a judiciously chosen but arbitrary boundary line marking off acceptable and unacceptable risks of error. The boundary line thus serves to delimit predictions that are positively confirmed from those that are not. This specification by convention enables a probabilistic hypothesis to
be transformed into a verifiable prediction. What is in fact predicted is that, in a specifiable testing procedure, the hypothesis will be positively confirmed according to prestated criteria. But so far we have only come within hailing distance of a concrete prediction. For this to result, it is essential that not only the confirmation criteria — or, in terms of the prediction, the verification criteria (cp. 3; 4; 2) — be fixed, but also the 'specifiable details' of the testing or verification procedure (cp. 5; 2; 3 and 5; 2; 4). Sometimes the creation of confirmation criteria is a little more complicated in that two boundaries, instead of one, are set up, thus marking off three intervals of risk of error. In this case the alternatives are: confirmed, disconfirmed (alternatively: null hypothesis rejected or not rejected), and no decision. Or in terms of predictions: positive outcome, negative outcome, non-verifiable. At first sight this special case of non-verifiability appears to upset our neat, if artificial, dichotomy (but cp. 3; 4). From the viewpoint of procedural decision making, however, such a trichotomy may be very useful; the halfway house between acceptance and non-acceptance of the hypothesis is, then, the decision to repeat the investigation, using fresh samples or a different experimental design.
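The trichotomy just described can be made concrete with a minimal sketch in Python, which is not part of the original text. The two boundary values, and the expression of the outcome as a familiar p-value, are hypothetical conventions introduced purely for the illustration.

def decide(p_value, reject_below=0.01, retain_above=0.10):
    """Sort a test outcome into one of three pre-stated intervals.
    The boundaries 0.01 and 0.10 are arbitrary conventions, fixed
    by antecedent agreement before any data are collected."""
    if p_value <= reject_below:
        return "positive outcome: null hypothesis rejected, prediction confirmed"
    if p_value >= retain_above:
        return "negative outcome: null hypothesis not rejected"
    # The halfway house: no decision is taken on this sample.
    return "non-verifiable: repeat with fresh samples or a new design"

for p in (0.004, 0.05, 0.37):   # three hypothetical outcomes
    print(p, "->", decide(p))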
3; 3 EXPLICITATION OF A THEORY OR HYPOTHESIS
3; 3; 1 Explicitation: ramifications
We have seen that the empirical testing or confirmation1 of a theory or hypothesis requires the testing (verification) of predictions obtained by step-wise deductive specification. Since in this process the scope of the theoretical supposition is narrowed, the verification of a single prediction can be of no more than limited import for the testing of the theory or hypothesis as a whole. Ceteris paribus, the confirmation value of a positively verified prediction, or of a derived, more specific, hypothesis will be smaller as the number of intervening steps of inference and their degree of specificity is greater, or in other words as the 'logical distance' from the original theory or hypothesis increases.
1 This term — literally 'strengthening' or 'support' — is most commonly used nowadays, naturally with the proviso that the support can also be negative. In our terminology, confirmation means testing inclusive of the first stage of evaluation, that is, assessment of the 'confirmation value' of the outcome (cp. Ch. 4).
Agreement with the theory in a special case, or under special conditions, is of little consequence for the validity of the theory in its entirety, but the introduction of 'special conditions' will often be well-nigh inevitable if a verifiable prediction is to be produced. One may of course try to minimize the difficulty by severely restricting the number of specifications and their degree of specificity, or in other words, by putting, whenever possible, quite general consequences to the test. In fact, it is often good policy to adopt this as a maxim for the more fundamental types of hypothesis testing. Insofar as technical and logical considerations permit, it is generally desirable to keep a tight rein on the process of deductive specification. In actual practice, however, it will nearly always prove necessary — exceptions will be discussed in 4; 2; 1 — to introduce at least a number of carefully and critically chosen specifications. In general, the testing of the body of a theory is of necessity carried out in piecemeal fashion — one or a very few of its consequences at a time. Adequate confirmation of a theory or general hypothesis, therefore, can only be obtained by verifying, not just one, but a number of predictions concerning the outcomes of several testing procedures. The inevitable narrowing of the scope and content of the theory throughout the deductive phase can only be compensated for by producing more than one series of specifications, by tackling a number of the specific ramifications of the theory. The transformation — through multiple deductive specifications — of a general theory or hypothesis into a ramified system of interlocking, more specific hypotheses, and eventually predictions, constitutes what we call the explicitation of the theory or hypothesis. The greater the 'logical distance' between the theory and the predictions — the more general or comprehensive the theory — the more major and minor ramifications will have to be investigated and confirmed in order to anchor the theory firmly to empirical fact. Each branch will require an empirical cycle of its own, in which the stage of hypothesis formation consists in the articulation (and formulation, phase 2) of the sub-hypothesis to be confirmed or the prediction to be verified.
3; 3; 2 Nomological network
For a more detailed description of the system of deductions and specifications which together constitute the explicitation of a theory, we need to introduce a few terms and draw some distinctions. A theory together with all its ramifications, insofar as these have at a given time been empirically worked out and tested, may be designated as the then available nomological network or net of the theory. Such a network may naturally be at different stages of actual realization. Ideally, it would provide 'complete' coverage of the area of reality with, preferably, nothing but positive confirmation outcomes — a 'completeness' that can be said to have been attained when the theory is definitely accepted by the forum of co-scientists as a system of laws.1 We may apply a similar definition to a portion of a theory, e.g., a hypothesis or sub-hypothesis together with all its available, relevant connections ('higher order' and 'lateral') and all its actualized ramifications ('lower order'). We shall accordingly designate this as the nomological net or network of a hypothesis. Similarly, it is possible to refer to the nomological network of a theoretical construct. Admittedly, we prefer to regard science in general, as well as a nomological net in particular, as a system of statements rather than of concepts (POPPER (1934) 1959, p. 35), but it is nonetheless a fact that these statements make use of concepts. The (sub-)system of those statements which either employ a given construct or contribute in some other way to the identification of its theoretical content or empirical meaning, constitutes the nomological network surrounding the construct in question. Some detailed illustrations may be found in CRONBACH and MEEHL 1955, p. 190 ff. (cp., too, Ch. 8, esp. 8; 2; 3).
3; 3; 3 Three types of relations
The nomological network of a theory comprises: the theoretical model, with its purely deductive consequences (gl and pl deductions), which is conceivable in abstraction from empirical reality; its derived hypotheses and predictions, both with empirical references (gs and ps specifications); and, finally, the 'evidence,' the factual empirical outcomes of investigative procedures.
1 The word 'nomological' (literally: in terms of laws) suggests this ideal state, which, judging from their description, the initiators of the net in socio-scientific literature (cp. esp. CRONBACH and MEEHL 1955, p. 187) had in fact in mind. In our usage, however, the term emphatically refers to the theory and the pertinent evidence at a certain stage of development.
If we adopt the dichotomy of theoretical (hypothetical) constructs as against empirical variables and observations, it may be argued (CRONBACH and MEEHL 1955, p. 187) that three types of statements can be distinguished in a nomological net: those relating
A. theoretical constructs or variables to each other;
B. observable variables (properties, quantities) to one another;
C. theoretical constructs to observable variables.
The logical relations of type A primarily encompass statements of connections among basic concepts within the theoretical model: definitional relations, postulates, and deductively derived theoretical statements. The empirical relations of type B comprise primarily statements concerning factual findings, results, outcomes of investigations. But in between these extremes there is a class of statements which can be construed either way, that is, as falling within either class A or B. The typical case in point is the hypothesis. A hypothesis considered in abstraction from its empirical references, as a consequence derived from the theory solely by logical deduction (gl and pl steps), must be classed as A. Considered conjointly with its empirical references, that is, inclusive of the gs and ps specifications, the same hypothesis may be said to constitute an attempt at summarizing the obtained (and obtainable) empirical findings, which would class it as an empirical relation of type B. This holds also for lower-order hypotheses labeled 'true,' that is, for empirical laws (e.g. women are on the average smaller than men). True, the prevailing view is that these are simple, conveniently generalized summaries of observational findings. But it is in fact possible to regard them as purely logical consequences derived from a more general law, hypothesis, or theory; for instance, a more general law relating sex to growth (in height) in mammals, or a general theory concerning the influence of sex hormones on human growth. Even a prediction can be construed either way: as B, that is, as a statement of fact in the form of a prediction; but also as A, when, in abstraction from its gs and ps references, it is viewed as a purely logical consequence derived from a hypothesis. So we find that a precise classification as either A or B is possible only for extreme types: theoretical, definitional statements and/or postulates, as opposed to factual statements. For most other types of statements,
particularly those in the important class of hypotheses, A and B furnish insufficiently distinctive criteria; they mark two different views of the same statement. It has been suggested that hypotheses should be classed as C since they supposedly tie theoretical concepts to observables (MARX 1956, p. 7). However, if they do so, it is only by virtue of their dual nature. They can be read in terms of either A or B; in what they state, however, they do not typically link constructs with observables. In consequence, we prefer to reserve class C for (statements concerning) the empirical specifications themselves — of types gs and ps. Since these do not derive logically, through an accepted set of rules of deductive inference, from the initial statements of the theory (or hypothesis), they certainly form a class apart, and are as such an indispensable part of the nomological net. Statements of type C define the relations between constructs and empirical matters of fact. They stipulate how these concepts are to be transformed into observables, and thus what empirical content they are to be given when used in hypothesis testing. Only by means of C statements can A statements, particularly hypotheses-as-derived, be transformed into B statements, that is, hypotheses of an empirically testable form, and hence into verifiable predictions.
3; 3; 4 Operational definitions of constructs
Empirical specification statements of type C above — or, according to the distinction employed in 3; 2; 1, of types gs and ps — evidently occupy a key position in theory explicitation. They explicate, they specify, or — in the case of ramifications connecting a construct with more than one empirical variable — they identify in network terms the intended content of a construct and pin down its meaning. They have a definitional function, which must of necessity be fulfilled if the testing of theoretical or hypothetical statements is to be made possible. Whenever a concept or construct is to be used in an empirical investigation, a minimum of empirical specifications is needed. For, 'using' a construct implies setting up certain distinctions between cases where it is or is not applicable, or between cases of varying degrees of applicability. At the very least one must draw a boundary line somewhere — 'defining' means setting limits. This boundary line must be marked clearly enough to enable the investigator to discriminate — objectively, adequately (with
respect to the construct-as-intended) and with sufficient reliability1 — between A and non-A cases; e.g. between boys and girls, intelligent and non-intelligent children (cp. 3; 2; 1), social groups and collections of people not to be included in the construct 'group,' democratic and non-democratic forms of government, etc. Frequently the distinctions will be carried further, in that more than two categories are marked off (Catholic, Protestant, Non-church); or a graded scale will be drawn up to 'measure' height, intelligence quotient, price index, etc. To these ends are needed one or more empirical specification statements providing an objective instruction on how to proceed in given empirical cases, so as to effect the distinction between A and non-A, or between different scale values. Once the construct has been tied to a distinctive criterion by means of such an objective instruction, we say that it has become an empirical variable (cp. Chs. 6 and 7). The instruction then specifies the operations (observation, recording, ordering, categorization, calculation) to be carried out — for any given case — to determine the quantitative or qualitative 'value' of the variable. Thus, in a psychological investigation, for instance, the concept 'intelligence' is empirically specified by the set of instructions for the operations of administering and scoring test X, calculating the IQ, and possibly classing the subject under 'high' or 'low' intelligence (cp. 3; 2; 1); the concept 'sex' by the instruction to the children 'Boys write "B" and girls "G" on your answer sheet.' It will be clear that such a set of instructions defines the concept. A definition on this basis is called an operational definition.2 The principle of testability (3; 1; 4) can now be amplified as follows: (1) A theory must permit the deduction of at least a number of hypotheses that can be tested; (2) therefore, at least some of the concepts in each hypothesis must be amenable to empirical manipulation; (3) to this end, these concepts must undergo empirical specification (that is, be transformed into empirical variables) by means of operational definitions of the distinctions relevant to the use of the concepts.
1 For a further analysis of the notions 'objective,' 'adequate,' and 'reliable,' thus baldly introduced here, compare Chapters 6, 7 and 8.
2 The operations specified by an operational definition need not be empirical specifications. In mathematics, for instance, or in theoretical physics (BRIDGMAN 1928), a formula or instruction stating how Y is to be calculated from X as found may be considered an operational definition of Y.
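To make the notion of an operational definition as an objective instruction concrete, here is a minimal sketch in Python, not part of the original text. The IQ threshold of 101 and the answer-sheet convention are taken from the example above; the function names themselves are hypothetical.

def intelligent(iq_on_test_x):
    """Operational definition: an IQ of 101 and upwards on test X counts
    as 'intelligent,' 100 and below as 'non-intelligent.'"""
    return iq_on_test_x >= 101

def sex_of(answer_sheet_mark):
    """Operational definition by written statement:
    'B' on the answer sheet -> boy, 'G' -> girl."""
    return {"B": "boy", "G": "girl"}[answer_sheet_mark]

# The construct, tied to a distinctive criterion by an objective
# instruction, has become an empirical variable: any given case
# receives a definite value.
print(intelligent(108), sex_of("G"))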
3; 3; 5 Relation between construct and variable
Various relationships are possible between the (theoretical) construct-as-intended and the (empirical) variable-as-operationally-defined. In 3; 2; 1 we have introduced a distinction between two kinds of empirical specification: 'specifying' in the sense of stating the original content more completely and in greater detail, without loss of generality, type gs; and 'specifying' in the sense of making the original content more 'specific' as well (i.e., more particular), thereby reducing the generality of the original contention, type ps. This distinction may be applied not only to individual specification steps of various kinds, but also to entire operational definitions of constructs. The operationally defined variable may completely cover the construct (type gs), or it may only partially cover it (type ps). The case of complete coverage (gs) presents few problems. It will be met with particularly in operational definitions which essentially do no more than state a method (and an instrument) by which a variable is to be 'measured' in the everyday sense of the word (cp. 7; 2). A thermometer is used to 'measure' temperature; but it is legitimate to say that the number of degrees read 'is' the temperature — albeit different scales may be used. Likewise, for instance: reaction time = number of (hundredths of) seconds read in a correctly designed and executed reaction experiment; height = number of centimeters (inches) read; output = number of finished units counted. Such (empirical) concepts have no appreciable 'surplus meaning' compared with the operationally defined variables (cp. 2; 3; 6). Much more problematic are cases of partial coverage (ps), that is to say, partial coverage of either the concept or the distinction intended. It is essential to differentiate between these two cases; otherwise the term 'partial coverage' might easily give rise to misunderstandings. A good example is furnished by the distinction between boys and girls discussed in 3; 3; 4. The operational definition, based on the instruction: 'Boys write "B" on your answer sheet, etc.' was properly construed as a gs case, since for most purposes it may be assumed that this rough-and-ready procedure causes little, if any, loss of intended meaning. The distinction-as-intended is completely covered by the distinction-as-determined, but this by no means implies that we have thus given a definition 'covering' what sex is, much less what a boy or girl is.1
1 The confusion arises because the function of a restricted, 'stipulative,' operational definition is unwittingly mistaken for other types of definitions — which abound (cp. e.g. ROBINSON 1950).
It is generally true to say that operational definitions are seldom possible or even desirable for concepts denoting systems or objects which implicitly define concrete-empirical or abstract-hypothetical entities, systems, or processes such as 'molecule,' 'state,' 'human being,' 'puberty,' 'ego.' Objects in this sense, however, are by definition possessed of properties or attributes (TORGERSON 1960, p. 9). Even with high-order hypothetical constructs (cp. 2; 3; 6), it is therefore possible, and sometimes necessary, to establish operational distinctions between (categories of) objects, or to operationalize some of the delimitations of an object based on one or a few of its attributes. Naturally, however, it is concepts denoting attributes which are pre-eminently in need of being operationally defined. The further discussion in this section and in later chapters (4; 2; 4 and Ch. 8) will deal almost exclusively with concepts or constructs denoting attributes. In some cases of partial coverage, an operational definition providing complete coverage is available in principle, but its application requires procedures so laborious and time-consuming that the investigator contents himself with an approximation. He does know exactly what he would like to determine (measure), but practical considerations prompt him to attempt no more than an indirect, approximative measure.1 In economics and demography such operational definitions are of frequent occurrence; real cost or real frequencies are approximated by, largely indirect, standard methods of estimation. Kinsey, too, knew precisely what he meant by 'frequency of sexual outlet,' but he could obtain pertinent data only from what his subjects related in interviews (KINSEY et al. 1948). Another example of partial coverage and approximation may be found in studies on the predictability of success at college. Since it is often not feasible to wait for the emergence, after many years, of a definitive distinction between degree holders and non-degree holders, an intermediate criterion, that is, an approximative operational definition of 'success at college,' will be introduced (cp. e.g. TECHN. INST. DELFT 1959).
1 This does not include the case of statistical estimations of population parameters from sample findings. There the estimated result remains an estimation of the 'original' variable; there is no substitutive operational definition involved.
More problematic are those cases in which the non-operationalized surplus meaning of a construct cannot be precisely specified. In the above-cited report this difference is brought out clearly when the discussion shifts from (empirical) academic achievement to the operationalization of 'potential capacities' for college work (op. cit. Ch. 9). In general, it is the more theoretical, hypothetical constructs (denoting attributes) which present problems, mainly because they are associated with diverse empirical phenomena. It is difficult to express the 'degree of prosperity' or 'standard of living' in a country, the 'intelligence' or the 'degree of adjustment' of an individual, in a single variable (index or test result) otherwise than by introducing a fairly arbitrary ps specification. Examples are: the price of bread as the index of 'the standard of living'; the score obtained in test X as the index of a subject's 'intelligence' or 'social adjustment.' An interesting aspect of the example of intelligence is that here one concept proliferates into a large number of more or less accepted, operationally defined variables (test methods). Each test definition is a relatively arbitrary specification (ps) of the construct. Naturally, the various differing intelligence tests must have an empirically identifiable common denominator — which since SPEARMAN (1904) has often been designated as the 'general factor g.' As things are, the content of the concept 'intelligence' might conceivably be identified by listing all acceptable measurement techniques and by making reference to the general factor which could be empirically obtained with a sufficiently large and representative sample of the population. 'In principle,' this would again yield an operationally defined variable, albeit one that can only be approximated, in a number of ways. It is probably correct to say that, if we discount its approximative character, such an operational definition provides 'complete coverage' of the construct 'intelligence,' at any rate as it is understood in differential psychology. In similar fashion, an analysis might be attempted of the relations of other constructs and concepts to their operational definitions, for instance the 'neuroticism' of a subject, the degree of 'interaction' in a (psychological) group, the degree of 'readability' of a text, the distinction between 'democratic' and 'autocratic' leadership, the 'status' of an occupation — to mention just a few well-known examples.
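One conventional reading of the general-factor identification mentioned above can be illustrated in code. The following Python sketch is emphatically not the author's procedure and rests on assumptions of its own: it takes the first principal component of the correlation matrix of several test scores as a stand-in for g, and uses randomly generated scores in place of a real, representative sample.

import numpy as np

rng = np.random.default_rng(0)
scores = rng.normal(size=(200, 5))      # 200 subjects, 5 hypothetical tests
scores += rng.normal(size=(200, 1))     # a shared component across all tests

corr = np.corrcoef(scores, rowvar=False)          # 5 x 5 correlation matrix
eigenvalues, eigenvectors = np.linalg.eigh(corr)
loadings = eigenvectors[:, -1]                    # direction of largest variance

z = (scores - scores.mean(axis=0)) / scores.std(axis=0)
g_estimate = z @ loadings               # one approximate 'g' value per subject
print(g_estimate[:5])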
The problematic aspect of cases of 'partial coverage' will be obvious: the basic question is to what extent the operationally defined variable is an adequate representative of the construct. This question will be taken up again in Chapter 8. Another important question has to do with the development of constructs: to what extent are operational definitions ploughed back into the construct itself? This question of the interaction between construct and variable will be dealt with in some detail in Chapter 4 (cp. 4; 2; 4).
3; 4 THE SCIENTIFIC PREDICTION
3; 4; 1 Function, content, characteristics
One point that must have emerged clearly from the foregoing is that the concrete prediction, the last link in each of the deductive chains through which a theory is explicitated, occupies a key position in empirical science. 'If one knows something to be true, he is in a position to predict; where prediction is impossible there is no knowledge' (1; 3; 1). We shall now examine the function and characteristics of predictions in more detail. The term 'prediction' is used in empirical science in a specific sense which is in keeping with the function that predictions have in the scientific enterprise. We have seen that hypotheses worthy of that name 'must allow the deduction of verifiable predictions, the fulfillment or non-fulfillment of which in critical tests will provide relevant information for judging the validity or acceptability of the hypothesis' (3; 1; 4). Stated otherwise, the function of the prediction in the scientific enterprise is to provide relevant information with respect to the validity of the hypothesis from which it has been derived. If we try to be more specific, and in particular to answer the question what is predicted, we find that the scientific prediction differs from the ordinary notion of 'prediction' in these respects.
1) A scientific prediction — in the context of a testing procedure — is always derived from a hypothesis. It has nothing of the oracular; it is not thrown off at random; nor is it based on hunches or on implicit theories; invariably a scientific prediction is a deductively derived specification of an explicitly formulated hypothesis. The hypothesis, in turn, does not spring full-blown from an empty void. Logically, it derives as a rule from a theory of more general import; empirically, it builds on findings from earlier investigations.
2) That which is predicted is always the outcome of a precisely specifiable
testing procedure. The prediction states what will in a certain respect be found if 'specific, i.e., pre-stated critical procedures are carried out on antecedently specified empirical materials' (cp. 3; 2; 2). No general precepts can be given for the nature and the amount of the materials, the number of observations or cases required for one prediction, or the operations that the raw materials of observation must undergo before the actual outcome is obtained. Much will depend on the hypothesis, with regard to which the prediction is to provide 'relevant information.' The prediction may refer to just one observation in a single crucial experiment, but equally to some laboriously computed outcome of the processing of extensive data, comprising numerous observations or cases.
3) A scientific prediction may be made with respect to events in the past as well as in the present or in the future. In this connection psychologists sometimes differentiate between 'prediction' and 'postdiction,' but this is not a fundamental distinction. The justification of the term prediction (= foretelling) is that, in principle, the verifying procedure is situated in the future and has a still unknown outcome, which can therefore be predicted. The outcome itself, however, may well be the result of events in the remote past. A historian, for example, will predict that, upon investigation of hitherto imperfectly studied texts, it will be found that in the year... Charlemagne attempted...; or a geologist will predict on the strength of a theory that, in certain strata of certain localities, certain fossils so many millennia old will be found. A closer resemblance to the predictions of everyday life is displayed by those scientific predictions which refer to future events. Predictions such as: 'Rain forecast for Sunday,' 'A business recession is at hand,' 'A, unlike B, will successfully complete his studies,' or even in crystal-ball fashion: 'A dark woman will cross your path' may actually derive from, and contribute to, the testing of theories or hypotheses — of a meteorological, economic, psychological or parapsychological nature, as the case may be. Still, there are differences. For one thing, a scientific prediction does not foretell the event itself but states that, upon the institution of a strict and objectively prearranged verificatory procedure, its overt manifestation will be ascertained. Hence it is always in the form: 'In an objective investigative procedure of this (...) nature, it will be found that (...).'
4) The precise wording, be it noted, is: 'In an (...) investigative procedure (...), it will be found that (...),' not: 'In this (...) investigative
procedure, etc.' What this formulation is meant to point up is that the investigative procedure whose outcome is predicted is considered to be in principle repeatable. Stated otherwise, a scientific prediction, all antecedent specification notwithstanding, still has a generalizing character. Even when in a concrete case attention is focused exclusively on the issue, here and now, of this particular procedure executed in this particular manner — e.g., with respect to certain observations of a solar eclipse or the outcome of forthcoming elections — 'this' procedure is still a specimen of a verification method. To be sure, it is often asserted that repeatability is peculiarly characteristic of the experimental natural sciences, and is hardly if ever met with in such disciplines as history and political science. Such assertions, however, refer to the repeatability of sampling procedures and experiments. This is indeed impossible when the universe under consideration is too limited (cp. 9; 2 and 9; 4). Even so, the verification process is still repeatable in the sense that it is assumed that the investigative procedure could have been carried out by another (similarly qualified) observer, or with other specimens of the instrument used, or at a different time. This form of repeatability is no more than a consequence of the requirement of objectivity: the outcome must not depend on the individual observer, the peculiar characteristics of one particular instrument, or on accidents of time (assuming of course that no particular point of time is specifically stipulated in the prediction). Every verification that follows the prescribed methods is legitimate.
5) Another consequence of the fact that a scientific prediction foretells the outcome of an investigative procedure is that it is never completely unconditional: the procedure may even be completely foiled because the situation in which it was to have been instituted does not materialize or cannot be made to materialize. It may be found that certain observations cannot be made with sufficient accuracy or cannot be made at all — in astronomy, for instance, because of weather conditions at the time of a major solar eclipse; in history because of gaps in certain newly discovered historically important texts; or, in the case of a statistical prediction, because it is found impossible to collect sufficiently extensive materials. Another possibility is that, on the face of it, the test proceeds quite satisfactorily, but that in the course of it fresh data come to light, which render the outcome, whatever it may be, totally inconclusive; that is, 'disturbing factors' have occurred.
In experimental tests of causal hypotheses, for instance, every precaution will as a rule be taken to eliminate any influence of factors other than those whose (predicted) effect is being investigated (cp. 5; 1; 2 and 5; 3; 2); but it may be found afterwards that these efforts have been in vain. Frequently, a difference in effect (possibly behavior) is predicted between two cases, two groups of subjects (experimental and control, respectively), two conditions (states), which exhibit a systematic difference in the causal factor under investigation, while they have in all other respects been standardized as rigorously as possible. Nevertheless, it may turn out that the 'ceteris paribus' condition implicit in the prediction is (still) not fulfilled. A special case is that in which the effect of the observer, or the investigative procedure itself, interferes with the proper observation under inquiry. A well-known, if rather primitive, example is the spectacular failure of the study on the effect of work breaks on hourly output by women factory workers in the first part of the Hawthorne experiments (ROETHLISBERGER and DICKSON (1939) 1949, Part 1). It was found that the effects produced on the employees by the investigative procedure itself — exposure to the limelight and a different social footing — were so great that they completely obscured any possible effect of the experimental factor (the variations in work breaks). For a further analysis of these problems we must refer to 5; 1, where the design of testing procedures is discussed. At this point it will suffice to note that a prediction is apparently always made in a conditional form. The conditions seek to ensure that the verificatory procedure shall proceed as planned in connection with the content of the hypothesis. If these verifiability conditions are not met, the question whether the prediction is fulfilled cannot be answered.
6) A prediction is of scientific interest only if its fulfillment or non-fulfillment provides relevant information with regard to the hypothesis from which it has been derived; a condition that does not apply to predictions in everyday life. Basically, this is obvious enough; the question, however, of what determines the relevance of the prediction is by no means easy to answer. In Chapter 4 we shall take up this question in more detail (see 4; 1; 3).
7) A scientific prediction must conform to strict logical requirements; in particular it must be strictly verifiable. This, too, marks a significant difference from predictions in daily life, requiring further analysis.
3; 4; 2 Verifiability conditions and verification criteria
The statement that a prediction must be strictly verifiable does not mean that it should be unconditional. We have seen, in fact, that every scientific prediction can only be verified under certain conditions. What the statement does mean is that certain standards must be met in the formulation of the prediction, the design of the testing procedure, and the advance arrangement of criteria regarding potential outcomes. These standards seek to ensure that, once the outcome, whatever its nature, has been obtained, the investigator can establish objectively and with certainty whether the prediction (a) has proved true, (b) has proved false, or (c) cannot be verified. For case (c) to be distinguishable from cases (a) and (b), the verifiability conditions discussed under 5) must have been antecedently fixed in explicit form. For (a) to be distinguishable from (b), once these conditions have been met, precise verification criteria must have been established in advance. To start with the verification criteria: only if these have been antecedently fixed in a strictly operational form can fulfillment and non-fulfillment of the prediction be differentiated with certainty. These criteria mark off a 'range of positive outcome,' that is, the set of all situations or events that are taken to establish the prediction as 'proven true' (cp. VAN DANTZIG 1952, p. 197). Sometimes, in addition, a 'range of negative outcome' will be defined. If the outcome falls within this range, the prediction is considered to have been proved false. As we have seen in 3; 2; 3, these ranges need not be contiguous; in certain cases a 'no man's land' in between is distinguished. If the outcome falls within the latter area, the decision is deferred. This means, in effect, that the verifiability conditions have not been met (case c). If a prediction is made in terms of quantitative values that may be found for a variable, the ranges of positive and/or negative outcome have the character of intervals. Thus, in the case of statistical predictions, where it is sought to prove the existence of a causal factor from its effect, the interval of positive outcomes is defined by the limits outside which the null hypothesis is to be rejected. It is common practice that such limits are fixed by selecting in advance a conventional level of significance: e.g., 5%, 1%, or .1% risk of error in rejecting the assumption that the null hypothesis holds in the universe under consideration. Though naturally a judicious choice will be made, it remains nonetheless
93
3. F O R M U L A T I O N :
A. T H E D E D U C T I V E
PROCESS
arbitrary. At all events, once it has been made, there has been created an interval of positive outcome, and thus a verification criterion. Any outcome falling within it stamps the prediction as 'proven true.' Similarly, the prediction of the magnitude of an effect, in testing a quantitatively formulated hypothesis, will often call for the adoption of an interval of positive outcome. A prediction may be precise to a greater or lesser degree; this precision (or rather imprecision) can be expressed in some probabilistic measure for the magnitude of the interval of positive outcome (VAN D A N T Z I G 1952, p. 197). The requirement that verification criteria be antecedently fixed, therefore, implies also that the degree of precision of a given prediction must be carefully weighed in advance. As for the verifiability conditions, the requirement that they be precisely formulated in advance cannot be fulfilled in a literal sense. This would mean that each and every manner in which the verificatory procedure could possibly miscarry would have to be set down beforehand. This, however, is not only impossible but also unnecessary in such an extreme form. First of all, some verifiability conditions are so self-evident that they do not require explicit formulation. Whenever an instrument is used, it is natural to assume that it works, that it registers correctly and has been properly standardized. It will also be assumed that no clerical errors are made, that calculations, whether by human agents or by computers, are correct, and that the protocols and records produced by both human observers and instruments provide adequately reliable data. To be sure, all these details will be a source of constant concern and vigilance on the part of the investigator, but they do not require explicit formulation. Secondly, a somewhat similar comment applies when the situation to which the prediction refers is not realized. Needless to say, the prediction then cannot be verified. Difficulties will occur only when doubt arises: is this a situation as envisaged in the hypothesis (theory), so that it provides a genuine test of the prediction derived from it? This problem leads up to our next point. If the theory or hypothesis from which the prediction has been derived is well formulated, it will shoulder a good deal of the burden. A well formulated theory (hypothesis) states to which situations it does and to which it does not relate (3; 1; 5). Sometimes the formulation of, for instance, a causal hypothesis from which a predicted difference between 94
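The threefold decision scheme can be made concrete in a short program sketch. The rendering and the interval bounds below are ours, invented for illustration; the point is only that the ranges are fixed in advance of the outcome.

```python
# A minimal sketch, assuming the prediction concerns a single quantitative
# variable and that the prestated ranges are simple intervals (hypothetical
# bounds chosen for illustration only).
def classify_outcome(x, positive, negative):
    """Return case (a) 'proved true', (b) 'proved false', or (c) deferred."""
    if positive[0] <= x <= positive[1]:
        return "proved true"          # case (a): range of positive outcome
    if negative[0] <= x <= negative[1]:
        return "proved false"         # case (b): range of negative outcome
    return "decision deferred"        # case (c): the 'no man's land' between

# Predicted value between 10 and 12; values of 15 or more count as refuting.
print(classify_outcome(11.4, positive=(10, 12), negative=(15, float("inf"))))
print(classify_outcome(13.0, positive=(10, 12), negative=(15, float("inf"))))
```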
As for the verifiability conditions, the requirement that they be precisely formulated in advance cannot be fulfilled in a literal sense. This would mean that each and every manner in which the verificatory procedure could possibly miscarry would have to be set down beforehand. This, however, is not only impossible but also unnecessary in such an extreme form. First of all, some verifiability conditions are so self-evident that they do not require explicit formulation. Whenever an instrument is used, it is natural to assume that it works, that it registers correctly and has been properly standardized. It will also be assumed that no clerical errors are made, that calculations, whether by human agents or by computers, are correct, and that the protocols and records produced by both human observers and instruments provide adequately reliable data. To be sure, all these details will be a source of constant concern and vigilance on the part of the investigator, but they do not require explicit formulation. Secondly, a somewhat similar comment applies when the situation to which the prediction refers is not realized. Needless to say, the prediction then cannot be verified. Difficulties will occur only when doubt arises: is this a situation as envisaged in the hypothesis (theory), so that it provides a genuine test of the prediction derived from it? This problem leads up to our next point.
If the theory or hypothesis from which the prediction has been derived is well formulated, it will shoulder a good deal of the burden. A well formulated theory (hypothesis) states to which situations it does and to which it does not relate (3; 1; 5). Sometimes the formulation of, for instance, a causal hypothesis from which a predicted difference between
two states, two cases or groups has been derived, will include an express reference to the verifiability condition: 'provided that, on the one hand, a sufficiently marked difference in the experimental factor and, on the other hand, sufficient control over all other factors can be experimentally realized.' The ceteris paribus clause is thus part and parcel of the prediction. If the experimenter does not succeed in carrying it through in practice, the prediction (and ultimately the theory or hypothesis) cannot be blamed.
It is to be noted, however, that in the behavioral sciences even the major potential disturbing factors can but rarely be suppressed to anything like the same extent as is commonly possible in the physical sciences. At best, they can often be optimally 'randomized,' that is, systematically left to chance (cp. 5; 1; 2 and 5; 3; 2). As a result, there are, unfortunately, fairly frequent deadlocks, in which it proves impossible to discriminate between cases (b) and (c) or between (a) and (c). There is no way of telling, then, whether or not the verifiability conditions have in fact been met, and thus whether the result is to be taken seriously as the outcome of a verificatory procedure. However, it will have become apparent that avoidance of such an impasse is not only a matter of formulation of the (theory-hypothesis-)prediction, but above all a matter of investigative (or experimental) technique.
We can now rephrase our earlier observation about the burden to be borne by the theory (hypothesis) as follows: if the impasse of an ambiguous verification outcome is clearly due to deficiencies in the formulation of the prediction, these deficiencies often argue inadequate theory (construction). Either there is no theory available, or its implications have not been worked out in sufficiently precise detail; the verifiability conditions cannot be inferred from it. By way of illustration, a few simple examples are given below, with brief comments. If the reader is so inclined, he might analyze for himself in greater detail where the shoe pinches.
3; 4; 3 Lack of falsifiability and other shortcomings
The most common form of non-verifiability of a prediction — discounting inadequate provision for verification criteria — is that in which it proves impossible to discriminate (b) clearly from (c). The prediction can indeed be verified if it is fulfilled — (a) is distinguishable from (b) and (c) — but not if it remains unfulfilled; it is not falsifiable.
Example 1: 'This politico-economic system — capitalism or communism — will break down in the long run.' If we assume that this is a 'prediction,' derived from a socio-economic theory, it is as such clearly deficient. If the system breaks down, e.g., because within a few years it is overthrown or radically changed, this fact can quite easily be established objectively, provided that the verification criteria have been properly fixed. The prediction is therefore verifiable if it is fulfilled. However, even if it is not fulfilled within the next few years, it may still be fulfilled 'in the long run.' Since no time limit has been set, the 'prediction' is not falsifiable; therefore it is not a scientific prediction.
Example 2: 'Jack undoubtedly has sufficient capacities to finish the Gymnasium; if he outgrows his present personal problems, he will certainly graduate.' If such a statement is made as a genuine prediction — and not, for instance, as a counseling 'ploy' in an interview with Jack's parents — the antecedent clause states a verifiability condition: if he does not outgrow his problems, verification of what is stated in the consequent clause is impossible. This is in itself perfectly legitimate, provided that fulfillment and non-fulfillment can be precisely and objectively differentiated. That, however, is not the case with such a vague formulation as 'outgrow his problems.' As a result, the prediction is indeed capable of being clearly fulfilled — if Jack graduates, this proves that he 'has sufficient capacities' — but its non-fulfillment cannot be clearly established. For, if he fails to graduate, it will hardly be possible to determine whether we are dealing with case (b) or (c). Such 'predictions' are safe, because they can never go wrong; they are not falsifiable and, in consequence, scientifically unacceptable.
Example 3: In the same example another complication may occur. Suppose there exists virtual certainty that Jack will not go to the Gymnasium but to a Trade School, or that Jack's problems are such that it is highly unlikely he will 'outgrow them' in the next few years. Then, too, the 'prediction' is safe, this time because it is tied to a verifiability condition which in all likelihood will not be met. As far as statistical predictions are concerned, the following statement applies: 'In the ultimate prediction conditional probabilities involving conditions which themselves have zero probability must be avoided' (VAN DANTZIG 1952, p. 196).
Example 4: Special difficulties will be encountered in endeavors to
derive predictions from depth psychological hypotheses of this general type: 'Phenomenon A is always accompanied by (or: always arises from) psychic state B; the latter may be conscious or unconscious.' Hence may be derived the prediction: 'On analysis of any A case (along prestated lines) there will always be found the pattern associated with B.' Here, too, there is no problem if the B pattern is found in a clearly marked form; case (a) is distinguishable from (b) and (c). If B is not found at all, or not very distinctly, the position is more difficult. For B may be 'unconscious,' and perhaps so deeply unconscious that it escapes detection by the prescribed technique. It is always possible to appeal to a 'deeper level,' which could not be reached by the verificatory procedure. Unhappily, this way out all too often assumes the character of a subterfuge (DE GROOT 1950a). Such 'hypotheses' and 'predictions' are not falsifiable and, therefore, not scientifically acceptable. The only way to make them acceptable is to set up an operational criterion to distinguish between the cases B (unconscious) and non-B. This, however, is all too often omitted.1
It may happen, for instance, that the aggressiveness 'found' in the Rorschach test, or a theoretically expected mother fixation, cannot be identified in the subject's manifest behavior. We often see the unwarranted assertion made that the aggressiveness is 'unconscious,' or that the fixation does exist 'at a deeper level.' The universality of, for instance, the Freudian hypothesis of the Oedipus complex, or of the Adlerian attribution of every behavior problem in a child to its feelings of inferiority, is — and of course always can be — upheld by non-differentiation between case (b), non-fulfillment, and (c), verifiability conditions not met.2 Thus manipulated, such general contentions are not falsifiable; they are not scientific hypotheses, but at best 'interpretational schemes' (cp. 2;2;5).

1 Worse still is the situation arising when theories and hypotheses of an investigator or school — particularly those aligned with 'dynamic' movements — are modified so often, and 'date' so quickly, that every attempt to test them is always 'too late.' Like the flight to the 'deeper level,' this may assume the character of evasion of scientific testing (cp. DE GROOT 1956a).
2 The latter example has a certain historic significance: K.R. POPPER was led to assign central importance to falsifiability as a requirement for, and indeed as the hallmark of, scientific theories (1934) after he had for some time worked in Vienna under Alfred Adler (Popper in an unpublished paper read at the Significs Congress, Bussum, 1946).

This discussion of the scientific prediction concludes our examination
of the major fundamental aspects of the deductive process. For a more detailed example we must refer to Chapter 5. First, however, the question arises of the 'way back,' that is, the manner in which, through testing and evaluation, the results of empirical investigations affect hypotheses and theories. We must also be familiar with the basic features of this process (see 4;1 and 4;2), if we are to draw conclusions with regard to the requirements to be imposed on the formulation of theories and hypotheses (4;3).
CHAPTER 4
FORMULATION OF THEORIES AND HYPOTHESES
B. CONFIRMATION
4; 1 CONFIRMATION OF HYPOTHESES
4; 1; 1 Deterministic hypotheses
Having dealt with the deductive process (phase 3), we must now embark on a discussion of the processes of testing and evaluation (phases 4 and 5), again with a view to answering the question concerning the requirements to be imposed on the formulation of theories and hypotheses. For the present, the more technical aspects of hypothesis testing, particularly on the experimental and statistical side, may be disregarded. The paramount question is: how is the confirmation of hypotheses and theories brought about? How do the results of investigations lead to conclusions regarding the validity and/or value of the hypotheses and theories for the testing of which they were undertaken? Our concern is accordingly with the way back: how do observational or experimental findings react upon the theory?
First, we shall examine the case of a deterministic hypothesis. If it is of the positive universal type, it can be reduced to the basic pattern: 'All A's are B' (cp. 3;2;3). In this category are included many relationships conceived as causal. B, for instance, will be a presumably necessary consequence of A (e.g., the fatal issue of a given incurable disease), or a necessary condition (e.g., the phenomenon of the business slump occurs only in a capitalist system); or A and B are both consequences of C (e.g., in genetics: invariably linked characteristics or properties). It is by no means necessary that the relationship be a causal one; actually, however, this is often assumed if one precisely identifiable phenomenon (A) is invariably coupled with another (B).
In 3;2;3 we have seen that any one A can now serve as a test case for
a prediction derived from such a hypothesis. But what significance does the outcome have with regard to the contents of the hypothesis? This will evidently depend on the outcome: whether the A investigated is really B. If the answer is in the negative, the hypothesis is clearly disproven; if positive, the hypothesis is by no means proven. A single case of positive outcome is evidently insufficient; even a large number of positively verified predictions does not constitute firm proof.
Only by investigating the whole universe of A cases can absolute certainty be obtained that every A is in fact B. Occasionally such proof can be furnished, notably when the universe of A cases is finite, not unduly large, and accessible to verificatory inquiry. It will be obvious, however, that the primary concern of science is to obtain generalizations; generalizations, that is, pertaining to partially inaccessible, or very large, or unbounded (or infinite) universes. Among this last group are to be reckoned all deterministic hypotheses that can be tested by means of illimitably repeatable experiments. Evidently, the validity of the general proposition cannot be logically deduced from the proven truth of specific consequences, however large their number may be. Accordingly, a positive, universal, deterministic hypothesis pertaining to a partially inaccessible or virtually unbounded (or infinite) universe is by nature incapable of being verified, if it is true; what can be verified is only the fact that it is false. Stated more simply: it cannot be positively verified, in the literal sense of being proved true; but it can be refuted or falsified.
As for the (deterministic) existential hypothesis ('There is at least one A that is B'), we have already seen that it is in general equivalent to the negation of a positive universal hypothesis: 'It is not true that all A's are non-B.' The reverse applies here: such a hypothesis can be positively verified — a single A case that is B is sufficient — but it cannot be falsified. The invalidity of the hypothesis cannot be logically deduced from no matter how many A cases that are non-B.
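The asymmetry admits of a compact sketch (ours, not the author's); the predicate is_B and the case collections are hypothetical stand-ins for actual verificatory procedures:

```python
# A minimal sketch of the asymmetry between the two hypothesis types.
def test_universal(a_cases, is_B, whole_universe_inspected=False):
    """Test 'All A's are B' against the A cases actually examined."""
    for a in a_cases:
        if not is_B(a):
            return "falsified"        # one non-B case refutes the hypothesis
    # No counterexample: proof only if every A in the universe was inspected.
    return "verified" if whole_universe_inspected else "so far unrefuted"

def test_existential(a_cases, is_B):
    """Test 'There is at least one A that is B': the mirror image."""
    for a in a_cases:
        if is_B(a):
            return "verified"         # one B case proves the hypothesis
    return "so far unconfirmed"       # a finite search can never falsify it

print(test_universal(range(100), is_B=lambda a: a < 200))    # so far unrefuted
print(test_existential(range(100), is_B=lambda a: a == 50))  # verified
```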
The contrast between these two types of hypotheses may also be expressed thus. In both cases we are dealing with universal hypotheses of the type: All A's are P (P being B or non-B, respectively, but this is not a fundamental difference). In the one case, the investigator would like to prove this general proposition; in the other, to refute it. As pointed out by K.R. POPPER ((1934) 1959), this does not make a great deal of difference for the methodology of scientific inquiry. Anyone wishing to refute a given general hypothesis will undoubtedly search for A cases
that are non-P; but anyone wishing to prove it will do the very same thing, albeit in hopes of not finding them! A well-designed scientific testing procedure, in fact, always aims at falsification. A tenable position is that empirical scientific inquiry does not seek to prove (deterministic) theories and hypotheses — for this is impossible — but to refute them, and that the progress of science is in fact based on such falsifications.1 This view has a lot to commend it. Since verification of a deterministic hypothesis of the positive universal, that is, of the most fecund and productive, type is impossible, there is in fact no better alternative than to expose it to the most rigorous testing. If it bears up under the ordeal, we have all the more reason to continue our vote of confidence, though not necessarily beyond the next test that comes along. If it breaks down, we are compelled to make a further move by trying out a new hypothesis.
The upshot of our discussion thus far is this: 1) a scientific investigation undertaken to test a deterministic hypothesis must always aim at falsification — either of the hypothesis itself or of an alternative hypothesis; 2) 'falsifiability' itself is a most important desideratum, not only for predictions (3; 4; 3), but also for deterministic hypotheses and theories.

1 'We should ring the bells of victory every time a theory is refuted' — this lyrical utterance will not be found in the literature listed; it originates from the above-cited unpublished paper by POPPER (1946).

4; 1; 2 Probabilistic confirmation and probabilistic hypotheses
The unquestionably sound principle of requiring the most rigorous testing of hypotheses still leaves us with the problem of how to weigh this 'rigor.' It would be expedient to have some kind of measure for the confirmation value of a (positive) testing outcome. In some cases, this problem may be solved by means of an approximation in terms of probabilities, which, for the category 'All A's are B,' would proceed something like this.
Let it be supposed, first, that it is possible to choose A-test cases at random, so that any one A within the universe has the same probability of being chosen. Let it further be supposed — if only for want of a better assumption — that, if our causal hypothesis is false, there will be in the universe as many A cases that are B as are non-B. On the strength of this latter assumption, which for the present we shall adopt as our null hypothesis (cp. 3; 2; 1), there will be an equal probability of a random A being B or non-B. If we now proceed to investigate (random) A cases and find
successively 1, 2, 3, 4... etc. A cases that are all B, it becomes more and more improbable that the null hypothesis is correct. It is possible, for each successive B finding, to calculate exactly the probability that this (or a still greater) deviation would be found if the null hypothesis were correct (equal numbers of B and non-B). If, now, the actual investigation of successive cases produces the result that this upper limit of probability is smaller than a pre-established conventional value (e.g., P = .01, that is, a probability of one in a hundred of this outcome being found if the null hypothesis is correct), the investigator may decide to reject the null hypothesis — thus accepting a 1% chance of making an erroneous decision. If greater certainty is demanded, a more stringent significance level may be adopted, e.g., P = .001.
Alternatively or conjointly, a more drastic null hypothesis may be formulated, e.g., 'In the universe there are 90% A cases that are B, and 10% that are non-B.' If, in a renewed investigative procedure, likewise employing prestated conventional significance criteria, e.g., again a significance level of P = .01, this null hypothesis is again 'refuted' (in favor of more A's that are all B), the outcome means that the investigator may 'confidently assume' that more than 9 out of 10 cases in the population are B. The measure of 'confidence' is determined by the assumed P = .01. If necessary, the confirmation value of the findings may be stepped up further, thus bringing about an ever closer approximation to the hypothesis whose confirmation is actually being sought (All A's are B).
This approximative procedure may seem absurd if it is believed that a strictly causal relationship obtains. In fact, this particular form — in which consistently, indeed exclusively, A's are found that are all B — will hardly ever be encountered. It assumes major significance, however, as soon as certain complications arise, e.g., in determining whether a given case is really B or non-B. Possibly, the instrument used, or the judge called upon to render a decision, may not be entirely reliable; or it may be that a strictly deterministic relationship has been assumed, whereas the operational definition that must be employed to establish the distinction between B and non-B is no more than an approximation of the distinction-as-intended (cp. 3;3;5); or again, it may be that some other demonstrable, relatively insignificant but insuppressible, factor upsets the procedure (cp. 3;4;2).
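For the simple sampling scheme just described, the calculation is elementary. The sketch below is ours; its only assumption is the customary binomial model for random sampling:

```python
from math import comb

# Probability of k or more B's among n random A cases, if each A has
# probability p0 of being B under the null hypothesis (binomial model).
def p_value(n, k, p0):
    return sum(comb(n, i) * p0**i * (1 - p0)**(n - i) for i in range(k, n + 1))

# Seven successive A cases, all B, against the null hypothesis p0 = .50:
print(p_value(7, 7, 0.5))    # 0.5**7 = .0078 < .01, so the null is rejected
# The 'more drastic' null hypothesis of 90% B cases in the universe:
print(p_value(44, 44, 0.9))  # 0.9**44 = .0097 < .01, rejected again
```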
This state of affairs — a deterministic hypothesis which theoretically should, but in fact cannot, result in 100% B findings, because the assumed cause or the assumed effect cannot be clearly segregated from other causes or effects — is of frequent occurrence in the behavioral sciences. In the circumstances, 100% B cases are beyond what may in reason be expected, but it is still possible to investigate the existence, and the strength, of the real A-B relation by seeking to reject null hypotheses in the manner described above. In such cases, the hypothesis whose confirmation is actually being sought will frequently be formulated as a probabilistic hypothesis: 'Most A's are B,' or 'Any one A has 80% probability of being B,' and the like. In this case, the definition of B (and non-B) may indeed be tied to a particular measurement technique (or approximative operational definition). In practice, such a 'failed' deterministic hypothesis is often difficult to tell from a 'genuine' probabilistic hypothesis, in which the operation of a chance process is, in so many words, assumed (e.g., in genetics, in the transmission of genes).
It is characteristic of probabilistic hypotheses that exact falsification of a (positive) hypothesis is no longer possible. Obviously, a single 'contrary case' does not suffice to refute a statistical relationship. The difference between positive and negative hypotheses, and between strict verification and falsification, becomes a relative one. Probabilistic hypotheses of whichever type can be neither proved (strictly verified) nor refuted (falsified). At best, they can be confirmed by means of probabilistic confirmation criteria of the kind described above.
Mutatis mutandis, the above statements made with regard to relatively simple hypotheses apply to hypotheses of a more complex structure as well. For these, too, it is often possible to set up probabilistic confirmation criteria, though understandably there will be some complications involved. A hypothesis of this kind must first be explicitated into consequences of a simpler structure. It is then possible to select certain conventional confirmation criteria for each of the more specific consequences. Thereupon, a judiciously chosen combination formula will be devised to determine in which cases the hypothesis is to be considered positively confirmed, in which disconfirmed, and in which the decision is to be deferred.
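The text prescribes no particular combination formula. As a purely hypothetical illustration, one such convention might count the per-consequence outcomes as follows:

```python
# A hypothetical combination formula for a complex hypothesis that has been
# explicitated into simpler consequences, each tested against its own
# prestated criteria. The thresholds here are invented conventions.
def combine(results, min_confirmed=3, max_disconfirmed=0):
    """results: list of per-consequence outcomes, 'pos', 'neg' or 'deferred'."""
    if results.count("neg") > max_disconfirmed:
        return "disconfirmed"
    if results.count("pos") >= min_confirmed:
        return "positively confirmed"
    return "decision deferred"        # too few consequences could be verified

print(combine(["pos", "pos", "deferred", "pos"]))   # positively confirmed
```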
The great importance of prestated confirmation criteria, generally created by deliberate convention, will have become clear by now. Scientific usage requires that the investigator, before starting his investigative procedure, tie himself down to definite criteria for the confirmation and disconfirmation of the hypotheses he intends to test. Thus, on the one hand,
he will not be tempted to gloss over, or 'dress up,' his results afterwards, while on the other, the outcome of the entire investigation is molded into a verifiable prediction (cp. 3;2;3). Advantages accrue from the latter aspect also for repeated tests of the same hypothesis on fresh samples (replication). For, the possibility is thus created of making a fresh count (of positive (B) versus negative (non-B) outcomes) — in hopes of finding that 'All A's prove to be B.' In scientific practice, no confirmation argument has stronger probative value than this: that so far a given predicted relationship has been found again and again, without any exceptions. This, too, can be expressed in terms of probabilities.
It will be clear that, unlike the falsification of a deterministic, universal, positive hypothesis, the positive confirmation of the great majority of hypotheses of the major types is not logically compelling. Customary confirmation methods result, at best, in a favorable probability statement, though admittedly the latter may be expressed in terms of precise, antecedently fixed criteria. Such a statement, however, can never compel the investigator to regard the hypothesis as valid; in order to become effective, it must be complemented by the investigator's decision to accept the risk of error involved. So long as there is a risk of error, however small and however precisely delineated, the argument in favor of the hypothesis can never be logically 'necessary.' A hypothesis is not proved; optimally, it finds universal acceptance with the forum.
Obviously, such acceptance will be forthcoming most readily when the risk of error is small. But it is impossible to express this in fixed conventional standards. The acceptability of a given risk of error does not depend only on the calculated or estimated magnitude of the risk itself. It also depends on such factors as the content of the hypothesis, its relations to other hypotheses, its place in a theory, i.e., its so-called 'embeddedness.'
An interesting example is, again, the presumed existence of parapsychological phenomena (telepathy and clairvoyance). In some investigations, the most stringent confirmation criteria have undoubtedly been met. The possibility of the phenomena encountered being products of chance, rather than effects of extrasensory perception, is exceedingly small (cp., e.g., SOAL and BATEMAN 1954, p. 311). Nevertheless, the forum debate continues, not least with regard to precognition (i.e., the ability to predict future events), since the contention of the hypothesis seems so hard to reconcile with our ordinary conception of the world.
We must add that, in general, the forum debate — if and when this
manifests itself in any concrete form — will be concerned with theories, rather than with solitary hypotheses. Basically, however, the problems encountered in the confirmation of theories (and interpretations, cp. 9;2) are no different from those obtaining with regard to hypotheses, although it is true that the process is less transparent, and complications are compounded. One complication is that it is rarely possible to calculate exactly the risk of error involved in the acceptance or rejection of a theory as a whole; another, that the decision — acceptance or rejection — depends on many other factors. For a more detailed discussion, the reader is referred to 4;2.

4; 1; 3 Relevance of predictions
Whenever scientific verificatory procedures are to be carried out in practice, it is of the utmost importance that the investigator make an advance appraisal of the potential confirmation value of the outcome of the prediction which he is about to verify — on two counts: first, with regard to the hypothesis from which the prediction is directly derived, and second, with regard to the theory or theories that he seeks to investigate. There is a variety of ways in which, from one and the same theory, hypotheses can be derived, and the same holds for the derivation of predictions from a hypothesis. The investigator is free in his choice of a particular ramification, and likewise in the design of his testing procedure, provided it is fixed before the prediction is formulated. How can he ensure that the prediction has maximum 'relevance,' i.e., that its outcome has optimal confirmation value with regard to the theory or hypothesis?
Again, we shall for the present pass by the technical aspects of this problem — experimental design and so on (cp. 5;1) — and confine our attention to the question of what determines the relevance of a prediction. This, too, is a question that cannot be answered by means of a formula; a few general remarks are in order.
A factor of considerable importance is the measure of particularization that has entered into the passage from theory to prediction. As we have seen in 3;2;1, this particularization may be the result of strictly logical deductions (type pl), on the one hand, and of not always 'logically necessary' empirical specifications (type ps), on the other. Consequently, the narrowing of the scope of the original contention may be considerable. The investigator tests only one out of many logical consequences, or he often works with a limited selection of materials, or a narrow operational
definition, etc. — so that the net support contributed to the theory by the outcome is small. For instance, an investigator will work out one particular consequence of the complex theoretical system of psychoanalysis and demonstrate experimentally that, in certain conditions of emotional stress, 'repression' will occur. He has then found something which may be important in itself, but whose significance for the body of psychoanalytic theory is quite limited (cp. HILGARD, KUBIE, LAWRENCE, PUMPIAN-MINDLIN 1952, e.g., pp. 36-45, and, e.g., ERIKSEN 1954, on 'perceptual defense').
Another quite obvious factor is the degree of precision of the prediction. If this is small, its fulfillment may be virtually 'meaningless,' that is, it will add nothing new to what we knew already, or might expect on the strength of chance alone. For example, a new economic theory generates a prediction that, in a given year, a given index will exhibit a value between 130 and 140, and this is found true; but it is also found that an older theory, or a simpler theoretical model, would have achieved the same result with an interval of positive verification (cp. 3;4;2) that is no larger. The confirmation value of the positive outcome is therefore small; the prediction was not very relevant.
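To make the point about precision concrete, one can attach a crude weight to a fulfilled prediction, higher as the prestated interval of positive outcome is narrower. The measure below is invented for this illustration and uses the figures of the index example:

```python
# A deliberately crude, invented precision weight: a fulfilled prediction
# counts for more as its interval of positive outcome is narrower.
def precision_weight(outcome, interval):
    lo, hi = interval
    return 0.0 if not (lo <= outcome <= hi) else 1.0 / (hi - lo)

new_theory, older_theory, outcome = (130, 140), (130, 140), 136
print(precision_weight(outcome, new_theory))    # 0.1
print(precision_weight(outcome, older_theory))  # 0.1: no larger an interval,
# so the new theory's fulfilled prediction adds little confirmation value
```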
Naturally, the relevance of a prediction will be greater as the special assertion or assumption, which the prediction seeks to test, is more basic to the theory from which it derives. But what constitutes a basic assumption? In terms of the nomological network, an assumption which itself affects many of its deductive strands is more basic than one which does not. If the investigator succeeds in elaborating and testing such a fundamental assumption, the confirmation value of the outcome, and hence the relevance of the prediction, may indeed be considerable, particularly if the latter is not fulfilled.
Thus, for instance, anthropometric race theories were built largely on certain skull measurements, which were regarded as constituting reliable race characteristics. A fundamental assumption was that such measurements — being race characteristics — remained, on the average, constant over the generations within one racial group. Studies of immigrants, however, showed that, upon immigration, quite appreciable changes are manifested. Thus, a large portion of the basis of the theories collapsed (see, e.g., FISCHER 1924; SHAPIRO 1939; BOAS 1940). The (negative) outcome had great confirmation value, since the constancy hypothesis was fundamental and the derived prediction relevant.
As far as testing is concerned, another criterion may be applied to assess the importance of a theoretical assumption. An assumption or hypothesis in a theory is called critical if it is incompatible with an assumption in a rival theory. The areas of conflict between two rival theories often afford good starting-points for the articulation and empirical (experimental) realization of relevant hypotheses. In the ideal case, one prediction must be non-fulfilled if one theory is correct, and be fulfilled if the other is valid. Aside from outcomes falling within the 'no man's land' (cp. 3;4;2), in which case the prediction is considered non-verifiable, something at any rate is refuted, or at least (negatively) confirmed.
For instance, theory A generates a prediction of increased achievement — in no matter what field — under a particular condition; theory B generates a prediction of diminished achievement. Or, according to theory A, teaching method (a) is most effective; according to theory B, method (b). If a satisfactory objective criterion can be devised to measure such effectiveness, the methods, and in consequence the theories, can be experimentally matched against one another.
Even in the absence of express competition between two theories, the confirmation value of an outcome is largely determined by the contribution it makes to the refutation, or rejection, of alternative hypotheses or theories; compare the earlier discussion of operations involving a null hypothesis in 4;1;2. Accordingly, our conclusion can be phrased thus: a prediction will be the more relevant as its outcome offers better prospects of disposing — by way of refutation or well-founded rejection — of still current (alternative) hypotheses; and the more fundamental these hypotheses are, the better.
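Schematically, such a critical test can be rendered as follows. The achievement-change intervals are invented; the middle band is the 'no man's land' in which nothing is decided:

```python
# A sketch of a critical test between rival theories A and B, with invented
# intervals for the predicted change in achievement under the condition.
def critical_test(change, a_range=(5, 50), b_range=(-50, -5)):
    if a_range[0] <= change <= a_range[1]:
        return "favors A (refutes, or negatively confirms, B)"
    if b_range[0] <= change <= b_range[1]:
        return "favors B (refutes, or negatively confirms, A)"
    return "no man's land: the prediction is considered non-verifiable"

print(critical_test(12))   # favors A (refutes, or negatively confirms, B)
print(critical_test(-2))   # no man's land: considered non-verifiable
```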
4; 2 ACCEPTANCE AND REJECTION OF THEORIES

4; 2; 1 Refutation of theories
In principle, any theory permitting of strictly logical deduction of one or more universal deterministic hypotheses can be refuted or falsified. All that is needed for the refutation of such a derived hypothesis (All A's are B) is that one A is shown to be non-B, and once this strictly logical implication has been refuted, the entire theory collapses. The foregoing will have made it clear that this is in fact the ideal 4;2;1
pursued: cogent falsification of a theory — by means of one observation, or one experimentum crucis, i.e., one critical, decisive experiment. Since generalizations are generally incapable of empirical proof (positive verification; cp. 4;1;1), the strategy of empirical hypothesis testing must of necessity be bent upon elimination, rejection, refutation — the prototype for which is the single, crucial test. This particular form, however, is of but rare occurrence. As a mode of reasoning, it is transparent enough and indeed practiced frequently: if such-and-such a theory is valid, then this hypothesis must hold: All A's are B's; however, this A is non-B, so the hypothesis and the theory are false.
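In modern logical notation (a rendering that is ours, not the author's), the argument is the familiar modus tollens:

```latex
\[
\bigl(T \rightarrow \forall x\,(A(x) \rightarrow B(x))\bigr)
\;\wedge\;
\exists x\,(A(x) \wedge \neg B(x))
\;\;\vdash\;\; \neg T
\]
```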
Nevertheless, one would be hard put to produce examples where this argument proved really decisive. The simplest cases are no doubt those in which a phenomenon or event occurs which, according to a given theory, ought to have been impossible. Such cases do occur now and then, also in the behavioral sciences. For instance, sociopolitical theories about the democratic system and/or economic ones concerning 'free enterprise,' which imply that in a democracy education is more efficient, science more productive, and the standard of living higher than under a dictatorship — a possible negative case being furnished by developments in, say, Russia. Or, a political science theory concerning voting behavior, from which the conclusion can be drawn that under such and such conditions a given party cannot win an election — when this is the very thing that happens the next time round. Or, one of the many race theories in which, for instance, the fact that (urbanized) European Jews seldom took kindly to agriculture or the army was attributed to race characteristics — clinching counterevidence being provided by the way their children have developed in modern Israel.
Similarly, an archeological find will sometimes shatter a cherished historical theory of long standing. Thus, for instance, the recently excavated foundations of a wooden gatehouse in Nijmegen, Holland, dating back to the reign of the Emperor Augustus, necessitated revision of a current theory concerning the Roman past in these parts (VAN BUCHEM 1961). Critical experiments will be found, for instance, in the psychophysiology of perception. In this domain it has occurred more than once that, as a result of new experimental findings, an older view, an earlier model of the perceptual process, was found inadequate — for instance, Wertheimer's study on the visual perception of apparent movements and
other early Gestalt psychological experiments1 (WERTHEIMER 1925; cp. also studies on color constancy, e.g., GUILLAUME 1937, pp. 101-105).
How is it that such wholesale refutations of a theory by one observed case or one experiment are so relatively rare, not only in the behavioral sciences, but also in the physical sciences? First of all, a number of practical factors come into play here. The observational process — ascertainment that this A is non-B — may be quite complicated. For instance, documents may have to be deciphered, or a large number of data concerning the event may have to be collected and processed, or the experimental results may require a vast amount of computation, and the like. Furthermore, communication in the scientific world is by no means perfect; other investigators may not (yet) know or fully understand the finding, or they may refuse to attach belief or importance to it. Because of all these factors, the social process by which the scientific world, and thus the 'forum,' must be informed and convinced may be slow and far from spectacular,2 also in cases where the conclusion that the theory must fall is inevitable.
But — and for our purpose this is more important — the latter conclusion is by no means always inevitable. For one thing, doubts may arise on a variety of points: whether the observations were adequate; whether there was interference by disturbing factors (preventing discrimination between cases b and c; cp. 3;4;2); whether the results were correctly interpreted (was this in fact a non-B case?); whether the case was rightly subsumed under the hypothesis (was this in fact an A?); whether the hypothesis itself was logically derived from the theory, and so forth.

1 As far as experiments are concerned, it will not as a rule be necessary, or even usual, to base the rebuttal literally on one case, since the experiment can be repeated. Still, here too it is fairly common to speak of 'one case on which the theory breaks down' in the sense of: one specific hypothesis or prediction (cp. 3;4;1 sub 4).
2 It might even be maintained that this social process will not materialize unless, apart from the investigator (who publishes his results), there is at least also a promoter, a 'pacemaker' — the same individual or another. Thus, Bernheim's experiments on post-hypnotic suggestion have established that it is possible to make a subject in his normal state, after hypnosis, perform an action without having an inkling of his real motivation, i.e., the suggestion made during hypnosis. In terms of refutation: the old assumption that our actions are either 'meaningless' or motivated (determined) in a manner that we can know proves untenable. It took a long time, however, before the fundamental importance of this finding gradually came to be appreciated, mainly through Freud's work (cp. FREUD 1940, pp. 286-287).
Secondly, it is often possible to salvage a theory by introducing some comparatively minor modifications: by so restricting its empirical references that the falsified hypothesis is no longer encompassed; or by postulating some other special condition; or possibly by advancing an ad hoc hypothesis (cp. 2;1;6, footnote p. 43), which takes care of the contradictory evidence, and the like. Particularly when the theory is fairly complicated, it is often possible to dispose of contradictory consequences by repairing the theoretical model in a number of places, or by revising its empirical references. The fact that a strictly derived hypothesis is proved false demonstrates only that the theory is at fault somewhere, but usually does not tell precisely where the rub is.
Thirdly, it is often good policy to preserve the theory at least for the time being — evidence to the contrary notwithstanding — even without modifications. This is especially true when the refuted hypothesis does not occupy a position of central importance in the nomological net, so that no fundamental assumption is affected; when the theory has produced highly acceptable results in other respects — and, in particular, when no alternative theory is available.
All these considerations apply the more forcibly when we are dealing with a theory of a probabilistic character, so that no single hypothesis can be falsified (4;1;2). Downright refutation of a theory is rare; as a rule, a theory is neither refuted nor proved; it is rejected or accepted, and this will in general involve comparison with other theories (cp. KUHN 1962).
Before embarking on a discussion of this topic, we must mention one other form, if not of empirical refutation, then at least of 'absolute' rejection. A theory may have to be rejected because of irreparable formal shortcomings: ambiguous empirical references, logical inconsistencies, extravagantly imparsimonious conceptualization, and/or inadequate testability. Whenever a theory, as it stands, fails to meet one or more of these requirements (stated in 3;1 and to be elaborated in 4;3), absolute rejection may be its lot. In practice, this most often means that the theory 'is not taken seriously' by the scientific community — the forum. Even then it will not die a spectacular death; rather, it will be allowed to languish and sink into oblivion — unless it is formally overhauled by someone who is convinced of the soundness of the basic idea. This process may be protracted, particularly in disciplines and spheres where the formal requirements of the scientific enterprise have not yet been
accorded general recognition. A good example is the stubborn persistence of Szondi's (gene-psychological) theories, the formal shortcomings of which have repeatedly been pointed out (SZONDI 1947; JANSSEN 1955; DE GROOT 1957a).

4; 2; 2 Relative rejection and acceptance of theories

As a rule, a theory will not be rejected — by the forum — until another, better theory is available, which covers the same general area of phenomena. But when is theory A' 'better than' theory A? Assuming that A' is a modification of A, the question may be rephrased thus: when is such a modification scientifically warranted? This question can be answered only if the two theories are mutually comparable, not only as regards the areas of phenomena they cover, but also in respect of the stage of explicitation and testing attained — that is, the extent of their available nomological networks. If such is the case, A' may be an improvement on A on one of these three counts:
— A' covers more ground than A (e.g., includes A as a special case), while satisfactory hypothesis testing results have also been obtained within the new ground broken by it;
— A' produces better hypothesis testing results within the same area;
— A' constitutes a simplified logical model (principle of economy, cp. 3;1;3);
or on a combination of these grounds. In brief, A' explains more, explains better, explains more simply, or exhibits a combination of these virtues.
The possibility exists, particularly in the physical sciences, that, on the strength of these considerations, the question whether A' is superior to A (or vice versa) can be readily resolved, once sufficient and sufficiently reliable evidence is available. Most often, however, a decision cannot be arrived at that easily. Either the above criteria will lead to conflicting preferences (e.g., because A' explains better but less than A), or the condition of comparability may not have been met (e.g., because the consequences of A' have not been sufficiently investigated or because the evidence for both theories is still inadequate). Consequently, a great many theories must be considered to have been neither rejected nor accepted by the forum. A' and A continue to exist side by side until, hopefully, their nomological nets have been elaborated to such an extent
that a definitive decision in favor of the one or the other can be made.1

1 What happens most commonly, however, is that later a new, third theory is propounded, possibly containing elements of both A and A', which will eventually supplant both.

Like rejection, acceptance of a theory — by the forum — is mostly based on comparison with other, less satisfactory, theories. If such is the case, that is, if a theory is accepted because it provides the relatively best logico-conceptual model available, its acceptance will as a rule be stamped as expressly provisional. This qualification may be made because, for instance, the nomological network is still too scattered and tenuous; explicit elaboration of the theory's implications is not yet sufficiently advanced; or adequate confirmation is still lacking. The (probabilistic) risk involved in unqualified acceptance is still considered unduly great. The most that can be decided is that the theory is preferable to other theories — but the possibility is not ruled out, nor the hope abandoned, that a new theory, conceivably a modified form of the present one, will prove more adequate.
A theory may be accepted on such comparative grounds in spite of the fact that its nomological net does not merely show gaps but has even produced squarely contrary evidence: strictly derived predictions that have been falsified. As a rule, the tolerance shown on this score will be a function, on the one hand, of the extent to which, despite its shortcomings, the theory is superior to other known theories and, on the other hand, of what hope is still entertained that a better model will eventually be constructed. It is exceedingly difficult to formulate the considerations that are operative here in a concise manner; in the last resort, the verdict rests with the forum, i.e., with the history of the scientific discipline in question.
The forum decision may, of course, be one of unqualified acceptance. The hypotheses in the theory then become 'laws' (cp. footnote 2, p. 77), and the theory itself is raised to the status of established scientific knowledge — and will probably be taught as 'standard information' in textbooks. The heliocentric arrangement of our solar system, with planets — the earth one of them — orbiting around the sun, the rotation of the earth and its spherical shape: all this has become accepted 'fact'; it is no longer theory but 'knowledge.' The same applies to the periodic table of the chemical elements and their atomic structures, to the Mendelian 'laws,' and in psychology, for instance, to some of the Gestalt 'laws' of the
visual perception of figures. Although these (former) theories and hypotheses have never been strictly verified (4;1) — any more than the proposition that all men are mortal, or more strictly (cp. 3;4;3, ex. 1), that they die within 150 years of their birth — they have nevertheless become accepted as facts. Some of them have received acceptance in spite of evident shortcomings and exceptions (e.g., the genetic laws and the laws of Gestalt psychology).
Even unqualified, 'absolute,' acceptance of a theory does not mean acceptance, in this form, as unassailable truth. Modifications, or even a complete reversal of the basic ideas, are not altogether impossible. Einstein's revision of the Newtonian laws of gravitation has already become a commonplace example. Other examples are the negative universal hypotheses concerning the impossibility of spontaneous generation (the generation of living from non-living matter) and the impossibility of E.S.P. (extrasensory perception): two hypotheses towards the refutation of which a great deal of work is being done these days. In other words, even 'absolute' acceptance of a theory or hypothesis — raising it to the status of 'knowledge' or 'law' respectively — means no more than: acceptance as at least part of the truth.
In conclusion, it may be noted that when a theory has neither met with rejection nor found absolute or even relative acceptance at the hands of the forum, it may nevertheless be accorded pro tempore acceptance by individual investigators or groups of investigators as a working theory or working hypothesis (cp. 2;2;5 and 2;3;3). Its function, then, will primarily be that of an organizational schema, that is, a frame of reference for a systematic scientific attack on the corresponding phenomena. It is given acceptance and support, not so much because it is impregnable as a theory, but primarily for its heuristic value; that is, because it opens up new possibilities for explicit elaboration and confirmation, and thus gives rise to investigations leading to the discovery of new empirical facts and relationships. These alone will add to our knowledge of the field. In addition, the hope is entertained that the working theory, either supported by, or modified on account of, such facts, may eventually be transformed into accepted theory (cp. 2;3;3).
4; 2; 3 Theory development
We have noted above the dual function of the working theory or hypothesis. On the one hand, it is susceptible of deductive elaboration, empirical specification (explicit elaboration) and testing, while, on the other hand, it serves the purpose of continued, improved theory construction. If, in addition, we recall our repeated mentions of 'modifications' because of discrepant findings, a fairly telling picture should have emerged of the complexity of theory development.
Naturally, the manner in which a theory takes shape cannot be described in terms of a single cycle comprising one phase of theory (hypothesis) formation and one of hypothesis testing. The process of explicitation alone (3;3;1) will, as a rule, call for a large number of tests; what is more, every attempt to modify the model will require a fresh start of the entire procedure. The spiral of progressing scientific inquiry continues to gyrate, and it is only in terms of a succession of recurring investigative cycles that the process of theory development can be described at all. One of its dominant features is the continuous interplay between factual findings and theoretical analyses: frequently a theory will be tested and revised in turn. The ideal pursued throughout is, assuredly, to devise critical experiments (observations) which will permit a definitive choice between competing models — but, most likely, the outcomes of successive tests will, each in their turn, raise new questions and new theoretical problems. Stated more simply, there is a constant search for relevant causal factors. It is sought to obtain these from 'varying experiments' or, whenever experimentation is impossible or unnecessary, from varying tentative interpretations and hypotheses, which are successively tested and evaluated.
In this process, it is not always possible to maintain a strict segregation of explorations and testing. As soon as the investigator seriously contemplates modifying a theory — upon the conclusion of hypothesis testing investigations or possibly while these are still in progress — any outcomes so far obtained will, within this new context, automatically be reduced to the status of explorative findings. Even so, it remains methodologically essential that a strict discrimination be at all times maintained. Modification of a theory — like the special case of the ad hoc hypothesis (2;1;6) — can never be the final word in any scientific inquiry. The modified theory again needs to be confirmed by strict tests on new materials.
Here, too, it is of importance to differentiate between the abstract theoretical model and its empirical references. If a theory is not entirely
satisfactory, proposals for its modification can, in principle, take one of two forms. Either the investigator can leave his model virtually intact, but so narrow its empirical references that the theory has adequate explanatory power within the more circumscribed scope of reality. Or, he may preserve and even expand the general purport of his theory at the expense of the precision and differentiating power of the model (cp. 2;3;4).
In the psychology of learning, investigators have in the main tended to choose the first alternative. As a result, learning theory is studded with models for areas of phenomena that have become more and more circumscribed (THORNDIKE 1932; SKINNER 1938; HULL 1943). In Gestalt psychology, investigators have tended to choose the other alternative. In the early twenties, fairly exact findings had been obtained concerning the visual perception of figures (RUBIN 1921; WERTHEIMER 1923). Through vague generalizations, sometimes contrary to fact (REVESZ 1938, pp. 76-77), these have been over-extended to other perceptual domains — other senses, more abstract (ap)perception, thought processes — so that what is left of the once fairly explicit and differentiated theory is just a handful of vague notions and principles (cp. Revesz' criticism of Guillaume, Koffka, Kohler and Katz, in REVESZ 1953).
Both solutions are in principle possible. Either way may, for that matter, lead to a 'theory' which is best discarded as it stands: in the first case, because it provides very precise knowledge about almost nothing; in the second, because it affords a very vague explanation for almost everything.

4; 2; 4 Development of theoretical constructs
One aspect of the interaction between outcomes of tests and (new) theory formation that deserves special mention is the way in which, in the course of this process, theoretical constructs develop. We have seen that, for the purpose of hypothesis testing, constructs will often be empirically specified by means of operational definitions. The question that must now be answered is: in what manner do empirical results react upon the construct, subsequent to the testing? And how will the continued interaction between the construct and the findings obtained through its empirical specifications proceed?
This process is of patent importance. Theoretical constructs and theoretical distinctions, both of higher and lower degrees of abstraction, seldom have immutably fixed meanings. Possible surplus meanings aside, their content and meaning are gradually rendered more explicit through
deployment of an ever-extending nomological net (3;3;2); they will be largely determined by investigative procedures and their results. They take shape and assume content as a result of empirical findings, and they grow together with the nomological net. Sometimes their boundaries become more sharply defined; sometimes they shift; sometimes a concept is discarded or split up; sometimes new constructs are generated.
Generally, there are two distinct aspects to the process of sharper delimitation of (the boundaries of) constructs: first, elaboration of the nomological net of the construct, in terms of theoretical relations and deductions, empirical specifications, and factual findings (the three types: A, B and C, described in 3;3;3), and second, the gradual elimination of their surplus meaning. Striking examples of this process are afforded by the history of quantitative as well as system constructs in physics: 'force' and 'energy,' as well as 'atom,' 'molecule,' 'light waves,' etc.
In the case of constructs or concepts denoting attributes, elaboration of the nomological net will consist largely in the formulation of one or more1 operational definitions, and in collecting empirical results with the corresponding observable(s). The form which elimination of surplus meaning will then take is that the operationally defined variable gradually comes to be accepted by the forum as an adequate representative of the concept, providing 'complete coverage.' The moment such a condition is attained, there is no longer any surplus meaning. The concept and the variable have to all intents and purposes become identical: the empirical specification is henceforth of the type gs — not involving any loss of generality (cp. 3;2;1 and 3;3;5).

1 In the latter case, the corresponding set of variables will take the place of the original concept. A good illustration is to be found in intelligence (tests), which will be discussed presently.

A very simple example broadly illustrating the general pattern — the following analysis is not based on a historical study — may be found in the development of a concept like 'fever' before and after the introduction of the clinical thermometer. The original clinical notion — considerably older than the thermometer — based as it was on a variety of sickbed observations, no doubt had a vaguer, but also a somewhat different, meaning than the modern one. There was certainly a surplus meaning (still evidenced in some languages by the popular notion of 'cold fever,' i.e., a feverish sensation unaccompanied by a rise of body temperature). After an initial period, in which the thermometer must have been used
1 In the latter case, the corresponding set of variables will take the place of the original concept. A good illustration is to be found in intelligence (tests), which will be discussed presently.
with a measure of justified scepticism — clinical decisions concerning the (degree of) fever presumably not being made on the strength of the thermometer reading alone — the new instrument soon proved so reliable and useful in diagnosis that complete coverage resulted. The instructions of the operational definition now state where, how, and how long the thermometer is to be applied, and what temperature (e.g., 98.6° F — perhaps subject to slight variations according to the individual or the time of day) is to be reckoned normal; hence any higher temperature read nowadays is simply the fever. The coverage is complete.
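By way of illustration — in a modern programming notation foreign to the original, with the 98.6° F norm taken from the passage and everything else invented — the operational definition might be sketched as a simple decision rule:

    # Minimal sketch of the operational definition of 'fever' as a decision rule.
    # The 98.6 F norm comes from the passage; the parameter allowing slight
    # individual variation is an illustrative assumption.
    def has_fever(reading_f: float, normal_f: float = 98.6) -> bool:
        """Any reading above the agreed normal value simply is the fever:
        concept and variable coincide, and coverage is complete."""
        return reading_f > normal_f

    print(has_fever(99.4))  # True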
In the behavioral sciences it is not easy to find instances where surplus meaning wears off that completely. What we do find are numerous cases in which, as a result of operational definitions and empirical findings reacting upon the construct, its content achieves a relative increase in definiteness. We must here leave out of account those cases in which the investigator deliberately adheres strictly to the operational definition of a given construct, i.e., cases where he ignores or bypasses a possible initial surplus meaning (in psychology for instance by introducing intervening variables; cp. 2;3;6 and the literature cited there). Such cases aside, it will frequently be seen that the construct and its operational definition gradually draw closer together; the construct achieves greater definiteness and precision. Thus, there is certainly a clearer picture nowadays of what is understood in differential psychology by a term like 'intelligence,' even though we may not entirely accept the complex operational definition suggested in 3;3;5. This applies likewise to such constructs as 'set' (Einstellung), 'anxiety,' 'extraversion-introversion,' the degree of 'cohesion' within a group, 'status,' 'social role' and the like, thanks to the various differing empirical specifications that have been employed for them. If a construct proves empirically manipulable and fruitful, its meaning and content will come to be based more and more on its extending nomological net, and the remaining surplus meaning will correspondingly become less important. This is in fact the ideal development of a theoretical construct.

However, it is by no means certain that a theoretical construct-as-intended will prove empirically manipulable. If it does not, continued research may then produce a shift of meaning; alternatively, the construct may be split up, or discarded. Differential psychology1 affords a multitude of instances. If one were to study the history of 'intelligence,' e.g., since TAINE (1870), one would certainly find both a shift in meaning and a relative increase in definiteness, as well as a process of disintegration resulting in a number of now accepted subcategories (THURSTONE and THURSTONE 1941; FRENCH 1951; MCNEMAR 1964; GUILFORD 1966, 1967). Another interesting example is Heymans' 'secondary function,'1 originally postulated as one of three fundamental dimensions of temperament (WIERSMA 1906; HEYMANS 1932, Ch. 2, I, 2). However, the different operational definitions did not provide sufficient common ground; several 'measures' for the secondary function showed zero correlations (VAN DER VLEUGEL 1939). In consequence, the concept as originally envisaged became untenable. Although it was not discarded in so many words — the forum seldom passes explicit verdicts — it was 'relegated to the background.' All the same, the basic idea proved to have a certain viability; it had been advanced at an earlier date (GROSS 1902), and it kept cropping up later in different guises (LE SENNE 1945). In his efforts to build a dynamic theory of personality round the central concept of 'conditionability,' Eysenck, too, has done something similar. Admittedly, he does not claim direct kinship with Heymans (EYSENCK 1957b), but the content as well as the experimental development of this new concept are remarkably analogous to the 'secondary function' — unfortunately including reports of no correlation (BARENDREGT 1961, Ch. 10).
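The decisive evidence can be pictured in the same sketch-like fashion. The following fragment — invented data, not a reconstruction of Van der Vleugel's analysis — shows the pattern of near-zero intercorrelations among several 'measures' that tells against a common underlying construct:

    # Illustrative only: three would-be measures of one construct are generated
    # independently, so their intercorrelations hover near zero.
    import numpy as np

    rng = np.random.default_rng(0)
    measures = rng.normal(size=(100, 3))     # 100 subjects, 3 operationalizations
    r = np.corrcoef(measures, rowvar=False)  # 3 x 3 correlation matrix
    print(np.round(r, 2))                    # off-diagonal entries near 0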
It should be added that Eysenck's earlier work was characterized by deliberate efforts to bring about the kind of development (improved definiteness) of constructs described above. His method, based on batteries of objective tests and 'criterion analysis,' seeks to establish exhaustive operational definitions of fundamental personality dimensions like 'neuroticism' or 'extraversion-introversion,' of such a high degree of operational validity that the resulting variables can henceforth be accepted (by the forum) as empirical representatives of these constructs, i.e., shorn of all surplus meaning (EYSENCK 1947, 1952b; cp. also CATTELL 1946 and 1957).
1 This domain presents an easy choice because of the proliferation of test methods that give rise to operational definitions. But other areas of science could have been chosen as well. Instances of construct development, illustrating how empirical procedures and findings react upon the content of constructs, and vice versa, can be found in any scientific field.
1 Not unlike conditionability, 'secondary function' as a personality dimension was supposed to represent the duration of 'mental after-effects' in a person, generally.
Another important possibility is, finally, that new constructs are not derived from an antecedent theory, or from a popular notion, but originate from empirical findings. This, also, is a regular occurrence. In the course of a hypothesis testing investigation it may be found, for instance, that a given prediction is fulfilled in one set of conditions, but not in another. This finding is not itself a test result — for no prediction to this effect had been made — but it may generate a process of new hypothesis formation featuring a theoretical notion based on this finding. In this connection may be mentioned attempts to develop full-fledged personality variables from so-called 'response sets,' which were originally conceived of as no more than confounding tendencies on the part of a subject answering questionnaires (e.g., the set to acquiesce), or tendencies to prefer the socially most acceptable answer irrespective of one's own feelings or opinions (see e.g., CRONBACH 1950; BASS and BERG 1959). The abstract term chosen, and hence the new construct, may or may not hark back to existing concepts or distinctions already employed in older theories or conceptions. An extreme case of exuberant conceptualization may be found in CATTELL (1957). The manner in which he handles factor analysis in personality and motivation research stamps his 'taxonomic' approach as a method for generating new theoretical constructs. Whether or not these new constructs will prove valuable, continued research and the forum debate will have to decide.
4; 3 NORMATIVE STANDARDS FOR THE PUBLICATION OF THEORIES AND HYPOTHESES
4; 3; 1 'Testability' necessary and sufficient
In the preceding chapter we have become better acquainted with the deductive process and the principles underlying confirmation procedures, as well as with the requirements to be imposed on these matters from the viewpoint of empirical science. We have seen that constructs are scientifically meaningful only if they can at least, possibly through other concepts, be developed (or explicitated, in network terms) into adequate operationally defined variables. We have seen that
hypotheses are scientifically acceptable only if they can be deductively specified to yield predictions. We have seen that these predictions must be strictly verifiable, as well as relevant with regard to the hypotheses from which they derive. The question now arises whether all these findings enable us to set up more explicit standards for the formulation of theories and hypotheses. We thus return to the subject matter dealt with in 3;1. However, we shall now pose the problem in a more concrete form: What are the standards to be met by the investigator in formulating a theory or hypothesis for a scientific publication? We shall confine our attention to formulation in scientific publications, since it is only here that the investigator enters expressly into communication with his co-scientists in a manner on which formal demands can be made. In 3;1 we have seen that (the exposition of) a theory or hypothesis must be logically consistent (3;1;2), economically formalized (3;1;3), and testable (3;1;4), and that it must be presented with stated empirical references (3;1;5). We are now in a position to reduce these four requirements to one, notably the requirement of testability in a somewhat expanded sense. In our discussion of the principle of testability we have argued that it can be viewed in two different ways: as an absolute minimum requirement — there must at least be some overt links between the theory and empirical procedures — and as a relative quality or virtue which a theory or hypothesis may possess to a greater or less degree. The absolute, minimum, requirement must of course be maintained at all times. The second, relative, conception may be given this reading: Any avoidable hindrance to the most inclusive and varied testing possible is to be regarded as an infraction of the principle of testability. On this interpretation, all three other principles can be subsumed under the one of testability. The term 'avoidable' does not require a separate definition; it stipulates that, in the critical process, any hindrances to testing must be shown to be avoidable. In the present relative application, any infringements of the principle must be established by comparing the theory or hypothesis with an alternative model (with empirical references), either already extant or formulated for critical purposes. So, while the principle of testability cannot be employed 'blindly,' its present formulation does indicate how it can be applied, case by case, to demonstrate that, and how, improvements could have been made.
As far as logical consistency is concerned, the argument is very simple. If there are inconsistencies in the theory, it must be possible to derive, by strictly logical deduction, different consequences which are mutually contradictory — that, after all, is the criterion for the existence of contradictions. To the extent that these exist, the theory also fails to meet the requirement of testability, since it is impossible to make predictions on the basis of incompatible statements. Further, if the formulation of a theory is insufficiently parsimonious, it must contain superfluous concepts and/or statements; superfluous, that is, not logically necessary for the transformation of the theory into testable consequences within the area that it purports to cover1 — that is the criterion for dubbing a theory uneconomical. If such is the case, the theory, as far as these superfluous concepts or statements are concerned, also fails to meet the requirement of testability. Finally, insofar as the empirical references of the theory or hypothesis are not clearly stated, there is no unambiguous indication of the area or universe of phenomena, cases, events, conditions, or persons to which it is held to be applicable. To the extent that there is uncertainty regarding the intentions and pretensions of the theory, its being tested against 'new materials' is impeded: it is not known whether the new materials will fall within the universe, and therefore the theory cannot be adequately put on trial. Accordingly, the principle of testability alone, in a slightly modified version, will suffice. Let us see if we can now formalize this requirement, that is, mold it in formal, for instance logico-syntactic, rules.

4; 3; 2 Different forum conventions
Basically, the requirement of testability, in the form given above, is still very simple. Nevertheless, it requires a judgment to be made 'case by case,' to demonstrate 'how improvements could have been made.' It seems hardly likely that such a principle should be susceptible of formalization. Moreover, in our analysis of the scientific process, particularly in the chapter on confirmation (4;1 and 4;2), we have repeatedly been confronted
1 The problems involved in deciding the question which of two given logical models that have equal explanatory power is 'simplest,' are left out of account here (cp. 3;1;3). If, however, there is a demonstrable difference in economy, this too will come under the principle of testability.
with the impossibility of setting up more than relative requirements. Sometimes, again, having arrived at precise formulations for basic principles, we have had to 'back down' and appeal to the final authority of the 'forum' of scientific investigators. Now that we have come near the end of the road, we cannot all of a sudden, in summing up the foregoing discussion, lay down strict rules concerning the form that a theory must or must not have. Quite to the contrary, the logical outcome of the conception of science consistently put forward here is that we reject the possibility (as well as the desirability) of a logico-analytic language criterion for testability. This means, in effect, that the question of whether the formulation of theories and hypotheses can stand the test of criticism, is in principle relegated to the forum. This, however, is not to say that the discussion may now be regarded as closed. The question is merely modified to read: What requirements must (or can) the forum impose on the formulation of theories and hypotheses from the viewpoint of testability, so as to enable them to carry out their task of critical appraisal essential to the advance of science? We shall find that certain standards can be indicated here. To some extent, the standards which scientists, in their mutual criticism and interchange within a given discipline, expect to be met by their colleagues' and their own work are in the nature of conventions. This is evidenced by the very fact that these standards have varied with the times and with different cultures. Newton is known to have postponed publication of his theory of gravitation, in explanation of the motions of the planets, for quite a long time. His sole motive was that he had not yet succeeded in obtaining evidence that the laws of gravitation developed by him for point masses were, without modification, applicable to homogeneous spheres. Within this momentous and literally world-embracing theory, this was just a detail of mathematical elaboration; nevertheless, Newton postponed publication until he had mastered this as well. It has been remarked that modern theoretical physicists show considerably less restraint in publishing their theories. Conversely, in mathematics we see that the proofs which nineteenth century mathematicians have given for indubitably accepted theorems are nowadays regarded as largely incorrect, as lacking in precision; the forum standards have undergone a change. Likewise, much of the criticism of Freud's work, in particular of his methods of clinical confirmation, stems from the fact that among psychologists, and to a lesser extent among
psychiatrists, there is nowadays a much livelier critico-methodological awareness and insight than Freud himself could possibly have had in his day. The same applies to sociology and anthropology; there too, as the numerous basic studies of methods will testify, the forum has become more critical and harder to satisfy. The upsurge of interest in the critical study of fundamentals, in epistemology and methodology, and in particular the logico-empirical movements within these domains, have made and are making their influence felt. Over against this must be set the still fairly pronounced differences in the criteria for publication observed in different areas of the scientific realm; for instance, between the major groups of the physical sciences on the one hand, and the humanities or cultural sciences on the other. A class apart is formed by the medical sciences, which all over the world labor under the handicap that the training given in medical schools, while certainly strenuous, provides little strictly scientific schooling. As a result, medical researchers frequently still employ methods of confirmation that would no longer be tolerated in other fields of scientific inquiry. This becomes particularly evident whenever medical researchers venture outside the province of the strictly somatic — into preventive and social medicine, or into psychiatry. Here, too, great efforts are undoubtedly being made to step up the standards exacted by the forum — or rather perhaps, to establish a forum endowed with authority1 — but to date other sciences still have the edge on medicine. Finally, even within one science there will often be differences between different countries, cultures and groups — that is, differences between what might be called local 'forums.' Now, in order to avoid making methodologically unnecessary discriminations between different sciences and cultures, we should try to abstract from what are, for instance, merely Anglo-Saxon or American conventions, or usages found only in the natural sciences. Is it possible to formulate universal minimum requirements for the present day? What are, and what are not, reasonable demands with regard to the publication of theories and hypotheses?
1 A special difficulty, aside from training, is probably caused by the social position of medical science. Particularly in an area like mental health, this may easily lead to the emergence of 'rivals' of the forum in the form of largely power-based hierarchies and authoritative bodies, in which the viewpoint of the methodology of social scientific inquiry has but little sway.
4; 3; 3 In quest of minimum requirements

Any requirement to the effect that a theory or hypothesis must not be published until its validity has been proved is, of course, untenable. We have already seen that hypotheses, both of the positive, universal deterministic and of the probabilistic type, cannot be proved, but only confirmed (strengthened). Is it then possible to require that the investigator must at least have completed elaboration of the abstract model of his theory in a strictly logical (or symbolic-logical), perfectly integrated form — and, possibly, that he must have provided all the mathematical and/or logical proofs for the relationships assumed within the model (cp. the Newtonian example above)? This, too, is an untenable requirement — however beneficial such stringency in formal elaboration may be — since for a good many theories, not least in the behavioral sciences, strict, let alone axiomatic, structures of this kind are not (yet) the most adequate (cp. 2;3;1). Is it to be required, then, that the investigator shall not proceed to publish his theory or hypothesis unless he has at least substantially contributed to its confirmation by his own research? Roughly, this is the requirement actually enforced in American academic psychology.1 Generally speaking, a theory or hypothesis will be published only in conjunction with a report or discussion of the investigative procedures carried out to test it; and these procedures must be of a stringent, and preferably strictly experimental character. No attempt at 'completeness' in the explicitation of the theory, and in following up its consequences, is demanded; formal simplicity, however, is also a prime requirement. A theory that implies far more than what can actually be tested against available empirical materials is not considered publishable. Theorizing without reference to investigative procedures and empirical results is unacceptable. So are interpretative reports based solely on clinical experiences, and purely 'verbal' considerations based on empathic understanding (Verstehen) or phenomenological interpretation. This empirical viewpoint undoubtedly has much to commend it. The forum is spared the multitude of gratuitous theories and ever-new programs so characteristic of Continental psychology. The investigator is, as it were, held to the principle: Your promise must be made good.
1 It is enforced to a large extent by the policies of the leading professional journals — the 'Psychological Review' excepted.
Or rather: Refrain from making promises (in your theory), unless you have already done something (by way of testing). On the other hand, there is a danger that in an intellectual climate where this principle prevails great theoretical promises will never be made. In other words, such a climate is not conducive to the flowering of great comprehensive theories or hypotheses. If the principle of economy is too closely adhered to in the forum criticism, there is a distinct danger that scientific production will remain piecework: a mosaic of little bits that are indeed pieced together, but which lacks the bold overall conception, the truly fruitful idea, the touch of genius. In terms of principles, the empiricist parsimony law, if carried too far, would jeopardize the vital freedom of theory (hypothesis) formation. In sum, the empiricist requirement that no hypothesis shall be published, unless the investigator at the same time proffers at least some empirical data by which it has been tested and confirmed, should not be stipulated as a 'must.' While useful and practical in many respects, such a requirement is likely to lead to restrictions that might imperil the scientific enterprise, since they are, in the final analysis, arbitrary. An as yet untested theory or hypothesis, published without new evidence to support it, may spark off meaningful criticism by the forum and lead to fruitful scientific exchanges.

4; 3; 4 Explicitation essential
If no more than minimum requirements are to be formulated, it follows that one may legitimately publish a theory or hypothesis without adducing new empirical data against which it has been tested. But the theory or hypothesis must be strictly 'testable.' If it is not, the way is barred to further exploitation by experimental tests, by replications or supplementary investigations by others; the forum cannot do its work. In connection with our problem this means that the investigator who publishes a theory or hypothesis must demonstrate that, and how, it is testable. The burden of empirical proof does not rest with him, wholly or in part — that would be asking too much — but the burden of explicitating and specifying his theory does. He is bound to indicate at least a number of points on which it can be deductively elaborated to yield verifiable and relevant predictions. Only if he does so, can the forum turn his contribution to good account, either through criticism or through investigative procedures aimed at testing the stated consequences.
The forum criticism, then, might be directed against, for instance, the stated deductive elaboration, the explicitation. Here the principles of 3;1 will be brought into play again. The critic may, for instance, ask searching questions about the empirical references (pretensions) of the theory or hypothesis: To what population does it purport to be applicable (3;1;5)? Or, he may search for inconsistencies and ambiguities in its conceptualization (3;1;2), or for defects regarding its falsifiability (3;1;4). He may regard certain constructs, or relations among constructs, as not logically necessary (3;1;3), and thus not empirically fruitful (testable). Again, the critic may be of the opinion that the predictions indicated in the explicit elaboration, even when fulfilled, are not sufficiently relevant to the hypothesis with regard to which they are supposed to provide information (4;1;3). This sort of criticism is frequent in scientific discussions. However, it can be fruitful only if the investigator who publishes a theory or hypothesis has in fact pursued his argument to the point where there are actual testable consequences. As far as opportunities for testing by others — members of the forum — are concerned, it should be self-evident that the author of a theory must explicitate its implied consequences. If this requirement is not sufficiently satisfied, other investigators have no points of application for tackling the theory or hypothesis by empirical means. Should they nevertheless try to explicitate it themselves, and proceed to test the ensuing predictions, the author can — in case of non-fulfillment — always counter with the assertion that these deduced consequences were not intended: 'My theory has been completely misconstrued.' Unfortunately, this type of deadlock in the forum discussion is not unknown in certain areas of social science. Where 'theorists' and 'testers' still live in different worlds, the burden of explicitation may come to rest on the wrong shoulders; it is the theorist's methodological duty to make clear what he means. The question of to what length this should be carried leaves some room for argument. Perhaps it is not necessary to require that the empirical implementation must be worked out in detail by the designer of the theory. But he will have to indicate along what lines deductive and empirical specifications can be obtained which lead to verifiable predictions.

4; 3; 5 Falsifiability
This standard for the publication of theories and hypotheses, naturally, applies not only to their designers but also to their supporters, insofar as these aspire to be included in the
scientific community. It can be further elaborated in one more respect. In the foregoing we have repeatedly noted that in the scientific enterprise negative confirmation and, where possible, 'falsification' play a very important role. We have seen that a deterministic, positive universal hypothesis cannot be 'verified' (shown to be true), whereas it can be falsified (4;1;1). We have seen that when predictions fail to meet the requirement of verifiability, the core of their defect is most often a lack of falsifiability (3;4;3). We have seen that the existence of a statistical relationship is usually demonstrated by refuting an alternative (null) hypothesis (4;1;2). And we have seen, finally, that the relevance of a — positive or negative — outcome of an investigation is determined primarily by the degree to which such an outcome disproves, or at least throws doubt upon, one or more (alternative) hypotheses (4;1;3). Ceteris paribus, a theory or hypothesis is the more valuable as it risks more; its value will reach rock bottom if in the formulation no risk of refutation is incurred at all. This means that in the explicitation required of the designer (or supporter) of a theory or hypothesis special attention must be paid to the possibility of negative confirmation. Anyone publishing a hypothesis should therefore indicate in particular how crucial experiments can be instituted that may lead to the refutation or abandonment of the hypothesis. The author of a theory should himself state which assumptions in it he regards as fundamental, how he envisages crucial testing of these particular assumptions, and what potential outcomes would, if actually found, lead him to regard his theory as disproven. Thus we have after all arrived at fairly concrete standards. These, it is true, do not prescribe anything with regard to the logical form, i.e., to the actual results of the activity of formulating theories and hypotheses, as we might perhaps have hoped at the outset. But they do prescribe in non-formal terms what an investigator is required to do; they do relate to the formulation of theories and hypotheses as an activity in the social field of scientific communication and collaboration. This outcome is in accord with the conception of science advanced in this study as a specific, distinctly social, activity. The normative standards by which this activity is governed derive directly from its goal and from the requirement that the social process must work with a reasonable degree of efficiency.
CHAPTER 5
FROM FORMULATION TO TESTING AND EVALUATION
5; 1 DESIGN OF HYPOTHESIS TESTING INVESTIGATIONS
5; 1; 1 Freedom of choice
In this chapter we shall address ourselves to a survey of what goes on in the empirical scientific process, subsequent to the formulation of a theory or hypothesis, whenever these are to be tested. In terms of the cycle this means that we shall now treat phases three, four and five as a whole. In our discussion the preparations for, and the design of, the testing procedure (phase 3, in the main) will be our chief concern: 5; 1 and 5; 2. The actual realization of the testing (phase 4) and the evaluation of the results (phase 5) will be discussed in 5; 3. The preparations for the testing procedure may be viewed as a series of choices or decisions that the investigator has to make. The first one is naturally that of the subject matter in a broad sense — the theory or hypothesis which he is going to test. Consequent upon this are other decisions. To begin with, it will not as a rule be possible to test the selected hypothesis — much less a theory — as a whole. The investigator will probably have to restrict himself to testing only one or a very few of its logical consequences; logical specifications (particularizations) of type pl (cp. 3; 2; 1) will be called for. He must choose which ramification of the theory he will investigate. Second, it will often be necessary to introduce limiting conditions in the form of empirical specifications of type ps (cp. 3; 2; 1). This necessity will arise, in particular, when constructs as used and intended in the hypothesis are to be transformed into objectively manipulable variables. For each of them, an operational definition will have to be chosen,
such that the construct can be used as a discriminative instrument. This 'instrument' is in fact provided by the operational definition of the variable — which, as we have seen in 3; 3; 4, consists of a series of instructions. If no existing operational definitions (instruments) from earlier work are available, the investigator will have to construct one himself. In other words, the investigator will sometimes have to carry out the instrumental realization of the concept himself and choose a procedure suited to the purpose. Third, the investigator has to make certain decisions with regard to the testing procedure itself. If an experimental test is envisaged, the details of the experiment must be fixed. It may be necessary to delimit or restrict the population; the sampling will call for a number of decisions; details of the experimental proceedings and instructions to the experimenter(s) must be set down in advance; control groups may have to be assembled or control materials collected, and the like. Some of these practical decisions will again entail limiting conditions (particularizations of types pl and ps), others involve no loss of generality (types gl and gs). Considerations of experimental expediency may even prompt the investigator to shift his approach, for instance by tackling another ramification within the nomological net of the same hypothesis. Fourth and last, certain decisions must be made regarding the confirmation procedure, i.e., regarding the manner in which the empirical results are to be processed and analyzed to yield a conclusion concerning their confirmation value for the hypothesis under scrutiny. The most straightforward cases are no doubt those in which statistical confirmation methods are used, involving, possibly, the formulation of a null hypothesis and the choice of a statistical test and confirmation criteria (among them a significance level). All this comes under the heading of preparations. Once these are completed, the investigative procedure can be carried out, followed finally by an evaluation of the outcomes. Obviously, the investigator has a considerable amount of freedom. Even when committed to a particular theory or hypothesis he is still free to choose which consequences he will investigate and in what manner. The various alternatives between which he may choose are not always 'given' in advance; in other words, the investigator can to some extent indulge his creative bent, or at least exercise his ingenuity. Particularly in experimental investigations, the task of designing a
sophisticated experimental set-up may be something of an art, calling not only for practical experience (and often for organizing ability) but also for inventiveness and imagination. It is to be noted, by the way, that the imaginativeness required in empirical or experimental matters — the experimenter's art — is of a different character from the theoretician's art, which corresponds to the freedom of (theoretical) design discussed in Chapter 2 (cp. esp. 2; 1; 2). Here the two aspects of the scientific enterprise, the logico-theoretical and the empirical-factual (cp. e.g., 2; 2; 1), are clearly brought out again. Skill and mastery in the one area do not always go together with proficiency in the other. There are pre-eminent theoreticians as against notable experimenters. In physics this duality has even found recognition in a systematic dichotomy (experimental and theoretical physics). This is not the case in the behavioral sciences1, but there too the distinction may be useful. Even in the non-experimental cultural sciences, the creative gifts of the thinker, the theoretician, may be contrasted with the inventiveness of the investigator or observer in the field. To the extent that the investigator has a free hand in organizing his testing procedure — to the extent that hypothesis testing (experimentation) is in fact an art — strict standards for what is, or is not, permissible are no more applicable than in hypothesis formation. The most that can be done is to make some recommendations, sketch certain possibilities, and discuss limitations of this freedom of choice. These subjects will be dealt with in the ensuing paragraphs.

5; 1; 2 Considerations pertaining to confirmation
The investigator's freedom of choice in designing the set-up of his empirical testing procedure is no more unlimited than is the freedom of theoretical design (2; 1; 2 ff.). Certainly, the entire process up to the actual test may be described as a series of choices, a series of decisions — but these must be sound and judicious ones. At every choice point there are usually a number of superior and inferior alternatives (ramifications); the investigator must consider and weigh
1 The term 'experimental psychology' used to be employed to mark the contrast with 'speculative (i.e., non-empirical) psychology.' Although the term is still used occasionally, generally in the sense of: psychology of (cognitive) functions, it can no longer be interpreted as distinguishing those psychologists who use the experimental method from those who do not.
a variety of criteria before he can make a judicious decision (cp. NEWCOMB in his introduction to FESTINGER and KATZ 1953, p. 1). A group of considerations of great moment for such decisions has to do with efforts to ensure optimum confirmation value of the empirical results to be obtained. Preparations for hypothesis testing must include not only advance arrangement of the details of the testing procedure proper (cp. 5; 1; 3), but also advance consideration of the procedures to be employed for confirmation and evaluation. The crucial question is to what extent the possible outcomes of predictions, as they are produced by the experimenter's preceding choices, warrant conclusions with regard to the hypothesis (or theory) that is being tested. This question breaks down into a number of different, although interrelated, considerations concerning the confirmation process. First, how relevant is the deductively derived experimental question (the prediction and its possible outcomes) with regard to the theory or hypothesis to be tested? Is a fundamental assumption involved? Is the 'logical distance' to the theory or hypothesis no greater than is strictly necessary? Is the deduction itself justified beyond debate, i.e., does the prediction really follow from the theory? What will be proved, and in particular what will be disproved, by a particular finding (cp. 4; 1; 3)? Second, the empirical question has been derived from the hypothesis not only by logical steps of inference, but also through empirical specifications (cp. 3; 2). Are these acceptable? In particular, are the variables-as-defined still sufficiently representative of the concepts-as-intended (cp. 3; 2; 1, 3; 3; 4 and 4; 2; 4)? Do the operational definitions adequately cover the constructs? Is the choice of methods and instruments appropriate? And, are these instruments objective, reliable and valid (cp. Chs. 6, 7 and 8)? In the social and behavioral sciences, the investigator himself will often have to provide instrumental realization for some of his constructs: are these provisions satisfactory? Is the variable still a sufficiently adequate representative of the construct, so that afterwards generalization from the findings with the operationally defined variable will lead to a sufficiently definite conclusion concerning the construct? It goes without saying that advance analysis of all these factors is of great moment for the confirmation value of the experimental outcomes. Third, it is important for the investigator in advance to form a clear idea of the possible alternative (theoretical) interpretations of fulfillment (or non-fulfillment) of the prediction. This is mainly a question of whether
other theoretical models could provide an equally satisfactory explanation for the observed outcome. In other words, does the experimental design of the test discriminate with adequate precision between different theories (cp. 4; 1; 3)? A special case where it is particularly important to foresee developments has been mentioned before (cp. 3; 4; 2): interpretation of the outcome as due to a so-called disturbing factor. Schematically, this may be illustrated as follows. Suppose one wants to demonstrate a statistical relationship between A and B, say, that there are relatively more A's than non-A's that are B. Suppose, then, that in the testing procedure significantly more A's than non-A's are in fact found to be B. This finding, however, is now interpreted as due, for instance, to the fact that in the test sample — but not in the population — the A's were at the same time relatively more often C; while it is known, or likely, that C and B are in some way connected. Therefore, the finding has no significance for the relationship between A and B in the population, which it was sought to demonstrate. Stated otherwise1, because of the disturbing factor, i.e., the distorted sample, the verifiability conditions for the prediction derived from the theory under scrutiny were, according to this criticism, not satisfied (cp. 3; 4; 2). Generally, the investigator ought to foresee such possible (c)-interpretations and to try to eliminate them in advance by modifying his design. In the behavioral sciences such advance elimination is in fact one of the major concerns of the investigator. As we shall see in the remainder of this chapter and also in Chapter 6, his chief care will be for objectivity, since the disturbing factor will most often be: contamination of the sample or of the empirical materials by 'subjective factors.'
1 It should be noted here that one and the same 'disturbing factor' may be variously described according to the point of view taken and the aspect emphasized: 'the (theoretically derived) prediction was not verifiable'; 'the factual prediction was indeed verifiable but not relevant'; 'the prediction failed to discriminate between two theoretical models or modes of explanation'; 'the sample was distorted'; 'the data were contaminated'; or, possibly, if the latter defects on this level could have been remedied: 'the experimental conditions were not formulated with sufficient precision.' This multiplicity of descriptions corresponds to the multiple possibilities of improving the test design — among which a judicious choice must be made. We are here dealing with a special case of a general phenomenon: whenever there is something amiss in a testing procedure — including the case that a hypothesis derived from a theory is disconfirmed — there is room for different views as to the root cause of the trouble. Hence repairs (modifications, cp. 4; 2; 3) can also be made in a corresponding number of places.
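The schematic argument lends itself to a small simulation. In the following sketch — ours, with invented probabilities — A and B are connected only through C, yet the distorted sample yields relatively more A's than non-A's that are B:

    # Illustrative simulation of the disturbing factor: B depends only on C,
    # but in the distorted sample A co-occurs with C, so an apparent A-B
    # relationship emerges. All probabilities are invented.
    import numpy as np

    rng = np.random.default_rng(1)
    n = 10_000
    C = rng.random(n) < 0.5
    A = rng.random(n) < np.where(C, 0.8, 0.2)  # distortion: A's tend to be C
    B = rng.random(n) < np.where(C, 0.7, 0.3)  # B is connected with C, not with A
    print(f"share B among A's:     {B[A].mean():.2f}")   # about 0.62
    print(f"share B among non-A's: {B[~A].mean():.2f}")  # about 0.38 -- due to C alone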
It is essential to so draw the sample or samples and to so arrange the conditions of the investigative procedure that such alternative interpretations are obviated. Among the considerations bearing on confirmation, a fourth type is of a more technical nature, namely statistical considerations. A judicious answer to the questions concerning the population to be sampled and the sample to be drawn can often be obtained only on statistical grounds. How large should the sample be (i.e., how many cases) to enable sufficiently certain conclusions to be drawn by means of the (statistical) processing envisaged? What statistical operations are to be carried out, i.e., what statistical test will be most adequate? What significance level is to be adopted, and is a one-tailed or two-tailed test to be used? Obviously, among the considerations determining procedural decision-making in hypothesis testing, statistical ones play a prominent part. Admittedly, a good deal of the attention that books on experimental design bestow on statistical considerations (cp. e.g., EDWARDS 1956) is due to the fact that these lend themselves better to a technical, specialized treatment than the more qualitative considerations discussed so far. But since they also are of indubitable importance, the investigator will be well advised to work out in advance the entire procedure of processing and statistical testing and to analyze, for each possible outcome, how the confirmation and evaluation processes are to be conducted. Before the actual test is begun, all this must have been thought out in detail, and the procedures to be adopted must have been fixed in advance.
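What 'fixed in advance' may amount to can be sketched concretely. In the fragment below — an illustration of ours, not a prescription from the text — null hypothesis, statistical test, direction and significance level are all laid down before any data exist:

    # Hedged sketch of a confirmation procedure fixed before data collection.
    # The concrete choices (one-tailed two-sample t test at the .05 level)
    # are illustrative assumptions.
    from scipy import stats

    ALPHA = 0.05                          # significance level, fixed in advance

    def confirm(treated, control):
        """H0: the treated group does not score higher than the control group."""
        result = stats.ttest_ind(treated, control, alternative="greater")
        return result.pvalue < ALPHA      # confirmation criterion, also fixed in advance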
5; 1; 3 Practical considerations

Another group of considerations that limit the investigator's freedom of choice is of a more practical nature. In part these have already been included in the above discussion, for instance, when reference was made to the choice of an 'efficient' procedure. Efficiency, it will be clear, implies that the investigator should attain his goal with a minimum of effort in terms of time and money spent, number of experimental subjects, instruments used, etc. This thoroughly practical aspect is so obvious that little further comment is required. The same applies to such limitations on experimentation as are imposed by money allocations, the skills and abilities of the available staff, the accessibility of materials and data, and the like. An important practical consideration concerns the availability of good instruments (objective, reliable, and valid) for determining the variables
required: psychological tests, including instructions for data processing, standardized scales, questionnaires, etc. True, the investigator himself will sometimes be able to create an adequate instrumental realization of his constructs, but quite often this would in itself require an ('instrumental,' cp. 9; 1; 3) investigation of such proportions that it is out of the question. In that case the investigator will have to fall back on existing, properly standardized instruments — which may not be adequate. Finally, the availability of mechanical or electronic aids and computer programs for data processing is an important factor, particularly in areas of research where elaborate statistical processing techniques (e.g., factor analysis) and/or large numbers of cases are in order. Simple and obvious as the fact may be that such practical matters affect procedural decision-making, the manner in which they make their effect felt in the practice of scientific research is often irrational and hard to gauge. An investigator's success may depend in large measure on his skill in handling 'practical factors,' such as obtaining financial support; cooperation with other experts and enlisting the help of specialized institutions; tapping resources and securing technical aids; and adapting the research program to the available funds, resources and staff. Anyone unaccustomed to modern technical aids (tests and other standardized instruments, optical scanners, statistical techniques, computer programs) will often fail to incorporate them in his research plan or to make adequate use1 of them even when they are available; anyone familiar with them may tend to overlook more direct, simple procedures. This is hardly the place to launch into an extended discussion of prevailing research shortcomings on this level. There can be no doubt, however, that these practical, or perhaps rather, technical considerations deserve to be given far more attention in the planning of research projects (hypothesis testing investigations and otherwise) than they often receive, at least in Europe. Another practical concern of importance is that of incorporation of one's research efforts within a larger master plan and/or coordination with the efforts of others. This too is a matter of efficiency, which is often badly neglected — not only in Europe. Investigators are often
inclined to pursue only their own theories or to keep flitting from one 'interesting' new thing to another, when they would be better engaged on such down-to-earth tasks as constructing (or adapting and standardizing) good instruments for research, or replicating earlier studies with fresh samples. In the behavioral sciences efficient research cooperation — completing by a concerted effort an already designed and partially erected edifice — seems to be the exception rather than the rule; instead, there is a tendency to 'lay foundation stones' over and over again. These too are practical considerations, which deserve more attention than they are usually given.

1 Tyros tend to make the mistake of expecting prodigious results from machine processing. They fondly assume that large, uncritically collected, 'interesting' materials not based on any clearly formulated problem will, of themselves, blossom forth into meaningful hypotheses and test results: most literally, deus ex machina.

5; 1; 4 The importance of advance analysis
The above description of the preparations for hypothesis testing as a series of choices or decisions concerning subproblems of an overall objective may have put the reader in mind of Chapter 1 (and in particular 1; 2). In point of fact, the planning of a testing procedure is not very different from planning and directed thought in other areas of endeavor. The crucial point is that the researcher should look ahead and analyze consequences in advance. This is what the chess player does when he designs a strategy or calculates several moves deep to determine his choice of a particular move. The researcher, however, has a distinct advantage in that he can go beyond 'mental trials' (1; 1; 4). He can work out things on paper and try out a variety of exploratory approaches, and he can seek advice — aids which the chess player is denied, at least during play. To ensure an optimally organized testing procedure, these aids should be exploited to the full. Because of the investigator's freedom of choice, and in particular because of the great diversity of research problems and conditions, there can of course be no question of formulating strict normative standards for this advance analysis. What can be done, again, is to mention various possibilities and make some recommendations. Foremost among these latter is the recommendation to work out in advance the investigative procedure (or experimental design) on paper to the fullest possible extent. This 'blueprint' should comprise:

a brief exposition of the theory;
a formulation of the hypothesis to be tested;
a precise statement of the deductions leading up to the predictions to be verified;
a description of the instruments — in the broadest sense — to be used, complete with instructions for their manipulation;
detailed operational definitions of the variables to be used;
a statement about the measurement scales (nominal, ordinal, interval, ratio) in which the respective variables are to be read (cp. 7; 2; 2);
a clearly defined statement of the respective universes to which the hypothesis and the concrete prediction(s) apply;
an exact description of the manner in which the samples are to be drawn or composed;
a statement of the confirmation criteria, including formulation of null hypotheses, if any, choice of statistical test(s), significance level and resulting confirmation intervals (cp. 3; 4; 2 and 4; 1; 3);
for each of the details mentioned, a brief note on their rationale, i.e., a justification of the investigator's particular choices.
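Rendered anachronistically as a structured record — all field values below being placeholders of our own — such a blueprint makes any gap in the design immediately visible:

    # The written 'blueprint' sketched as one structured record; every key
    # mirrors an item of the list above, and all values are placeholders.
    research_plan = {
        "theory": "brief exposition of the theory",
        "hypothesis": "formulation of the hypothesis to be tested",
        "deductions": "precise statement of the steps leading to the predictions",
        "instruments": "description, with instructions for their manipulation",
        "operational_definitions": {"variable_1": "detailed instructions"},
        "measurement_scales": {"variable_1": "ordinal"},
        "universes": "to which the hypothesis and the prediction(s) apply",
        "sampling": "exact manner in which the samples are drawn or composed",
        "confirmation_criteria": {"null_hypothesis": "...", "test": "t", "alpha": 0.05},
        "rationale": {"sampling": "justification of this particular choice"},
    }

    missing = [key for key, value in research_plan.items() if not value]
    assert not missing, f"blueprint incomplete: {missing}"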
According to the nature of the case, the written plan can then follow up these procedural points by a more or less detailed advance discussion of what is after all the most important aspect of wider import: the confirmation value of possible findings with respect to their generalization in theory and, possibly, in practical application. That all this should be decided and set down in advance is a demand inspired not only by considerations of a practical and administrative nature. Efforts to ensure a smooth and, as far as possible, flawless realization of the testing procedure are of course important in themselves. More important, however, is the use of the preliminary versions of the research plan as a working paper. Only if all the successive steps are thought out thoroughly because they have to be written down, will weak spots in the experimental design, ambiguities with regard to confirmation (5; 1; 2), and problems of practicability (5; 1; 3) be clearly brought to light. The investigator can then attempt to remedy them beforehand, rather than wait till he is confronted with the fait accompli of an abortive investigation or an outcome that allows no clear interpretation.

In some cases it is essential to work out in advance the consequences which a particular empirical outcome — if found — would have for the theoretical model that the investigator believes to be suited to his purpose. To this end he may have to go into details of a mathematical nature. A good example is the policy adopted by Clyde H. Coombs in his research on the psychological utility functions that determine human behavior in betting and games of chance (COOMBS 1958, 1967). Before an experimental set-up was decided on, the question would be asked again and again, what kind of (mathematically formulated) model would correspond exactly with a particular, empirically possible, behavior pattern of his subjects. Only when this had been clearly analyzed, that is, when the experimental design had been made such that the outcomes discriminated sharply between different acceptable and clearcut utility models, would the go-ahead be given for experimentation. Unfortunately, such deliberations and discussions prior to the investigative procedure are hardly ever given more than a passing mention in the published report of the investigation — all the more reason to mention them here.
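The logic of this policy can be shown in miniature. In the sketch below — an invented example, not Coombs' actual models — two candidate utility models are asked in advance which gamble they predict a subject will prefer; only because they disagree here would the design discriminate between them:

    # Invented miniature of advance model analysis: compute each candidate
    # utility model's predicted choice; keep only designs on which they differ.
    import math

    gambles = [(0.5, 100.0), (0.9, 50.0)]       # (probability, payoff) pairs

    def expected_value(p, x):
        return p * x                            # model 1: linear utility

    def log_utility(p, x):
        return p * math.log(1 + x)              # model 2: diminishing utility

    for model in (expected_value, log_utility):
        pick = max(gambles, key=lambda g: model(*g))
        print(f"{model.__name__} predicts preference for {pick}")
    # The two models predict different choices on this pair of gambles,
    # so observed behavior here would discriminate between them.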
After everything that has been said above, the significance that mutual criticism and discussion and the exchange of ideas have for working out a good experimental design should hardly need stressing. One form of exchange is consultation of individual experts, another is group discussions with research colleagues and, in the academic world, with assistants and advanced students. In this respect European research practice generally lags behind that of the United States — not only in the behavioral sciences. Critical group discussions are naturally most necessary when the research project is a combined effort of individuals or institutes. In these circumstances the entire experimental design must be analyzed, to the point where agreement has been obtained that an investigation of this particular form is both meaningful (5; 1; 2) and well-planned, so as to be capable of smooth execution (5; 1; 3). But also when the investigation is not a cooperative effort which calls for executive meetings, discussions with research colleagues are useful in this stage. They should produce either an approved or revised experimental design or, possibly, a well-founded decision that an investigation on the suggested lines is not meaningful or cannot be adequately realized. In the training of behavioral science researchers, systematic discussions of this kind are indispensable.

Finally, an important device for securing a good investigative design is the empirical (experimental) preliminary investigation. Particularly in large-scale, costly investigations it is essential to make sure in advance that risks of failure or ambiguous results are as far as possible eliminated. This can often be achieved by a small-scale pilot investigation, in which the entire set-up is given a trial run. Its purpose is not to obtain final results, but rather to see whether 'things are working out,' whether the various parts can be carried out in practice, whether the situations or conditions that are to be created and compared experimentally do in fact materialize, and so forth. Sometimes the preliminary investigation will be aimed only at certain critical parts. For example, if in a group experiment the objective is to analyze the behavior of subjects who feel 'left out by the group,' a pilot investigation may be used to make sure whether those subjects really feel 'left out.' That is, does the device envisioned for this purpose, e.g., deliberately frustrating behavior by stooges within the group (HUTTE 1953, p. 15 ff.), actually produce this effect? The required extent and scope of such preliminary investigations will vary with the subject matter, but it is generally safe to recommend as a minimum requirement that the research method be tried out at least on a number of cases to see whether 'it works.' This will very often lead to improved instructions, to the removal of ambiguities and misunderstandings and the like — and of course it may lead to the exceedingly important conclusion that the experimental design is inadequate and must be radically altered or else abandoned.

This section is marked by a strongly argumentative, almost propagandistic tone. The reason for this is not hard to find. In research in the social and behavioral sciences, and in Continental research in particular, the importance of thorough preparations is all too often underestimated. There is undue eagerness to 'get the thing started' or to obtain results, and there is often too much fear of criticism, preventing consultations with experts and colleagues. Especially in applied areas — educational, industrial, clinical — the result is a spate of research undertakings, both large and small, which through faults in their design are theoretically insignificant and/or practically all but worthless. It is of the utmost importance that researchers and future researchers be trained in the techniques of experimental design described here. In addition, research promoters — foundations, government agencies — who provide the financial wherewithal, should be aware of their significance.
5; 2 FROM FORMULATION TO TEST: AN EXAMPLE

5; 2; 1 Psychosomatic specificity
The process in the third phase, viewed as a series of well-considered choices by way of deduction and specification, can best be elucidated by means of a concrete example. For this purpose we have chosen a study in the field of psychosomatics, Barendregt's dissertation: 'The hypothesis of psychosomatic specificity tested against the Rorschach responses of patients suffering from asthma bronchiale' (BARENDREGT 1954). In this report, as usual, there is barely any mention of the preparatory work underlying the experimental design (cp. 5; 1; 4), in the form of pilot investigations, trial analyses, discussions in the psychosomatic study group (then headed by J. Groen) and with the sponsor of the dissertation. These details, it will be clear, are not needed for the logical dovetailing Barendregt has evidently aimed at in his study. What is needed to achieve this end, and what is actually worked out here in exceptionally clear and instructive detail, is an account of the successive decisions (deduction and specification steps) which have eventually led to this particular design of the experiments. These decisions — and the corresponding narrowing of the scope of the problem — will be discussed below with respect to one of the hypotheses tested by Barendregt.1 Just a few brief comments will be given. The reader is invited to forge his own links with the subject matter of 5; 1 by analyzing for himself the various steps and the considerations on which they are based.

1 To facilitate our exposition, a slightly different arrangement has here been adopted than in Barendregt's dissertation; aside from this minor difference, the main lines of the argument are identical.

The term psychosomatics, which in itself suggests interaction between psychic and somatic phenomena, is generally used with particular reference to the study of etiological factors of a psychological nature in the pathogenesis and/or development of somatic disorders (GROEN, VAN DER HORST and BASTIAANS 1951). The existence of such influences is presupposed. In particular, it is postulated that they are highly instrumental in causing a certain class of disorders, which are accordingly called psychosomatoses: colitis ulcerosa, ulcus ventriculi, asthma bronchiale, and others.
The mechanism which is thought to govern the operation of psychic conditions on bodily functions has been the subject of a great deal of theoretical discussion, based in part on some fundamental experimental studies (e.g., CANNON 1929 and 1936). The crucial assumption at the root of the theoretical argument is that emotions and tensions can produce functional changes in the endocrine-vegetative system, which will in particular affect certain organs; and that, if the underlying conditions of emotional stress are prolonged or frequently repeated, these functional changes or malfunctions may result in organic disorders (VAN DE LOO 1952, p. 61). The man in the street will then say, for instance: 'he has a nervous stomach' (i.e., the patient has developed gastric ulcers as a result of emotional stress).

A number of investigators have, as a first specification of this general theoretical argument, put forward the so-called hypothesis of psychosomatic specificity. This assumes a specific relationship between, on the one hand, the nature of the psychic stress situation and, on the other hand, the organ(s) which will be affected, and hence the psychosomatic syndrome that will result. Now, 'the nature of the psychic stress situation' in which a person finds himself is to be regarded as depending both on his outward circumstances, that is, on the nature of the 'situational stress' to which he is exposed, and on his personal reaction to it, that is, on his 'personality disposition.' Accordingly, the following three readily deducible consequences present themselves for investigation (GROEN, VAN DER HORST, BASTIAANS 1951):

1. Specific personality factors determine an individual's susceptibility (predisposition) to specific psychosomatoses (only specific personality types are susceptible to the specific stress situations that cause the disorder);

2. Specific types of external stress situations correspond with specific disorders, which they precede (only specific external situations are capable of causing the specific conditions of stress);

3. The manner in which an individual of a specific personality disposition experiences and reacts to the situation in which he finds himself is peculiar to a specific disorder (cp. BARENDREGT 1954, p. 3).

The consequences stated under 1 and 2 bring us only within hailing distance of the study of the pathogenesis of the disorder. However, they are more readily amenable to empirical investigation than the third consequence, and they lend themselves fairly well to what might be called 'existential proof' of at least a certain degree of specificity in psychosomatic etiology.

So much for the hypothesis of psychosomatic specificity as conceived by the study group directed by Groen. Any presentation of the background and content of the hypothesis — or rather theory — is naturally incomplete in the absence of a more detailed account of what is understood by 'specific' personality factors, 'specific' situations, etc. Such an account can be found in the relevant clinical literature (e.g., ALEXANDER 1943; GROEN, VAN DER HORST and BASTIAANS 1951), albeit in terms that are descriptive rather than operational. For our purpose there is no need to go into details here.

5; 2; 2 Step by step specification of the problem
Barendregt's research was concerned solely with the consequence stated under 1, that is, with the specificity within the 'personality structure.' This is a first, logical, restriction; a 'particularization' of type pl (cp. 3; 2; 1). This choice was made because, as a psychologist, Barendregt wanted to use psychological tests as criterion measures. Since, however, there are formidable difficulties in the way of operationalizing a concept like 'personality' in an adequate and objective manner, the first additional restriction required was the selection of particular personality traits. These had to be such that they were, first, determinable with sufficient adequacy and objectivity by psychological tests and, second, derived logically from the available body of theory formation concerning psychosomatic specificity. This theory still consisted mainly of general descriptions — based on a fairly extensive collection of clinical case studies — of personality dispositions correlating with specific disorders. Although these descriptions were rather vague, and different authors were not always in agreement, it was nevertheless imperative to utilize to the fullest possible extent the clinical experience and tentative theory construction embodied in these efforts. Personality traits there given prominence had to be selected. Whenever a characteristic or personality trait is operationally defined by means of test variables, the subsequent investigation will assume the nature of a search for the occurrence of certain patterns of test behavior. Considering that Barendregt concentrated his efforts on asthma, and for the purpose of coordination with earlier investigations chose the Rorschach as his instrument, the objective, at this stage of specification, can be defined as follows: To test the hypothesis that certain behavior patterns, which can be adequately and objectively recorded by means of the Rorschach test, are, in consonance with the clinical theory, characteristic of patients suffering from asthma.

It will be apparent that the process of particularization has by now reached a fairly advanced stage. So far the specifications are in the main of type pl: specifiable personality characteristics (traits, behavior patterns) may be regarded as part of the personality structure; traits determinable by personality tests as a subset of all the characteristics deducible from the theory; Rorschach test variables (behavior patterns), in their turn, as a subset of the last one. The choice of asthma among the psychosomatoses, finally, again constitutes a logical particularization (type pl). The considerations on which these choices were based were mostly of a practical nature (5; 1; 3): restriction of the subject matter, because it is impossible to investigate everything at once; tests as the most appropriate instrument for the psychologist in the research team; the Rorschach for the sake of continuity with earlier investigations; and, similarly, asthma because a fair volume of work had been done on this disorder and because financial support was available for this particular field (op. cit., note opposite p. 1).

5; 2; 3 Empirical specification of concepts
Elaboration of this objective inevitably leads to further empirical specifications. For one thing, the 'specific' trait concepts must be selected and operationally defined or instrumentalized by means of objectively determinable Rorschach behavior patterns. To simplify matters, we shall from here on confine our attention to one of the seven (sub)hypotheses tested in this study, namely Barendregt's sixth hypothesis (op. cit., p. 20). The personality dispositions of asthmatics are, according to the theory, characterized by hostile aggression, as distinguished from the aggression of ulcus ventriculi or duodeni patients, which, according to GROEN (1947, 1950), is indicative, rather, of a competitive disposition. Asthmatics are reputed to harbor hostile wishes more often or more strongly than others, but they suppress or repress them. This is supposedly one of the factors responsible for the feeling of constriction or oppression characteristic of asthmatics, also outside their attacks (Barendregt's fifth hypothesis, op. cit., pp. 42-43).

'While we must assume,' so Barendregt argues (op. cit., p. 20), 'that these wishes are repressed, so that they hardly manifest themselves in daily life, there must, if they exist, be areas of behavior in which they do manifest themselves. One of these areas is to be found in Rorschach responses, since both conscious and unconscious, both manifest and latent, wishes can reveal themselves there.'

Let us take a closer look at this argument. The first step, 'there must be areas of behavior in which these wishes do manifest themselves,' is evidently based on the principle of testability (cp. 3; 1; 4 and 4; 3; 1). This is here manipulated in the following manner: it is meaningless to hypothesize repressed (unconscious) wishes unless it is at the same time assumed that these wishes will somehow manifest themselves in the patient's behavior; some form of factual manifestation must be possible, otherwise the hypothesis is not testable — and hence worthless. In view of the principle of testability this inferential step is unassailable, and it involves no loss of generality. It may be regarded as a purely logical step of type gl (cp. 3; 2; 1). Alternatively, it may be considered the first step toward empirical implementation that must be taken in every investigation (type gs) — even if, strictly speaking, there is as yet little question of 'specification.'

But Barendregt's argument, as quoted above, goes beyond this point. He assumes manifestation as (a) wishes, and (b) in the Rorschach. Assumption (a) hardly constitutes a restriction, in view of what, under the influence of psychoanalysis and other depth psychological ideas, has become the present broadened usage of terms like 'wish.' Once it is accepted that a hypothesized 'unconscious wish' is indeed a wish, and that the often highly indirect and sometimes symbolic behavior manifestations that correspond with such a hypothesis are expressions of this wish as such, then the condition 'expression as a wish' allows quite an elastic interpretation. It then means no more than: in accordance with theories and views current in clinical psychology regarding what constitutes a wish and how it can manifest itself.1 From this viewpoint, (a) does not imply a new specification; assumption (b), on the other hand, does. Barendregt himself indicates this in his formulation: 'one of these areas (in which wishes can manifest themselves) is to be found in Rorschach responses.' This means that there may be other areas as well; and that the Rorschach is one of these areas, is assumed by him. This, therefore, is a 'particularizing' empirical specification step (type ps), which is not logically 'necessary.' In principle, psychological (theoretical and/or empirical) arguments may be advanced against the view that this type of 'repressed wishes' must, or even can, manifest themselves in the Rorschach — a matter on which the forum must pronounce.

1 This usage is no doubt open to objections. The hypothetical character of the assumed existence or operation of unconscious or repressed wishes is obscured by it. This, however, is irrelevant to the present discussion, since Barendregt conforms to the current usage.

If we accept Barendregt's argument, the point to be demonstrated is that asthmatics produce a marked number of responses in the Rorschach that have a hostile content. What we need here is a criterion — when is a response 'hostile'? — and a method of scoring which will provide an index of hostility to be used in comparing asthmatics with others. In other words, the notion of 'harboring hostile wishes' must be empirically specified to the point where an instrument will be obtained that embodies an objective operational definition of the corresponding variable.1 The measure chosen by Barendregt was the index of hostility constructed by ELIZUR (1949), which had been recommended and used earlier in Rorschach content analysis. In fact, the availability of this instrument had been one of the considerations, in this case of a practical nature (5; 1; 3), which determined the choice of this particular hypothesis. Elizur's index is based on a simple count of the number of Rorschach responses that can be regarded as 'hostile' according to certain carefully defined criteria. The index was certainly not an ideal instrument from the viewpoints of objectivity, accuracy, stability and adequacy (cp. Chapters 6 through 8), but we shall not here go into details. Like Barendregt himself (op. cit., p. 43), we shall content ourselves with the statement that others had worked with this variable, and that positive results regarding its reliability and validity had been reported.

1 Sometimes three terms are almost interchangeable: a concept is 'empirically specified,' 'instrumentally realized,' 'operationally defined.' The differences in meaning — partly in emphasis — will have become clear by now. Instrumental realization implies the making of an instrument (cp. also Chs. 6, 7, 8), which then embodies an operational definition of the concept in the shape of an empirical variable. Empirical specification is effected through specification steps (gs and ps), which need neither in themselves nor together be sufficient for a complete operational definition; in addition there may be needed, for instance, instructions for the scoring and calculation of the variable. An operational definition, for that matter, need not contain any empirical specification step at all: a series of calculational instructions may in itself be considered an 'operator' (BRIDGMAN 1928), which defines a mathematical concept (cp. also BERGMANN and SPENCE (1941) 1956).

A few remarks must be made, however, about the question of whether the investigator has made a justified and meaningful choice in electing to work with a far from ideal operational definition of hostility. In Barendregt's case, this question can be answered in the affirmative. At this stage of inquiry into the hypothesis of specificity, the main purpose was to demonstrate the existence of differences between asthmatics and others. A significant statistical difference in the predicted direction, with regard to one or more variables, would itself be of importance, irrespective of the question whether these variables were perfect representatives of the — still rather vague — theoretical concepts. Construct validity of the variable (cp. 8; 2; 3) was not such a pressing problem as it may be in the case of more refined theories and constructs with a more fully elaborated nomological network. As for reliability, even when it is no more than moderate, it is possible to demonstrate statistical differences between fair-sized samples. In other words, the main concern was to choose, roughly in accord with the theory, a practicable instrument capable of demonstrating differences between asthmatics and others — regardless of their precise psychological meaning. In view of this relatively modest confirmation purpose, the instrument chosen is satisfactory, and the appeal to the fairly successful work done with it by others is to be regarded as valid.
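The mechanics of such a content-count index are simple enough to be made concrete. The following fragment is merely a schematic sketch, not Elizur's actual scoring manual: the keyword criteria and the sample protocol are invented for the illustration, whereas the real index rests on the carefully defined criteria mentioned above, applied by a trained scorer.

    # Sketch of a content-count index in the style of Elizur's hostility score.
    # The cue list below is a hypothetical stand-in for the carefully defined
    # scoring criteria of the actual index.
    HOSTILE_CUES = {"blood", "fight", "weapon", "claw", "explosion"}

    def hostility_index(protocol):
        """Count the responses in one Rorschach protocol that meet the
        (placeholder) 'hostile' criterion."""
        return sum(
            1 for response in protocol
            if any(cue in response.lower() for cue in HOSTILE_CUES)
        )

    # An invented protocol of verbal responses:
    protocol = ["two bears fighting", "a bat", "blood on a wall", "a flower"]
    print(hostility_index(protocol))  # -> 2

Whatever the criteria, the essential point is that the variable is defined by an explicit counting procedure rather than by the scorer's global impression.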
5; 2; 4 Experimental design: further specifications

Operational definition of the behavior patterns mentioned above has thus been realized, but the experimental set-up still needs to be worked out in detail. Further specifications will result from the choices that the investigator must now make. With respect to the experimental design of Barendregt's investigation (1954), we know already that he has restricted himself to the Rorschach in the matter of instruments, and to asthma as regards psychosomatoses. The contention that certain test behavior patterns are characteristic of asthmatics can, however, be demonstrated only by showing that they occur in these patients and not in others; or, to put it less stringently, that their occurrence is more pronounced and/or more frequent in asthmatics than in others. Who, however, are these others? With what control population should the experimental population of asthmatics be compared? Barendregt here follows GROEN (1953) in distinguishing three basically different consequences of maladjustment:

1) conflicts with the external world: socially unacceptable behavior, in the extreme case psychopathic behavior;

2) intrapsychic conflict: psychoneurosis and psychosis;

3) bodily disorders, in particular psychosomatoses.

Thus, from these theoretical considerations, there are four possibilities, namely, comparison of asthmatics with: normals; 'psychopaths'; psychoneurotics and psychotics; and other psychosomatic patients. Barendregt has confined himself to the first and last possibilities. The hypothesis selected here for closer study posits the harboring of aggressively hostile wishes by asthmatics. It is assumed that this is not the result of the patient's illness (and hospitalization) but a specific characteristic of asthma. Hence it follows that the hypothesis can best be tested by comparison with sufferers from other psychosomatic disorders (likewise hospitalized). For this purpose, Barendregt has chosen patients suffering from ulcus duodeni. The hypothesis thus becomes, in the stage of specification now attained: In their Rorschach responses asthmatics reveal hostile wishes — to be measured in terms of the hostility index of Elizur — more frequently than patients with ulcus duodeni.

Of course, operational definitions were now required of 'asthmatics' and 'ulcus' patients. In this matter Barendregt followed the diagnoses of the medical consultants on the team. Only cases which, according to these medical consultants, were beyond doubt were included in the investigation.1

1 Considerably greater difficulties were met when, for the testing of other hypotheses, an operational definition had to be formulated of 'health.' The instrumental realization of this concept was based on the card indexes of two general practitioners and house calls by a sociologist (see further op. cit., p. 11).

The next question to be answered concerned the composition of the experimental and control groups of subjects (patients) for the actual experiment. A concern of prime importance from the viewpoint of confirmation was experimental elimination of potential extraneous or 'disturbing' factors (cp. 5; 1; 2). To this end Barendregt used so-called matched groups, which were, as far as possible, equated for a number of variables that are known to affect Rorschach variables. He worked with groups (samples) of 20 subjects each, all adult men, with roughly identical distributions of age, intelligence and occupational level (op. cit., pp. 12-14). In addition, all the subjects were tested by the same experimenter — an experienced Rorschach tester, who did not know the specific research goals. The effect of these procedures is that any differences found later can with reasonable certainty be ascribed to the experimental factor (asthma), irrespective of the possible relationships between the controlled variables and the Rorschach scores used (op. cit., p. 12).

There is clearly a sound reason for these procedures: enhanced certainty that any positive outcomes will support the experimental hypothesis and leave no loopholes for alternative interpretations. On the other hand, they again imply restrictions and result in a further narrowing of the original scope and content. Strictly speaking, findings regarding this particular hypothesis allow only generalization from the investigated sample to a population of hospitalized male patients with comparable distributions as regards age, intelligence and occupational level. With regard to women and children, for instance, nothing has been established, because of the choice of men as subjects. Also, the formulation ought, strictly speaking, to include the qualification: 'as recorded by this experimenter.' In principle, the possibility cannot be ruled out that Rorschach protocols recorded by other experimenters do not discriminate between asthma and ulcus. Nor is it certain that the findings can legitimately be regarded as generally characteristic of asthmatics, that is, also in comparison with other populations. The truth of the matter might well be, for instance, not that asthmatics score extremely high on the hostility index, but that ulcus patients score exceedingly low. Every choice and decision implies a restriction, the immediate consequence of which is a narrowing of the experimental question. The question as to what consequences this has for confirmation will be further discussed in 5; 3.
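The matching procedure itself can be pictured schematically. The sketch below pairs each experimental subject with the most similar unused control; the matching variables, the distance measure and all records are assumptions made for the illustration, not Barendregt's actual procedure.

    # Schematic sketch of forming matched groups. Variables, distance measure
    # and all patient records are invented for the illustration.
    def match_groups(experimental, controls, keys=("age", "iq", "occ_level")):
        """Greedily pair each experimental subject with the nearest control
        on the matching variables."""
        unused = list(controls)
        pairs = []
        for subject in experimental:
            best = min(unused,
                       key=lambda c: sum(abs(subject[k] - c[k]) for k in keys))
            unused.remove(best)
            pairs.append((subject, best))
        return pairs

    asthma = [{"age": 45, "iq": 113, "occ_level": 3}]
    ulcus = [{"age": 47, "iq": 110, "occ_level": 3},
             {"age": 30, "iq": 95, "occ_level": 1}]
    print(match_groups(asthma, ulcus))  # pairs the 45-year-old with the 47-year-old

Note that such matching equates the groups only on the variables chosen; it is precisely this that restricts the population to which the findings may afterwards be generalized.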
5; 2; 5 Statistical testing: final decisions
We have now almost arrived at the point where Barendregt's sixth hypothesis can be formulated as a prediction. Our expectation is that when two samples of 20 subjects each, taken from asthma and ulcus patients respectively, and matched in certain respects, are studied, the former will in general exhibit higher hostility scores (according to Elizur's measure) in their Rorschach responses. More specifically, a 'significant' difference is expected. But the final decisions, those concerning the statistical procedure to be adopted, are still ahead. What model is to be used? What statistical test is to be employed? What significance threshold will be adopted? Is the test to be one-tailed or two-tailed?

It is obvious that a null hypothesis will have to be used. If ulcus patients and asthmatics are viewed as two populations of hospitalized males matched with regard to age, intelligence and occupational level, the null hypothesis will read: 'there is no difference between the two populations as to the distribution of the hostility variable specified in the hypothesis.' The testing procedure therefore seeks to establish whether, statistically, there are valid grounds for rejecting this null hypothesis. When, to this end, a specific statistical test is chosen, this again entails a specification: different tests are based on different assumptions regarding the population. Also, different statistics vary in the sensitivity with which they register deviations from the null hypothesis. For this (sixth) hypothesis of hostile wishes, Barendregt chose the Wilcoxon two sample test (or, the Mann-Whitney U test; cp. SIEGEL 1956, p. 116 ff.). This is a nonparametric test (cp. 7; 2; 2), that is, one in which no specific assumptions are made regarding the distribution of the variable within the population(s) — a judicious choice in this case, since little or nothing is known about the distribution of the hostility score. The null hypothesis now implies that the probability, within the population, for the hostility score of an asthma protocol to exceed that of an ulcus protocol is equal to the probability of the opposite finding — provided both protocols are picked at random. The alternative hypothesis, against which the null hypothesis is tested and which represents the hypothesis derived from the theory, states that the former probability is greater than the latter, that is to say, that the 'bulk' of hostility scores of asthmatics will be higher than the 'bulk' of scores of ulcus patients. To test the null hypothesis, the scores of the samples representing the two groups are combined and arranged in a descending order of magnitude. Thereupon it is ascertained for each asthma score how many ulcus scores are in excess of it. The resulting numbers are added up, and this yields the statistic U. It is now possible to calculate the probability — on the assumption that the null hypothesis is valid — that such an extreme (i.e., extremely small), or a still more extreme, U value will occur by chance. If this probability is 'very small' — which requires further specification — the null hypothesis is rejected (cp. the reasoning in 4; 1; 2). To us the essential point is that a fresh particularization has been introduced: a particular mode of deviation from the null hypothesis is specified by the statistical test chosen.

In order to turn the hypothesis into a concrete prediction, there are required,1 finally, the choice of a significance level and the decision whether the test is to be one- or two-tailed. Considering that the alternative hypothesis specifies unambiguously the direction of the expected difference, Barendregt chose the one-tailed test, and the 5% level. These are not, indeed, very rigorous standards, but at the given stage of development of investigations of this kind there was, in fact, little reason to tighten up the requirements where samples of this relatively small size are concerned. At all events, any requirements of this kind must be stated in advance; this, then, is the last step in the process of specification. The prediction now reads as follows: In two samples of, respectively, 20 asthmatics and 20 ulcus patients matched on a number of variables, the hostility scores (according to Elizur's procedure) will in general be found to be higher in the first group than in the second; the difference is expected to be significant at the 5% level in Wilcoxon's one-tailed two sample test. Thus the process of deduction and specification has been brought to a conclusion. What remains to be done is to carry out the testing — and to evaluate the outcomes.

1 These requirements follow from current traditions in statistical hypothesis testing. The question whether these traditions might be profitably replaced by others is not considered here.
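The counting procedure described above is mechanical enough to be written out in full. The sketch below computes U exactly as in the text — for each asthma score, the number of ulcus scores exceeding it, summed over all asthma scores — and then obtains a one-tailed p-value; the scores are invented, and the use of the scipy routine is merely one convenient possibility, not part of the original design.

    # U as defined in the text: for each asthma score, count the ulcus scores
    # that exceed it, and sum these counts. A small U favors the alternative
    # hypothesis that asthma scores tend to be the higher ones.
    def u_statistic(asthma, ulcus):
        return sum(sum(1 for u in ulcus if u > a) for a in asthma)

    asthma_scores = [9, 7, 6, 8, 5, 7, 10, 6, 8, 7]   # invented hostility scores
    ulcus_scores = [5, 4, 6, 3, 7, 5, 4, 6, 5, 4]
    print(u_statistic(asthma_scores, ulcus_scores))    # -> 5

    # One-tailed test at the 5% level via an equivalent library routine:
    from scipy.stats import mannwhitneyu
    result = mannwhitneyu(asthma_scores, ulcus_scores, alternative="greater")
    print(result.pvalue < 0.05)  # True for these data: reject the null hypothesis

The library routine states the statistic in the complementary counting direction, but the decision it yields is the same; what matters methodologically is that the mode of deviation tested is fixed by this choice.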
5; 3 TESTING AND EVALUATION
5; 3; 1 Execution of the testing procedure
Ideally, a hypothesis testing procedure for which detailed preparations have been made will proceed 'smoothly,' i.e., entirely according to plan. Any potential disturbing factors will have been spotted and eliminated beforehand; successful pilot investigations will have guaranteed the practicability of the undertaking; every detail of the actual testing procedure will have been arranged in advance, and set down in black and white; there is virtually nothing that can go wrong. This ideal picture can sometimes be realized in practice, in particular when the corresponding activities can be confined entirely to the study or to the laboratory. In the former case the research data that are to be analyzed for the purpose of hypothesis testing may either be already present or prove to be available elsewhere without unforeseen difficulties. In the latter case — e.g., in psychological laboratory experiments — the data, it is true, still have to be obtained, but it is not unusual that the conditions and the subjects (e.g., college students) can be controlled so well that everything will in fact proceed according to plan. From a methodological point of view, such cases require hardly any comment.

In general, however, even after thorough preparations, surprises cannot be ruled out altogether. This is particularly true in the case of field studies in which the investigator must depend on, for instance, the voluntary participation of subjects and/or the continued mediation and willing cooperation of others. It may occur, for instance, that a promise to open up certain archives is withdrawn, or that other sources are found to be inaccessible, or that carefully specified numbers (of cases or subjects) cannot be collected after all, or that human errors are made, or that unforeseen disturbing factors manifest themselves which upset the confirmation value of the findings. An example may illustrate this last point. In a study carried out at the Delft Institute of Technology, it was attempted to cross-validate some of the test findings obtained with college entrants of 1953 on the entrants of 1954. The second time around, however, attendance at the testing sessions was poor, and, what was worse, demonstrably distorted — possibly as a result of agitation against the study in certain quarters (TECHN. INST. DELFT 1959, p. 25). The sample, therefore, could not be considered (or made) representative of the Delft student population, so that the confirmation value of the outcomes became questionable. This is by no means an isolated case; whenever voluntary attendance or participation is needed, the risk of a disturbing selection factor is considerable and hard to eliminate.1

1 The influence that such selection factors can have was nicely demonstrated in a study carried out on part of the Amsterdam student body (SPITZ 1955). There it was found that the factor (voluntary) participation or non-participation in the experiment was itself a better predictor of future academic success than any test could have been — in the sense that a considerably higher percentage of the subgroup of those participating in the experiment graduated afterwards than of the absentees. For a comparable finding, see HUELSMAN 1968.

Because of such surprise elements entering into the execution of a hypothesis testing program, verification of the predictions may yield the third possible outcome (cp. 3; 4; 2): verifiability conditions not fulfilled. In research practice there will of course be many borderline cases: 'In spite of certain weaknesses the study has its good points.' It would surely be convenient if it were possible to draw a clear demarcation line between cases of total loss and worthwhile residues; i.e., between those cases in which it is better to give up the investigation altogether — and possibly throw out the data — and cases in which the optimal policy still is to try and make the best of it. This, however, cannot be done very well in general terms. Again we shall have to content ourselves with a few rather vague recommendations based on common sense and experience.

First, it is vitally important that the investigator should not conclude that there is a 'disturbing' factor in the experimental procedure unless there are unmistakable indications to this effect. He may be sorely tempted and quite often be in a position to interpret the outcomes thus, i.e., as a (c) instead of a (b) case (cp. 3; 4; 2); as we have seen, the verifiability conditions always leave a margin of uncertainty. However, the investigator ought to resist this temptation. He should not rashly assume a disturbing factor — merely in order to safeguard a cherished hypothesis against evidence to the contrary (cp. 3; 4; 3). This is in fact one of the rationalizations lying behind the unfortunately widespread practice of publishing only positive results (a). A highly undesirable consequence of this practice is that if one tries to find out from publications what confirmation is available for a given theory or hypothesis, one gets a distorted picture. Even if the investigator thinks that his interpretation 'verifiability conditions not met' is correct, he will be well advised to publish the negative (non-(a)) results as they are; together with his interpretation (c) if he so wishes — so that it can be challenged by others.
If, on the other hand, the disturbing factor can be specifically demonstrated, if for instance it is clear that the data were contaminated and in what way, or that the sample cannot be regarded as representative of the population, and the like, then the only logical decision may well be to relegate the study to the wastepaper basket, a decision that one must be prepared to make. A possible good reason for not doing so might be that an unprejudiced description of the failure may be instructive to others who intend to carry out investigations in the same area; or that the observed or supposed disturbing factors are instrumental in new theory formation. A vivid example of the latter is the above cited 'failure' of the Relay Assembly Test Room experiment in the Hawthorne studies (ROETHLISBERGER and DICKSON (1939) 1949).

Finally — and this is obviously a matter of common sense — the 'shortest' errors are the best. In investigations extending over a prolonged period it is essential to discover at an early stage that, for instance, the research design is faulty, or that the investigation cannot proceed according to plan because of extraneous factors — and to decide as soon as possible to abandon the project. Painful as this decision may be, it is in such a case the wisest course. Hypothesis testing procedures must not be carried through blindly, however perfectly their mechanics may have been prepared. As the procedure unfolds, considerations pertaining to confirmation or practicability (5; 1; 2 and 5; 1; 3) must have a chance to assume veto power.

Procedurally, Barendregt's investigation, to which our attention will again be confined in what follows next, was carried through without serious difficulties. As usual, this is evidenced in his publication by the fact that there is hardly any mention of the subject.

5; 3; 2 Disturbing factors
Can Barendregt's outcomes, which were positive for 6 out of the 7 hypotheses tested, be interpreted otherwise than as positive confirmation of these hypotheses? Were there any weak spots in his experimental design or in its execution? Were there contaminations allowing of alternative interpretations — in the sense of ascription to disturbing factors? Two main points have been made in critical discussions of his work.1

1 Both advanced by the late Prof. D. van Dantzig in the statutory debate prior to the conferment of the Ph.D. degree.

The first concerns the nature of asthma as diagnosed by the medical staff members. These latter may be assumed not only to have worked with the psychosomatic theory concerning asthma, but also to have subscribed to it. Most certainly, they did not adhere to the classical medical conception of asthma as an allergic disease. Now, supposing that both modes of origin occur, that both factors may be operative — a view held by many — then the fact that the patients participating in this experimental study, first, chose to register with this particular clinic and, second, were diagnosed as asthmatics by the medical staff working there constitutes a potential contamination. The sample may well contain a proportion of psychosomatic asthmatics in excess of that found in the entire asthma population. Or, to put it more strongly, the patients may have undergone a pre-selection not only for asthma, but also for their 'asthma personality' (according to psychosomatic theory); the correlation found may therefore be an artifact of this selection.

The second point concerns the scoring of some of the Rorschach variables used, e.g., the hostility index. Although precise guide rules had been laid down, the scoring was not altogether objective; moreover, it was performed by the investigator himself, who (a) knew from which patient group (asthma, ulcus, normal) a given response originated and who, like the medical team members, was (b) scientifically interested in a positive outcome of the experimental test. Again a 'disturbing' contamination factor.

As for the first point, it should be noted that the potential influence of this selection factor could hardly have been eliminated, given the organizational setup within which Barendregt was working. The study had to be carried out at this particular clinic; for this particular study only these patients were available. This to some extent exonerates the investigator — he was not responsible for that — but as a scientific argument it cuts no ice. For a scientific analysis we must try to gauge how serious the influence of such a selection factor can have been in the context of the experimental question, namely: Can it be shown that there is a (statistical) relationship between asthma and (one of) the specific personality traits ascribed to asthmatics by the psychosomatic theory? And if so, exactly what theory or hypothesis is refuted (cp. 4; 1; 3)?

As far as statistical testing goes, the experiment does not discriminate between the purely psychosomatic theory and the theory that both psychosomatic and allergic forms occur. The argument that because of biased selection the sample must have contained a relatively larger proportion of psychosomatic cases, therefore, does not carry much weight. As yet the primary concern is to demonstrate that such cases do in fact occur, rather than to establish their frequency within the population. The study only seeks to establish grounds for rejection of the null hypothesis: no difference between asthma and ulcus; and this ties in with the purely allergic theory. Only the 'strong' argument alleging that the correlation found is entirely an artifact of the selection process therefore presents a serious difficulty that would indeed impede refutation of the allergic theory. No strictly logical arguments can be adduced to dispose of this objection. The most that can be said is that it seems 'unlikely' that a statistically significant difference, in line with a theory based on careful clinical observations, could in such comparatively small samples have been caused entirely by any unconscious selection factor in patient registration (1), and in the undoubtedly largely objective diagnostic procedure (2). Particularly this latter point (2) seems to leave little room for ambiguity: an asthmatic is an asthmatic and is admitted as a patient. The first point (1) is harder to evaluate. There are most certainly cases in which precisely such an unverifiable selection factor has a misleading effect. In this instance, however, that would mean, in concrete terms, that hostile asthma personalities happen to show a preference for registering with this municipal clinic — which does not appear a very plausible assumption.

The second point is more serious: in principle the positive outcome for the non-objective variables may be an artifact of possibly unconscious, wishful scoring by the experimenter, however hard he may have striven for objectivity. It may be objected that the scoring was in fact subject to fairly strict rules and almost objective; still, this is not sufficient, particularly since this contamination factor could have been experimentally eliminated. The proper technique would have been to have the responses separately scored by a judge who had no way of knowing to which protocol they belonged. This criticism1 has in fact led Barendregt to repeat the investigation with an improved procedure (cp. BARENDREGT 1956, and BARENDREGT, ARIS-DIJKSTRA, DIERCKS and WILDE 1958). The outcome was, for his (sixth) hypothesis, positive once again.

1 This criticism had been voiced before at a staff meeting. The data processing and reporting, however, were by then too far advanced — a practical consideration — to call a halt and start all over again, which at an earlier stage would have been the only correct procedure. The effect cannot have been serious — cp. the open discussion of it on p. 36 op. cit. — all the same, it was a fault in the research design.
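The blinding step itself is easily pictured in procedural terms. The sketch below is only an illustration of the principle — responses are passed to the judge under shuffled code numbers, with the group labels held back until scoring is complete; the data and layout are invented, and nothing here reproduces Barendregt's actual procedure.

    # Sketch of blind scoring: the judge receives protocols under code numbers,
    # in shuffled order, with no indication of the patient group. All data are
    # invented for the illustration.
    import random

    def blind(protocols):
        """protocols: list of (group, responses) pairs. Returns the coded
        material for the judge and a key for re-attaching the groups later."""
        items = list(enumerate(protocols))
        random.shuffle(items)
        coded = [(code, responses) for code, (group, responses) in items]
        key = {code: group for code, (group, responses) in items}
        return coded, key

    protocols = [("asthma", ["blood on a wall"]), ("ulcus", ["a butterfly"])]
    coded, key = blind(protocols)
    # The judge scores `coded` only; the groups are restored afterwards via `key`.

The point of the device is not distrust of the scorer's honesty but elimination of a contamination factor that no amount of striving for objectivity can rule out.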
5; 3; 3 Problems of generalization

We have seen in 5; 2; 4 that the matching of the sample groups (asthma, ulcus, and normal) with regard to sex, age, intelligence, occupational level, and finally, experimenter — variables which are known to have a potential effect on Rorschach scores — has brought about a further narrowing of the concrete experimental question. As a result, the experimental group (asthma) becomes a sample from 'a population of hospitalized male patients with comparable distributions of age, intelligence and occupational level' (5; 2; 4, p. 148). Further, the experimental variable is, strictly speaking, only: the hostility score according to Elizur's procedure as derived from Rorschach protocols administered by experimenter E. If we accept the legitimacy of statistical generalization from sample finding to population, then confirmation has indeed been obtained for a general hypothesis, but this hypothesis relates to a highly specific characteristic in a limited population. Now, if the only objective were simply to prove the existence of certain differences — perhaps only in a limited subpopulation — then all these restrictions would not matter so much. But at the same time the investigator naturally wants to go beyond this point and obtain confirmation for the theory itself. Consequently, he is confronted with the question: How far is it legitimate to extend the generalizing procedure and expand the findings a) to a universe of less specific characteristics, b) to a more comprehensive population of subjects?

Barendregt himself has hardly touched on this question. His own evaluation on this score is restricted to a statement of his conclusions (op. cit., p. 49): 'It is the author's opinion that the confirmation obtained in the present investigation for these hypotheses derived from the medical literature also lends support to the generalized hypothesis of psychosomatic specificity.' Barendregt does not further specify the steps in the generalization process. This might be held against him as a shortcoming, notably from the viewpoint of the (fourth) principle which demands that the empirical references of a theory be clearly stated (3; 1; 5), were it not that Barendregt expressly presents his contribution as solely a test of hypotheses 'derived from the medical literature.' Therefore, the psychosomatic study group as a whole, rather than Barendregt personally, is primarily responsible for the empirical references. Moreover, there is no need for extensive evaluation after each and every study; that is often better postponed until such time as a number of related empirical studies are available for an overview. This does not mean, however, that the methodologist can now consider the subject closed. The considerations that follow will deal with the problem of generalization as it generally manifests itself in the evaluation of empirical or experimental results. From here on Barendregt's investigation will be referred to only occasionally by way of illustration.

It will be apparent that we are again confronted with the problem of induction (cp. 2; 1; 2) or, to put it differently, with the problem of generalized confirmation (cp. 4; 1 and 4; 2), and indeed with one of its most important and difficult forms. This problem is of major importance, for instance in the evaluation of rigidly controlled (laboratory) experiments in psychology. Frequently, in these experiments numerous restrictions and limiting conditions are introduced to ensure rigorous hypothesis testing with unambiguous statistical confirmation value — at the expense of the domain covered, i.e., of the generality of the reference universe(s).1 How are we, in this case, to make 'the way back' (cp. 4; 1; 1)? How can we proceed from here to those comprehensive general statements that are our real objective? It is a fact, at all events, that this 'way back' is one of the common highways of science, that we do make these generalizations, almost daily. This habit appears to be ingrained in the very language used: 'With scarcely an exception, the conclusions of all studies of behavior express an (...) expansion beyond the researcher's observations to an indefinite universe of events. We speak not of "the rats in this study" but of "organisms"; not of "running this alley" but of "response"; not of "college sophomores" but of "small groups". With remarkable unanimity, scientists are willing to lay down inclusive dicta about events which they have not observed, even about events which could not have been observed' (MANDLER and KESSEN 1959).

1 Compare discussions on the value of socio-psychological group experiments under 'unnatural' laboratory conditions, where content is sacrificed in favor of accuracy (e.g., DUIJKER 1955).

What is the rationale for such generalizations? How can they be justified? Strictly speaking, they are simply untenable from the viewpoint of logic if we do not accept a 'principle of induction' (cp. 2; 1; 2, see further 4; 1 and 4; 2). That is to say, the answer to the questions posed must of necessity be of an empirical nature: extend the investigation to include (all) other ramifications of the same hypothesis or theory. In the example under discussion this would mean: replications with other experimenters, other operational definitions (specifications) of 'hostility,' other test methods, and, finally, with other asthma personality traits derived from the same theory — all this as regards generalization in terms of characteristics. As for generalization of the population: extend and replicate the experiments to include other intelligence levels, other age groups, other occupational levels, and also women and children. This strictly empirical answer to the problem of generalization has the merit of being quite realistic, insofar as varying experimentation is absolutely indispensable for a more general confirmation of the hypothesis and theory concerned. But it would not be realistic to suppose that, given the multiplicity of possible ramifications, even an approximation to 'completeness' is within the bounds of possibility. The empirical answer therefore requires supplementation.

To some extent a technical answer will serve this purpose: through experimental techniques and statistical analyses. Modern techniques of experimental design and statistical multivariate analysis make it possible to manipulate a number of variables and to control or to determine a number of effects at one time (see for instance MAXWELL 1958). Thus various ramifications can be dealt with simultaneously in one well-designed investigation. A difficulty besetting experimentation in the behavioral sciences is, however, that many situational and subject characteristics cannot be systematically varied as easily as external experimental conditions can. Required combinations of psychological characteristics, such as intelligence or occupational level, for instance, cannot be manipulated; they can at best be laboriously assembled by painstaking search. Even apart from this difficulty, sophisticated experimental and data processing techniques may contribute to produce efficient research methods, but where complex theories (like that of the asthma personality) are concerned, they cannot make 'completeness' any less unattainable.

Another possibility is to attempt a probabilistic answer, in other words, to reduce the problem of generalization again to a question of statistical confirmation. To this end it will be assumed, for instance, that the various empirical specifications — the choice of an experimenter, of a personality trait, of an operational definition of this trait; and analogously the choice of the restrictions on the population such as sex, intelligence etc. — have resulted from a series of successive random choices among empirically given alternatives. If the assumption of random choices is indeed tenable, then a step-by-step procedure of this kind may be regarded as a method of composing a 'systematically randomized' sample from, on the one hand, all possible ramifications of the theory and, on the other hand, the total population to which the theory is relevant. The generalization in terms of population, then, no longer presents a problem: under a number of assumptions the sample may in fact be regarded as a random selection from the total population. As for the generalization in terms of hypothesized traits, if the procedure of 'random' selection of one particular trait is repeated a number of times, and if each result is positive, it will be possible, e.g., by means of a simple sign test (cp. e.g., SIEGEL 1956, pp. 68-75) or a more powerful method which allows for combining individual P-values, to obtain statistical confirmation of the entire theory.
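The arithmetic of such a combined sign test is elementary. The sketch below computes the one-tailed binomial probability of at least k positive outcomes in n independent tests when, under the null hypothesis, each outcome is positive with probability one half; the figures are merely illustrative, and the assumption of independent, random ramifications is exactly the assumption questioned below.

    # Sign test over n independent ramifications, k of which came out positive.
    # Under the null hypothesis each outcome is positive with probability 1/2,
    # so the one-tailed P-value is the upper tail of the binomial distribution.
    from math import comb

    def sign_test_p(k, n):
        """P(at least k positives out of n) when each is positive with p = 1/2."""
        return sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n

    print(sign_test_p(6, 7))  # e.g., 6 positive results out of 7 tests -> 0.0625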
This reasoning is certainly enlightening as a schematic demonstration of the possibility of probabilistic theory confirmation. Even if cases in which this approach can actually be made are exceptional, it is important to note that probabilistic arguments can be invoked to support the position that there is no need for completeness in the testing of ramifications. Actually, however, both the choice of a particular ramification and the generalized confirmation procedure will present a completely different picture. Nothing is in fact 'less random' than the specifications which the investigator chooses to introduce. They are, as we have seen in 5; 2, based on specifiable and realistic considerations pertaining to confirmation value (5; 1; 2) and practicability (5; 1; 3). To give another example, Kurt Lewin used to advise his associates when about to undertake research in a new field: 'Start strong.' That is, among other things, choose those ramifications and empirical specifications — e.g., hostility, male subjects — which you expect to produce clear, positive confirmation. If this expectation is fulfilled in preliminary experimentation, you know at least that what you intend to tackle (the field of inquiry, the theory) is worthwhile. Such a choice, however, is clearly at the opposite pole from random selection. The same applies to the great majority of the specifications and deductions in research practice that lead to the actual prediction. The probabilistic answer, therefore, even when combined with the 'empirical' and 'technical' ones, is insufficient; or rather, it is unrealistic.

In actual fact, the gaps between the scattered, isolated testing points in the area that a theory claims to cover, or — to use another metaphor — the large meshes between the strands of its nomological net, are filled in with other arguments in favor of generalization. For the generalization of research findings and the acceptance of a theory or hypothesis (4; 2), it is also of undeniable importance whether or not the generalization or theory is plausible. If it is, we make no bones about interpolating; if it is not, we demand additional intermediate testing points. Preferably, the testing points should be more or less evenly distributed over the area that the theory purports to cover; hence for instance the recommendation to test out the results of laboratory experiments in real-life situations (FESTINGER 1953, pp. 140-141). In other words, the actual generalized confirmation value of research findings will depend on whether the complex of interpretations and generalizations — interpolations between the testing points — makes sense and appeals to our insight. That is, it must be in accordance with our general experiences, for instance in the clinic, in everyday life, in applied areas, and thus in accordance with a large number of mainly implicit hypotheses1 which we accept on the strength of these experiences. To be sure, it is characteristic of science that these very experiences are, quite rightly, called into question again and again. But at some point there must be a truce to questioning, particularly in the evaluation of research findings. Thus, in bringing an investigation to a (provisional) close, we answer the generalization question in terms of what is plausible and evident, that is, we accept an 'insightful' system of relations — more or less in the manner of 'Verstehen.' That is how we do, in fact, tie up the strands of the nomological net; but it is to be expressly noted that in the scientific process such 'evident insights' can never be regarded as definitive arguments (cp. 2; 2; 5). Even accepted theories retain, scientifically speaking, their 'provisional' character (cp. 3; 2; 2 and 4; 2; 2); the inquiry can at any time be reopened in a fresh cycle (cp. 1; 4; 6). Any theory remains open to refutation or rejection.

1 When in learning experiments we expand our findings from rats to 'organisms,' the implicit hypothesis is quite clear: we assume an essential analogy in the reactions of rats and other animals. The same applies to, for instance, generalizations from men with an (average) age of 45 (BARENDREGT 1954, p. 12) to, say, 30 year old men; or from men with an (average) IQ of 113 (op. cit., p. 13) to men with an (average) IQ of 100. Some of these generalizations (implicit hypotheses) we simply accept — until such time as they are refuted.
Now, what possibilities does Barendregt's investigation offer by way of 'evident generalizations'? It would carry us too far into theoretical arguments to work this out in detail, but in summary we can indeed endorse Barendregt's 'evident insight' that the psychosomatic theory of asthma has received a measure of support through his investigation (but cp. 5; 3; 4). This is not to say that its position is as yet particularly strong. The fact is that the theory has certainly not found general acceptance (with the forum); but it is accepted, at least as a partial explanation, by an ever growing number of experts. The basis for this acceptance — and likewise in part for the generalization of Barendregt's findings — is formed by: a number of other psychological hypothesis testing investigations (e.g., HECHT 1952; POSER 1953; LITTLE and COHEN 1951; RAIFMANN 1957), by the relevant clinical case studies and interpretations (e.g., DUNBAR 1947; GROEN 1950), by daily experiences with asthmatics, by therapeutic results (e.g., GROEN 1950, 1953) — and by the plausible and 'insightful' relationships between all these factors as stated in the theory.

5; 3; 4 Cause or effect?
We cannot conclude our discussion of the evaluation of Barendregt's investigation without paying some attention to another possible objection: whether the peculiarities in the personality structure of asthmatics are not the effect rather than the cause of the disorder. In this counter argument, therefore, the concrete findings are accepted, together with the first generalization: asthmatics are more 'hostile,' etc. Consequently, we are not dealing with a disturbing factor (5; 3; 2) but with a possible alternative theoretical interpretation of the findings (cp. 5;1;2). The counter argument proceeds roughly as follows. Asthma is an allergic disorder caused by an allergic predisposition. Attacks are characterized by a feeling of suffocation. This sense of constriction, of oppression, comes to dominate the psyche of the asthmatic, also outside of his attacks (alternative interpretation of Barendregt's fifth hypothesis). In his social contacts the asthmatic thus tends to feel threatened and hampered in defending himself; hence the stronger hostility (sixth hypothesis). On the strength of Barendregt's experiments little or nothing can be advanced against this argument. His findings regarding hostility and the contentions concerning a personality structure characteristic of asthma 160
are not affected by it, but what is affected is the confirmation value for the psychosomatic theory of the etiology of asthma. If the hostility can also be an effect, then nothing has been proved regarding the personality structure as the causal factor. It will be apparent that no 'evident generalization' will avail against this argument. It is not a problem of generalization, but a question of causality, which cannot be answered by the kind of correlation research that Barendregt has conducted. The only way to counter it is the empirical one; for instance, investigate directly the pathogenesis of asthma in children, whose personality structures cannot yet have been affected by frequent attacks, or something of the sort. This will not be easy in view of the interaction that is bound to arise, at a quite early stage of development, between the personality structure and the asthma experiences. Perhaps the personality structure (Groen's first consequence; cp. 5; 2; 1) is not, after all, a good starting point. In the case of asthma at any rate, which often manifests itself at a very early stage, investigations directed upon the personality structure cannot discriminate sharply between cause and effect. If this conclusion is correct, it would probably be better to aim future investigations at environmental factors (Groen's second consequence; 5; 2; 1), for instance at the 'asthma mother' — who is described in the literature as 'lovingly overbearing' (cp. GROEN 1950).
If we review all the confirmation factors that have been discussed in 5; 3, it will be clear that while Barendregt's investigation answers certain questions and can claim a measure of confirmation value, it primarily raises a large number of fresh empirical and theoretical questions. Such is the case not only in this particular instance, but in virtually all empirical work. Every scientific investigation calls for, and leads to, a fresh round of more properly focused inquiry. The work goes on; the spiral continues to gyrate.
CHAPTER 6
OBJECTIVITY: A. THROUGH THE EMPIRICAL CYCLE
6; 1 THE PRINCIPLE OF OBJECTIVITY
6; 1; 1 What is objective?
In the preceding chapters the term 'objective' has been used repeatedly, mostly in the sense of a requirement to be met by an activity or the product of an activity in the scientific process. Now, a more detailed discussion of the concept 'objective' itself, of the principle of objectivity and of the methods used to guarantee or promote objectivity, is in order. The term is derived from 'object' as distinct from 'subject': the object is that to which the subject — the organism, man, in our case the scientific investigator — addresses himself; that which is before his mind; that which he perceives, observes, describes, seeks to study, envisages or wants to attain. In the latter meaning object is synonymous with goal or objective (cp. DREVER 1956 under 'object' and 'objective'). This is one of the meanings of the term in scientific usage, where one speaks of an 'object of study,' or the object of a science or of a branch of science (e.g., BRUGMANS 1954, DE GROOT 1950b). Apart from other connotative variations, these expressions are marked by a peculiar duplicity of meaning. In this type of context, the 'object' of a study or of a science is likely to refer both to: what is being studied (the phenomena, the data, the materials, the facts) and to: why it is studied1 (the research goals, the laws, theories, the insights we seek to gain, the ideas we are after; cp. DE GROOT 1952b; SNIJDERS 1951, 1952).
1 Apparently, this ambiguity was long the besetting predicament of psychologists seeking to define 'the object' of their science. By way of concise description, they would often proclaim that psychology is the study of behavior — but that in effect saddled the term 'behavior' with both loads. It was not defined merely as 'every recordable activity, or result thereof, of the (human) organism' — covering the data aspect: what is studied — but the activities must also be 'molar' or 'purposive' or 'individual,' or in some other way 'make sense' psychologically — adding the aspect of what kind of hypotheses are sought, why, for what purpose the activities are studied (cp. DE GROOT 1965). Nowadays, we speak freely of 'behavioral sciences' in the plural — indicating that the Why-element in the definition of behavior, if not entirely eliminated, is at least soft-pedaled.
These various aspects of 'object' are also found in the terms 'objective' and 'objectivity.' An activity or its results may be called 'objective' if, in accordance with the purpose envisaged, the object of study itself is done full justice — is allowed to speak for itself, as distinct from that which the observer, judge, interpreter, theoretician reads into it 'subjectively.' Especially this latter, negative aspect: absence of subjectivity as a disturbing factor,1 is characteristic of the term as usually employed. The general requirement of objectivity, then, implies that the investigator must act as 'objectively' as possible, that is, in such a way as to preclude interference or even potential interference by his personal opinions, preferences, modes of observation, views, interests or sentiments.
1 It should be noted that what is demanded is not absence of subjectivity as such. On such an interpretation, the requirement of objectivity would bar from study such subjective phenomena as hallucinations, opinions, judgmental processes, feelings. The term 'objective' is indeed sometimes used in this radical sense — e.g., in speaking of an 'objective psychology' which refuses to take into account verbal behavior content, or in contrasting 'objective tests' with e.g., questionnaires, which register (subjective) opinions, feelings, preferences. For our purpose we wish to dissociate ourselves expressly from any such interpretation: only disturbing subjectivity, i.e., subjectivity which contaminates the object of study, is here excluded. The object of study (that which is studied) may in itself consist of subjective data.
6; 1; 2 Objectivity a basic requirement
It hardly needs stressing that we are here dealing with the basic attitude of the scientist (cp. 1; 3). Unprejudiced, objective inquiry, which seeks only to make the object of study speak for itself, even when the investigator is in fact emotionally involved in the outcome, is the ideal of science. This ideal is by no means always easy to attain or even to approximate closely. In the pursuit of science there is often the same passionate personal involvement as in art, sports or politics; here, too, strong private interests may be at work — financial support, prestige and reputation, rivalry, possibly personal feuds. In addition, in the behavioral sciences the object of study itself is often an area beset with irrational sentiments.
In the study of human relations, politics, education, and even such a seemingly straightforward scientific topic as the heritability of intelligence (cp. PASTORE 1949), there are often special difficulties in the way of objectivity. Public opinion, and in particular the views and policies of powerful authorities on whom the investigator may depend for material support, are strongly affected by group interests. Anyone contemplating work in these areas should be prepared to take up the cudgels against contaminations by disturbing subjective factors.1
Subjectivity may make its disturbing effect felt in many different guises and in many places in the scientific process. In the example of Barendregt's research we have seen that criticisms, including his own (BARENDREGT 1954), focused on the not altogether objective sample of asthmatics and the imperfectly objective method of scoring (5; 3; 2). In the scientific debate, criticisms regarding deficiencies in objectivity are in fact particularly frequent. Compare for instance the criticism leveled at the Kinsey report, particularly at its sampling techniques (HYMAN and SHEATSLEY 1954b; DE KONINGH 1960), or the many critical discussions concerning instrumental realization of the construct of 'authoritarian personality' (ADORNO et al. 1950; see e.g., HYMAN and SHEATSLEY 1954a).
Numerous studies have shown that in hypothesis testing investigations there is indeed every reason to be suspicious of non-objective methods, that is, of human observations and judgments. Man, it appears, is an unreliable 'instrument,' particularly when emotional factors are involved. Experiments on the unreliability of eyewitness accounts, and on suggestibility, are among the oldest in psychology. More recent experiments (ASCH 1952) have shown that even such a simple judgmental task as estimating the relative lengths of two lines can easily be muddled by the suggestive influence of the (incorrect) judgments of others. Few findings in psychology are so abundantly documented as those concerning the
1 We deliberately exclude here bad faith on the part of the investigator, i.e., deliberately subjective distortions of facts and arguments. In actual practice, unfortunately, this is not always possible. In addition to 'political' evaluations of research outcomes — sometimes by the investigators themselves — there occasionally occur in science, as is well known, downright forgeries and frauds, sometimes on quite a large scale. One may in fact encounter all variations and combinations ranging from deliberate distortion (or fabrication) to unintentional subjectivity. For an instructive survey of excesses of this kind, of 'fads and fallacies in the name of science,' which sometimes enjoy great success and a large following, we refer to the book of the same name (GARDNER 1957).
subjectivity of observation and judgment (cp. e.g., SOLLEY and MURPHY 1960; FESTINGER 1957; see also VAN DE GEER 1955, III, 2). Frequently the factors affecting judgments escape detection by the observer himself. A common experimental finding, for instance, is that in judging photographic portraits for intelligence, trustworthiness, etc., the judge is to an appreciable degree influenced by cues such as dress and hairstyle without being aware that he is not judging the facial expression alone (cp. e.g., THORNTON 1943, 1944 on the effect upon judgment of wearing glasses). In other words, the observer or judge may actually be influenced by cues which he does not know he has taken into account, noticed, or even perceived. Modern research on perception without awareness (subception) and learning without awareness seeks to demonstrate the phenomenon experimentally and to explore its boundaries (cp. MCCONNELL, CUTLER and MCNEIL 1958). The fact that there is such a possibility at all throws serious doubt on any claims by judges that they have taken into account only the stated experimental factor — e.g., only the handwriting and not the contents (cp. DE GROOT 1947a, p. 384; JANSEN 1963).
It might be thought that at least the degree of (subjective) certainty or confidence with which an observation is reported, or a judgment or interpretation pronounced, would correlate positively with the objective accuracy of the observation, judgment, or interpretation. Experimental investigators, however, have repeatedly reported no correlation between a judge's confidence and his accuracy in judging (cp. BARENDREGT 1961, Ch. 5; GOLDBERG 1968). However, this outcome must be qualified: reports of 'no correlation' generally are found in investigations where sets of ambiguous stimuli, or difficult-to-judge borderline cases, were used. But, are not these borderline cases precisely the kind of materials for the assessment of which we are inclined to consult an expert judge?1 Insofar as this is true, it is not much use taking into account the degree of confidence with which a judgment is offered. Apparently, human observers, including 'experts' (clinical psychologists, physicians, judges), under circumstances where a number of doubtful indications can be combined to form an understandable pattern, tend to attach too much belief to this pattern — a psychological process which has been described (e.g., DE GROOT 1947a, p. 395 ff.), but which has not yet been sufficiently investigated (see, however, GOLDBERG 1968).
1 There is a 'restriction of range' effect here. If we consider, for a given judgmental problem, 'all' possible cases, including those which on the strength of the data are almost (objectively) certain, a clear correlation between certainty and accuracy will often be found. Within the more limited range of uncertain or ambiguous cases, however, the correlation will often be found to vanish, even when the judge is — objectively — an expert in the field in question.
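The 'restriction of range' effect in this footnote can be illustrated numerically. The following sketch is a present-day addition, not part of the original text: it fabricates confidence and accuracy scores that both track a latent 'clarity' of the case, and then shows the confidence-accuracy correlation collapsing when only the borderline cases are retained. All parameters are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Each hypothetical case has a latent 'clarity'; both the judge's
# confidence and his accuracy track clarity, plus independent noise.
n = 10_000
clarity = rng.normal(0.0, 1.0, n)
confidence = clarity + rng.normal(0.0, 0.7, n)
accuracy = clarity + rng.normal(0.0, 0.7, n)

def pearson_r(x, y):
    """Pearson product-moment correlation."""
    return np.corrcoef(x, y)[0, 1]

# Full range of cases, clear-cut and borderline alike:
print(f"all cases:        r = {pearson_r(confidence, accuracy):+.2f}")

# Restricted range: only the ambiguous, borderline cases, i.e. the very
# materials for which one is inclined to consult an expert judge.
mid = np.abs(clarity) < 0.3
print(f"borderline cases: r = {pearson_r(confidence[mid], accuracy[mid]):+.2f}")
```

Within the borderline band the correlation all but vanishes, exactly as the footnote describes.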
Often it is hard to tell precisely where the problem lies. It is just that the results make one wonder. A case in point is the evident subjectivity of psychiatric and clinical psychological diagnoses (cp. e.g., ASH 1949; FOULDS 1955; WALLINGA 1956), particularly if the outcomes of investigations are contrasted with the confident assurance with which such statements are apt to be made and the confidence with which they are accepted. In the same way, the correlation between individual characteristics of the experimenter (judge) and test scores obtained by subjects makes one uneasy (e.g., for Rorschach scores, see SANDERS and CLEVELAND 1953). There is room for skepticism, too, when pronouncements and findings concerning differences in intelligence between races or social classes, obtained in seemingly quite objective investigations, are found to vary greatly with time and, in part, with the political leanings of the investigator (cp. e.g., PASTORE 1949; SHERWOOD and NATAUPSKY 1967). For a discussion of the problems of objectivity in investigations in this area, the reader is referred to ANASTASI 1958; for 'the psychology of the experiment,' to ROSENTHAL 1963.
The significance of all these findings, unfortunately, has not yet found general recognition. In some of the more 'tender-minded' circles of behavioral scientists, subjective plausibility, 'insightful' relationships, a feeling of inner certainty in the framing or acceptance of an interpretation are still considered sufficiently firm grounds for assuming their validity. This is even more true of applied areas, for instance with regard to the use and acceptance of clinical methods in industrial psychology. A special difficulty is that the outsider — whether as patient, testee, member of a jury, layman in search of advice, or as research sponsor — is often insufficiently aware of the dangers of subjectivity. As for the research sponsor, this circumstance may lead him to question why serious investigators or consultants find it necessary to go to such great lengths, and to prepare such complex and costly designs, to guarantee objectivity. The sponsor is likely to expect too many results too quickly and too cheaply; in addition, he often wants more certainty than can reasonably be given.
After all that has been said, it will be clear that it is not perfectionism
(in the mode of l'art pour l'art) which makes the investigator observe objectivity in his procedures. It is bitter necessity; the requirement of objectivity is basic.
6; 1; 3 Objectivity in research design
In principle, any verb stating what the investigator does or should do may be qualified by the adverb 'objectively.' But this adverb, then, does not always have precisely the same meaning. The practical meaning of the requirement of objectivity varies fairly systematically with the phase of the cycle (1; 4) in which a given activity has its place. The first phase is characterized by 'freedom of design' (2; 1; 2); so it is obvious that here no strict objectivity rules can be laid down. The most that can be said is that the chances for sound, useful hypothesis formation are enhanced if the investigator can make unprejudiced (objective) observations, if he can objectively describe and order what he has observed, and if his interpretations are objective enough 'to do full justice to his object of study.' All this is quite relative, however; a certain degree of subjectivity in the 'creative' first phase is inevitable and even necessary. What the investigator does at this stage, and how he does it, is, in principle, still his private concern; so, at most a recommendation can be made: 'try to maintain (a degree of) objectivity in hypothesis formation.'
The fifth phase bears the most similarity to the first. True, evaluative statements do not remain private as a rule, but in large part the processes involved here are, again, those of summary (of the outcomes), assessment (of their confirmation value), and interpretation (frequently leading to fresh, modified hypothesis formation), which defy strict rules or criteria of objectivity. Of course, published reports must be objective, that is, true to fact and not tendentiously incomplete. But, the question whether they are, is itself a matter of, more or less subjective, judgment.1 The rule that can be given here is that the reader must be put in possession of all the relevant facts about the experimental design (cp. 5; 1), the data processing, and the outcomes in such a way that, if he should be so
1 Stated otherwise: this is a field governed more by existing traditions and 'unwritten rules' capable of different applications (cp. 1; 3; 4) than by clear-cut, explicit precepts. Fairly frequent transgressions of the (unwritten) rules of objectivity are: failure to report negative research outcomes, and deceptive presentation of an exploratory investigation as a hypothesis testing investigation (cp. 9; 1).
inclined, he can replicate the investigation.1 Again, it is possible to make recommendations on reporting techniques; but this still does not give us strict criteria for objectivity. As regards 'evaluation in the proper sense,' one can at most stress the general importance of objectivity and perhaps make some corresponding recommendations on interpretative techniques (cp. 9; 2). There is no real need for strict rules; for, once a sound and sufficiently objective report is available — one which is not itself contaminated by the writer's personal evaluation — subsequent tendentious evaluations cannot do irreparable harm. They will be open to criticism after publication, on the basis of all the published facts.
As far as the second phase is concerned, the requirements for objectivity have been extensively dealt with in Chapters 3 and 4 under the headings: logic and formulation. In this context, the antithesis 'objective-subjective' was embodied in the discussion of the principles of testability (4; 3; 1) and explicitation (4; 3; 4). In any event, the second phase need no longer detain us here. As for the purely deductive part of phase three — the inferential steps of types g1 and p1 (3; 2; 1) — these are likewise primarily subject to criteria of logic.
However, the third and fourth phases, comprising the experimental design and actual testing, are those in which problems of objectivity come most characteristically to the fore. Here the notion of 'objectivity' can often be strictly applied, in an absolute sense. Objective criteria can be set up for it; 'objective' techniques can be developed. Specifically, this can be done for: (1) the empirical specification of concepts or constructs, (2) the selection of the empirical materials for hypothesis testing, including the composition of experimental groups and sampling procedures, (3) the processes of observation, registration, and data processing, (4) the set-up of the investigative procedure (the research design and plan).
These four topics can be reduced to two. First, in the discussion of problems of objectivity we can do without category (4) if the other categories are given a sufficiently broad reading. Every decision of an experimental condition generally results either in an empirical specification
1 For practical reasons (space in journals, readability, and the like) this rule sometimes has to be relaxed. Its modified form, then, says that the report may indeed be concise, but that further, more detailed data (including the original materials) must for some time remain available for study by others who might wish to do so.
(1) or in a further specification (selection) of the set of data (2), or both. Secondly, in a hypothesis testing investigation — and that is what we are concerned with (but cp. 9; 1) — the characteristic feature of (3), i.e., of observation, registration, and data processing, is that these activities serve but one purpose, that of testing a hypothesis stated in terms of constructs. Every observation relevant to hypothesis testing either amounts to or contributes to the determination of the 'value' of a variable — whether quantitative or qualitative (cp. 7; 2; 2) — a variable which in turn represents a construct. Objectivity of observation, registration, and data processing (3), therefore, reduces to (1): objectivity of construct specifications. So we need no more than (1) and (2). Problems of objectivity and objective methods in the empirical specification and instrumental realization of constructs (1) will be dealt with in terms of the construct itself in 6; 2, and in terms of the observation process in Chapter 7. Problems of objectivity and objective methods in the selection of the sample materials against which the hypothesis is to be tested (2) will be discussed in 6; 3.
6; 2 FROM CONSTRUCT TO OBJECTIVE VARIABLE
6; 2; 1 Instrumental realization; definitions
The significance of empirical specifications and the role objective methods can play in arriving at them are best studied from the viewpoint of the instrumental realization of a construct. In our discussion, therefore, we shall be moving from construct to variable, and indeed to an operationally defined variable, preferably of an objective nature. Before we embark on this subject, we must define a few terms. First, at what point does a 'construct' or 'concept' — or what in common parlance is frequently termed a 'factor' — become a 'variable' in this process? The basic feature of a variable is that it varies or can vary. An empirical variable, in the behavioral sciences, is a factor whose variation is under scrutiny in a given investigation. For our purposes one more qualification must be added: we will not use the term variable unless the instrumental realization of the given construct is settled at least in principle.
If this is not the case, then we shall continue to refer to the 'concept,' 'construct' or 'factor' — of, e.g., hostility, intelligence, age. In the examples discussed in the preceding chapter, which were derived from BARENDREGT (1954), the instruments were: for assessing 'hostility' (harboring of hostile wishes), the above-mentioned Elizur index (5; 2; 3); for 'intelligence,' the Wechsler-Bellevue scale with the standard instructions and scoring rules; for 'age,' the registration of the patients' stated ages, in years. Once all these are known, we have sufficient information to speak of 'variables' in the sense just stated, even if not all the details of the operational definition have been settled. If more is known — further details of the manipulation of the instrument; the range of quantitative or qualitative values which the variable can assume; the scale used in scoring; the frequency distribution in the population, etc. — the variable still remains a 'variable.' The term 'variate,' frequently employed for a variable with a given or assumed frequency distribution (e.g., MAXWELL 1958), will not be used.1
The operational definition of a variable is not complete until detailed specifications of the instrument by which, in a particular case, its value is to be determined are available. This instrument, then, comprises a complete set of instructions specifying how empirical (experimental) data must be collected, registered, and processed to make possible the determination of the value of the variable. A variable is completely objective if all the instructions needed to determine its value are 'objective.' Objectivity can be defined, and must be demanded, for each separate step or operation: for the collecting of data and for the details of the experimental procedure, for the rules for eliminating those cases in which the variable is considered not to be applicable (cp. 6; 3), and for processes of observation and registration, such as classification, scoring, combination and calculation of outcomes, etc. Every decision in this process, every distinction, every assignment to a particular class, every mathematical manipulation must be 'objectively' specified. Strict stipulations for objectivity can be made here: a step or instruction is objective if its execution can nowise be disturbed by subjectivity, that
1 Nor need we be concerned here with variables stripped of all empirical content, as met with in logic, mathematics and statistics. What interests us in this and the next chapter is precisely such content, or at any rate the manner in which content can be instrumentally realized.
is, if no 'subject' (in the sense of a human judge) is required. In other words, it is objective if the instruction can be carried out by a clerk who is completely ignorant of the field in question. Yet another way of saying this is: an objective instruction can in principle be translated into a program of single-valued transformations for a determinate machine1 (cp. ASHBY 1957, Ch. 3 ff.). As is well known, such 'translation' — and more generally, replacement of the subjective human observer and judge by a 'machine' — is no longer a mere theoretical possibility but one that has increasingly found practical application. Photographic registration, tape and film recordings, and electronic devices are superseding the human observer; mechanical scoring devices the human judge; and computers the human data processor. Whenever such a complete replacement is possible, whether in reality or in principle, the procedure is 'objective.'
Deficiencies on the score of objectivity may be twofold: absence of explicit instructions, or presence of instructions which rely on a judgment, an assessment for which no completely objective standards are prescribed. In scientific research neither defect need be fatal or even dangerous: it may well be that the distinctions to be made are based on common sense or on universally accepted, tacit conventions or criteria. Also, it is often very difficult to treat all the details of the third and fourth phases exhaustively in the explicit instructions so as to eliminate each and every form of human judgment. Even in the case of a strictly precoded test or questionnaire (cp. 7; 1; 1), there may be difficulties in judging unforeseen cases where the subject has not completely followed instructions (cp. 7; 1; 3). In principle, however, preformulated objective solutions are possible for every detail. The requirement here advanced of the absolute objectivity of a variable is an ideal. It cannot always be fully realized but it must be approximated as closely as possible.
1 'If the procedure can be programmed for a digital computer, then it is completely objective' (GREEN 1961, p. 85). The definition in the text is open to two objections: 1. A probabilistic machine also is objective — but we have no call for it here (however, cp. 6; 3); 2. The program itself may be biased, may e.g. be based on a 'subjective' selection (have built-in preconceptions) — but this form of subjectivity will be taken care of in the next chapter (7; 3). In the technical sense intended here, a machine program impaired by built-in subjective failings is 'objective.' The essential point is that a machine program can be reproduced or published in its entirety, without losses, so that any criticism on the score of possibly inadequate instrumental realization can stand on a factual basis and be made publicly.
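By way of a present-day illustration (no part of the original argument), the 'determinate machine' criterion can be sketched in a few lines of Python. The rule table and keyed answers below are invented; the point is only that the instruction is single-valued: the same input always yields the same class, with no human judge in the loop.

```python
# A hedged illustration of an 'objective instruction': assigning a case
# to a category by explicit, exhaustive rules. The rules themselves are
# hypothetical; what matters is that no judgment enters the execution.

def classify_answer(answer: str) -> int:
    """Single-valued transformation from a raw answer to an item score.

    Returns 2 (correct), 1 (partially correct), or 0 (incorrect),
    following a fixed, published rule table.
    """
    normalized = answer.strip().lower()
    if normalized == "isosceles":                        # exact keyed answer
        return 2
    if normalized in {"equal legs", "two equal sides"}:  # listed paraphrases
        return 1
    return 0                                             # everything else, by rule

# A determinate machine: identical input, identical output, every time.
assert classify_answer(" Isosceles ") == classify_answer("isosceles") == 2
```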
6; 2; 2 The evaluation problem as an example; goal, effect, measure
An important type of problem which is time and again found to hold a key position, in particular in applied areas, is how to evaluate objectively the effects of methods aimed at influencing human behavior. This 'evaluation problem,' as it will henceforth be referred to,1 provides a suitable illustration for a more detailed discussion of the problems of objectivity as they figure in the instrumental realization of constructs. The evaluation question in this sense is apt to arise in all forms of education, training, and schooling, of psychotherapy and counseling, and of propaganda and advertising. What is the effect of method A; what is attained by it? Usually, certain goals have been set in advance — albeit, often only vaguely formulated, initially — so that the question is: How far are these goals, these objectives, achieved? Frequently the problem takes the form of a comparison between two methods. Is method A better than method B? Can the superiority of a new method A, over an old method B, be demonstrated by an objective, comparative evaluation of the effects of A and B? ('Method B' can possibly be 'no method.') The question then is whether, compared with the base line provided by B, any significant effect of A (however small) in the desired direction is demonstrable.
Generally, the objectives and the desired effects may be considered explicitations of a — perhaps rather embryonic — theory. These explicitations can be stated as hypotheses, so the evaluation problem amounts to hypothesis testing. The crucial part of this problem is the instrumental realization of the concept of 'effect.' Usually it is known, in quite general terms, what the attempt to influence behavior is expected to achieve. Usually, too, there will be available a verbal description of the merits attributable to method A, according to the theory, and of the effects that can supposedly be obtained with it: e.g., 'better insight,' 'improved skill,' 'improved adjustment,' 'a more positive attitude towards some X,' or alternatively 'the will to act positively (e.g., to buy).' The problem, however, is how to devise empirical measures that are reasonably representative of such constructs in their intended sense, as well as objective.
1 The term 'evaluation' has a different meaning here than it has in connection with the fifth phase of the cycle (1; 4; 6). There it designated: the interpretative overall appraisal of the value of research findings for theory (or application purposes). Here it is: an assessment of the value of (the effects of) a method of influencing human behavior.
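Since the comparative question 'is A better than B?' amounts to hypothesis testing, it can, once an objective effect measure exists, be cast as an ordinary two-sample significance test. The sketch below assumes a recent SciPy and uses fabricated effect scores; it shows the form of such a test, not any actual result.

```python
# Hedged sketch: is method A better than method B on an objective
# effect measure? The scores below are invented for illustration.
from scipy import stats

effect_A = [14, 17, 13, 18, 16, 15, 19, 14, 17, 16]  # e.g., effect scores, method A
effect_B = [12, 14, 11, 15, 13, 12, 16, 13, 14, 12]  # same measure, method B

# One-sided test of the prediction 'A exceeds B' (Welch's t, without
# assuming equal variances in the two groups).
t, p = stats.ttest_ind(effect_A, effect_B, equal_var=False, alternative="greater")
print(f"t = {t:.2f}, one-sided p = {p:.4f}")
```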
The instrumental realization process can be roughly divided into two parts: (1) from objective or goal statements to effect specifications, and (2) from specified effects to objective empirical effect measures. We shall first concentrate on (1).
If the goal setting is to encompass the evaluation problem as a whole, more than one goal construct is generally needed to cover what the method of influencing behavior is expected to achieve. This means that task (1) consists of, first, choosing adequate goals and, second, specifying each of them in terms of behavioral effects. Moreover, their respective functions and relative importance must be precisely specified in advance.1 In many types of evaluation studies, these activities are likely to be carried out in group discussions among experts. They must decide on goal statements and, for each goal, arrive at a workable agreement on what it — 'the construct in its intended sense' (p. 172) — amounts to in terms of expected behavioral changes (effects). The task at hand is often not an easy one: the goal construct may be one of those lofty, ultimate terms, incorporating a 'philosophy,' which we are all ready to subscribe to but not prepared to be explicit about. The latter is, however, exactly what must be done and agreed upon: explicit behavioral effects or criteria must be identified which are regarded as significant, and in themselves desirable, deductive specifications of the goal construct in question. Needless to say, these effects must also be empirically determinable (objectively realizable).2
Examples of instrumental realization up to this point (1, above) could be taken from any of hundreds of educational evaluation studies. However, one instance will suffice: an experimental study of the merits of different systems of teaching plane geometry to 7th and 8th graders (cp.
1 It is true that this is essentially still a matter of theory and hypothesis formation (phase 2). It should be remembered, however, that this aspect is never entirely absent in instrumental realization: empirical specifications of constructs and concepts contribute to concretizing the hypothesis and shaping the theory. The evaluation problem, for that matter, has been taken up as an example expressly because of the scarcity, in this area, of established theory to start from. Here, the process of instrumental realization must therefore go all the way, from vagueness to precision.
2 Underlying the latter restriction is the sound idea that it is hardly meaningful to set a goal if one does not know how one can, in a given case, determine to what extent it has been achieved. One of the major advantages of projects involving objective realization of effect measures — apart from making evaluation possible — is that they can provide a more realistic ('operational') setting for the debate on goals and aspirations (in education, therapy, etc.).
DE GROOT (1957, 1964) 1968; also WIEGERSMA 1960b). In order to assess those merits, a set of educational objectives deemed significant in this particular area had to be singled out for instrumentation. But what are the 'educational objectives deemed significant' in the case of plane geometry at the seventh grade level? There is no generally accepted theory; there are only widely divergent views of what the objectives of geometry should be. Even if the radically negative opinion that plane geometry serves no real purpose and ought to be replaced altogether with, for instance, set theory is left aside, there remains a host of different goal conceptions. Objectives may be stated in a limited sense, strictly geared to the program itself (learning how to solve certain types of problems), or they may be seen as of wider scope (e.g., learning how to think, how to analyze a problem, how to apply systematic thinking techniques and general methods of problem solving, etc.; e.g., BOS 1955). Alternatively, one may put the emphasis on the spatial aspect: geometry as a means of developing a structured 'spatial awareness' (e.g., VAN HIELE 1957); finally, geometry may be regarded as the first introduction to a scientific and partly formalized deductive system. There are a great many opinions and deep convictions on the subject and but little agreement. It will be clear, however, that many of these broader objectives are based on the assumption of transfer to other areas, in a later stage of life, of the knowledge acquired in geometry classes. This assumption, however, is little substantiated (cp. e.g., WOODWORTH and SCHLOSBERG
1955, p. 829); furthermore, it hypothesizes effects far removed from the actualities of geometry instruction in the seventh grade. For the construction of an effect measure it was therefore considered preferable to stick closer to home base and there to look for commonly agreed on factors.
The outcomes of the ensuing discussion and analysis were quite simple. Since geometry is certainly not intended to be a drill subject, every instructor will accept that a, not to say the, major purpose of his teaching is to instil in his students a degree of 'insight' into the discipline of geometry itself. In addition, such minimal insight is a necessary precondition for any possible transfer effects. Also, it may be assumed that teaching is meant to stimulate, to excite 'interest,' in this case 'pleasure in geometry.' That this, too, is a basic educational goal is hardly questionable, even if actual classroom practice sometimes tends to overlook the motivation aspect and may even produce the reverse effect. Again, without minimal
motivation of the student, further-reaching transfer effects can hardly be expected. These two objectives, 'insight' and 'pleasure in geometry,' were considered sufficiently specific and specifiable for the construction of effect measures.
6; 2; 3 'Insight gained': an objective instrument
As regards the second part of the instrumental realization process (p. 173: from specified effect to objective measure), two tests had to be constructed: first, a suitable achievement test to measure 'insight gained,' and second, an attitude test to measure 'interest generated' or 'pleasure in geometry.' We shall not now discuss in detail how this was done (see WIEGERSMA 1960b; DE GROOT (1957, 1964) 1968); nor the guidelines for, and technical details of, test construction in general, which can be found in many textbooks (e.g., LINDQUIST 1959, EBEL 1965, DE GROOT and VAN NAERSSEN 1969). Only a few remarks will be made on the first task — the harder of the two — designing a set of questions to assess the geometrical insight gained by pupils after one year of geometry instruction.
The main problem was, again, that each of these questions had to be both relevant to the purpose envisaged (as regards content) and objective (in form). These two requirements are often difficult to meet simultaneously (6; 2; 4). Apart from its psychometric aspects (to be analyzed in Chapter 8, in particular 8; 2), 'relevance' can be viewed primarily as a matter on which, in evaluation research generally, experts must agree. That is, they must — for each question or item — agree that it does contribute to the measurement of the construct as intended (the purpose envisaged). 'Objectivity,' however — our present concern — can be precisely defined, for separate questions as well as for a test series as a whole. What does it, in concrete terms, imply? What requirements must be met by a test in general, and by this test of insight in particular, to make it an objective instrument?
As we have seen, there must be clearcut, unambiguous instructions covering every detail of how the test is to be administered, scored and processed, such that a machine or a well-instructed clerk, although familiar neither with geometry nor with testing techniques, can in principle carry through the procedure. This implies, first, that there must be a fixed instrument in the narrower
sense, in this case a test form with questions. Then, there must be unambiguous instructions as to applicability, i.e., for what students, of what grades, after what geometry instruction, at what time in the semester, the test is to be administered and the 'insight' variable measured. Next, directions for the experimental procedure: detailed instructions for the teacher-experimenter on the circumstances and manner in which the tests are to be presented; for each test booklet a complete mimeographed explanation, which the experimenter is to read out to the class literally and leisurely, without any additions of his own; stringent instructions on how to deal with any questions and on the time available for each test, etc.
Strict directions must be given for the scoring of the protocols (the answer sheets). For each item of each subtest there must be objective rules specifying which answers are correct, which false, and possibly which are to be discounted (neither correct nor false). If there are intermediate scores between (completely) correct and (completely) incorrect — which from the viewpoint of objectivity tends to be a questionable method — the scoring instructions must contain unambiguous directions for the item score (e.g., 0, 1 or 2) to be awarded for a given answer. Again, all these instructions and directions must be so stringent that they can, if needed, be converted into a computer program. The same applies to instructions on how to combine (weight) item scores and possibly subtest scores into a total score for 'insight' as operationally defined. Finally, there must also be objective instructions on the manner in which these scores are to be interpreted (or the scale in which they are to be read, cp. 7; 2; 2). Only when all these steps have been objectively dealt with can we say that the objective instrumental realization of the 'insight' construct is complete.
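All of these directions (fixed key, item scores of 0, 1 or 2, weighting of item and subtest scores into a total) are of precisely the kind that can be 'converted into a computer program.' The following sketch, with an invented key, invented weights, and an invented protocol, merely illustrates what such a program might look like.

```python
# Hedged sketch of an objective scoring instruction for an 'insight' test.
# SCORING_KEY maps each item to the scores awarded for listed answers;
# unlisted answers score 0 by rule. WEIGHTS combine subtests into a total.
SCORING_KEY = {
    "item_1": {"a": 2, "c": 1},   # hypothetical key: 'a' correct, 'c' partial
    "item_2": {"b": 2},
    "item_3": {"d": 2, "a": 1},
}
SUBTEST_OF = {"item_1": "construction", "item_2": "proof", "item_3": "proof"}
WEIGHTS = {"construction": 1.0, "proof": 2.0}  # hypothetical subtest weights

def insight_score(protocol):
    """Total 'insight' score for one answer sheet, by fixed rules only."""
    subtest_totals = {}
    for item, key in SCORING_KEY.items():
        answer = protocol.get(item, "").strip().lower()
        points = key.get(answer, 0)              # neither judge nor guess
        sub = SUBTEST_OF[item]
        subtest_totals[sub] = subtest_totals.get(sub, 0) + points
    return sum(WEIGHTS[s] * t for s, t in subtest_totals.items())

print(insight_score({"item_1": "A", "item_2": "b", "item_3": "x"}))  # -> 6.0
```

Every step is a single-valued transformation; a clerk, or a machine, could score the protocols without knowing any geometry.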
6; 2; 4 Objectivity and relevance
Fulfillment of the objectivity requirement, as encountered in the instrumental realization of constructs, is a technical matter. It is of course possible to make 'objective provisions' for everything — by using objective techniques (cp. Chapter 7). There is a risk, however, that technical perfection of the instrument will be achieved at the expense of content, at the expense of 'reasonable coverage of the intended meaning,' of the relevance of what is eventually measured with such perfect objectivity.
In the process of constructing the geometrical insight tests referred to
above, this dilemma of objectivity versus relevance manifested itself in the fact that the most relevant ideas for subtests were found to be most resistant to complete objectification. Technically, it would not have been difficult to cast them in a strictly objective mold, but that would have meant tampering with what was then considered the essence of the insight test. For instance, it was considered important to test whether a student had really grasped the principles underlying the construction of triangles, and it was thought that the best way to check this would be to have him actually carry out some. But then, what about objective arrangements for judging his products — botched drawings and the like? Or, to take another example of a subtest considered highly relevant, a geometrical proof can be broken up into successive steps, for each of which completely objective questions can be devised. But inevitably one will then be testing something different from the student's ability to furnish a mathematical proof, that is, his insight into the proof as a whole. For information on how the dilemma was solved for these two (out of eight) subtests we refer to the report in question (cp. also Chapter 7). Our purpose here was merely to illustrate the problem.
The dilemma is pointed up even more sharply by other evaluation problems. What do psychotherapy and counseling purport to achieve? How can objective effect measures be devised for improvement in mental health, social adjustment, the lessening of inner tension and problems? If, prior to therapeutic treatment, there were objectively observable symptoms, their disappearance would undoubtedly provide an objective indication. Its relevance, however, is limited. A patient who no longer displays certain manifest symptoms can still be as 'neurotic' as when he did. Also, symptoms may shift. Moreover, they are often not present in any clear, objectively ascertainable form, not even at the start of therapy. The demands of relevance and objectivity are very difficult to reconcile here. True, the work of Rogers and his school (see in particular ROGERS and DYMOND 1954), for instance, has shown that the problem is not utterly unsolvable. The construction of one of their instruments was based on the simple idea that anyone who is maladjusted and needs treatment will at the very least feel some 'inner discord.' Therefore, one effect of psychotherapy should be a decrease in inner discord — or, a 'diminished discrepancy' between the subject's report of his 'ideal self' and his 'self-concept.' This construct was given instrumental realization by means of the Q-sort technique (STEPHENSON 1955), and with this
instrument one of the main hypotheses derived from Rogers' theory of the psychotherapeutic process could be tested. Apparently, it is not impossible to construct instruments that are both relevant and objective, even in this area. Nonetheless, the evaluation problem — and the dilemma — has been solved only partially (cp. also MEEHL 1955; SNYDER 1958; BARENDREGT 1961, Ch. 11).
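One common operational reading of the 'diminished discrepancy' construct, offered here as an illustrative sketch rather than as Rogers' exact procedure, indexes the self/ideal discrepancy by the correlation between the two Q-sorts: the higher the correlation, the smaller the discrepancy. The pile placements below are invented.

```python
import numpy as np

# Hedged sketch: self/ideal discrepancy from two Q-sorts. Each sort places
# the same statements into piles 0 (least like me) .. 8 (most like me).
self_sort  = np.array([2, 5, 7, 1, 4, 8, 0, 3, 6, 5])
ideal_sort = np.array([6, 5, 8, 0, 5, 8, 1, 2, 7, 4])

r = np.corrcoef(self_sort, ideal_sort)[0, 1]
print(f"self-ideal correlation: r = {r:.2f}")
# One conceivable effect measure for therapy: r(after) - r(before);
# a rise in r would index a 'diminished discrepancy.'
```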
Even more difficult is the construction of a suitable instrument for the evaluation of attempts to influence people's 'attitudes,' for instance through a training program or a course in industrial 'leadership' or, in a completely different sphere, through propaganda or advertising. The ultimate objective of such programs is to effect a change in human behavior: improved leadership, increased consumption, and so on. This kind of behavior, however, is frequently subject to such a time lag, is so difficult to pin down (objectively), and is, besides, dependent on so many other factors that the idea of constructing an 'ultimate criterion' must often be relinquished. Here, too, recourse is often had to attitude tests. These are objective questionnaires used for determining the general tone — from negative to positive — of a subject's responses to some particular institution or issue. The questionnaire items may relate to the pros and cons of geometry, practical leadership problems, household practices, beer drinking, and what not — according to the focus of the attempt at persuasion.
Attitude tests are pleasantly handy and objectifiable instruments that can serve also for the evaluation of effects. But they have one glaring weakness: they elicit only verbal responses. In many cases one would like to measure the 'real' behavior, or the 'real disposition' toward a particular type of behavior, or another background factor; but one has to be content with less relevant data: verbal responses to verbal questions. Thus, for instance, an attitude test administered after a course for supervisors in industry can show how well the participants have absorbed and can now reproduce the human relations lesson obliquely introduced into the course. But this does not guarantee that they will act upon it when the attitude in question is to be manifested in a real situation.
The problem is not so much that the responses are insincere. Sincerity can often be ensured by taking precautions in the selection and wording of the questions and the administration of the test (e.g., anonymous, away from the context of classes or persuasion efforts). Children and adults as well generally like to express their feelings and opinions frankly, if there
are no strings attached. Filling out a test form, however, remains an essentially different activity from, for instance, voluntarily taking extra hours of geometry, or restraining a sharp reaction in a supervisory capacity, or consistently buying certain products.
In other areas besides evaluation the same problem occurs. Objective instrumental realization is time and again found to be especially difficult for those central concepts that one would particularly like to come to grips with: anxiety, repression, neuroticism, adjustment, democratic behavior, social class, status, role, set — just to pick a few examples at random. If one is unwilling to water down the objectivity requirement, one often has to be content with operational definitions that are rather weak on relevance.1
In the behavioral sciences we are time and again confronted with a tension between, or rather a mutual contrariness of, objectivity and relevance. Not too many years ago this contrariness still presented itself as a real dilemma: one had to choose between the objective and the relevant. There were not only two kinds of concepts, but two kinds of theories (philosophies, schools) as well: on the one hand, important but vague and not objectively realizable ones, and on the other hand, precise and objectively realizable ones that were not relevant, at least not to the questions to which one would like the behavioral sciences to provide answers. Psychology, for instance, could be divided into, on the one hand, an exact, objective, experimental psychology, which in the laboratory studied problems that were removed from real life, peripheral and, according to many, not 'relevant' — and, on the other, a, or rather many, non-experimental psychologies (psychoanalysis and others), no doubt dealing with 'relevant' problems and concepts but elusive from the viewpoint of objectivity. In sociology, economics, anthropology there were likewise divergent schools: on the one hand exact ones, restricting their domain to measurable data, on the other, 'wide band' schools of thought, given to verbal description and imprecise theorizing.
The dichotomy is still there, and so are the 'schools.' The polarity of objectivity and relevance is still with us. But the more modern examples
1 In this book generally, and in the present context in particular, the term 'relevance' is used in a loose general sense, not as a technical term. What is meant by such an expression as 'not very relevant' is that — for a variety of reasons — one would really rather measure or categorize something else, which one holds more important, than the variable actually obtained.
cited above show that there is no longer any reason to consider the problem incapable of solution. There has been a great deal of change in this respect: substantial progress has been made; the tension between the poles has appreciably decreased. A veritable arsenal of techniques and aids has been devised and introduced in the last three or four decades: for the objectification of data processing methods, of investigative procedures, and above all for the objective instrumental realization of refractory concepts — without undue loss of relevance. A number of results and aspects of this breakthrough will be briefly discussed in 6; 2; 5 and Chapter 7.
6; 2; 5 Development of instruments
If one compares the present situation in the behavioral sciences with that of 20, 30 or 40 years ago, a striking difference is that now far more standardized objective instruments of fairly general applicability have become available. Most spectacular, of course, has been the test explosion in the last fifty years. Escalated by huge testing programs of two world wars, the test movement has more or less overwhelmed the United States of America and gradually come to flood England and the Continent as well — where, for that matter, it had originated (e.g., BINET and SIMON 1908). For detailed information, the reader can now be referred to scores of handbooks (e.g., CRONBACH 1960); for critiques, to thousands of test reviews (e.g., in BUROS' Mental Measurements Yearbooks 1938, 1941, 1949, 1953, 1959, 1965).
By no means all the published, standardized mental tests, however, meet the criteria that must be set for a good instrument (Ch. 8). While objectivity is technically taken care of for the large majority of them, there is much less reason for enthusiasm about the relevance of many of these tests — to their own alleged purpose. There exist objective and relevant tests, however, in particular in the area of cognitive abilities; in psychology, developments in the field of intelligence (or general mental ability) have always served, and still serve, as a paradigm of successful instrumental realization.
America has led the way, again, in furnishing the social and behavioral sciences with instruments for measuring variables not only of individual persons but of other systems or objects as well. These instruments represent widely divergent constructs: indices for the 'readability' of texts (FLESCH 1949), standardized rating scales for the 'social class'
of citizens in a Western country (WARNER et al. 1949), a method for measuring the (connotative) 'meaning' of a concept (OSGOOD, SUCI and TANNENBAUM 1957), to mention but a few. Another important category of instruments are those devised for the measurement of higher order constructs, in particular of relationships among variables. The development of the art of data processing and the use of computers have given rise to the instrumental realization of a host of causal, structural, and other relational constructs of various types: the measurement of 'change,' 'distance,' 'congruence,' 'discrepancy,' 'components,' 'dimensions,' 'latent structure,' 'underlying factors,' 'intervening variables' — not to mention the more common constructs and correlational variables of the reliability, validity, and 'internal efficiency' types (see Ch. 8).
Even more important than the availability of large numbers of ready-made instruments and operational definitions is the fact that researchers have greatly improved their skills in instrument construction and experimentation. In the behavioral sciences, in particular in field and applied investigations, instruments must often be constructed ad hoc. In that case, the instrumental realization of the construct may have no pretensions beyond a particular hypothesis testing experiment (or applied goal). The experimenter's art consists largely of forcing upon his experimental subjects — or more generally on his data — certain choices between a few pre-set alternatives without causing undue loss of relevance. A fine example of this art of devising objective ad hoc instruments can be found in the way Little and Cohen provided instrumental realization of the 'overambitiousness' toward asthmatic children shown by their mothers: they arranged an experimentally controlled setting in which the mothers predicted how well their children would do in an aspiration-level test (LITTLE and COHEN 1951).
Other examples can be cited from experimental social psychology, where it is quite common for constructs relating to small-group behavior to be given the sort of instrumental realization that is uniquely applicable to one type of experimental situation. Thus, in certain experiments on communication patterns the subjects are made to communicate only through written messages. Now the 'amount of communication' of subject A with subject B is defined simply as the number of messages sent by A to B within the (set) time limits of the experiment (BAVELAS 1950). Outside the laboratory this is clearly not a feasible proposition.
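Bavelas' definition is so strictly objective that it translates directly into a counting rule. The sketch below, with a fabricated message log and an assumed time limit, makes the point.

```python
# Hedged sketch: 'amount of communication' of A with B = number of
# messages sent by A to B within the set time limit (after BAVELAS 1950).
# The log entries (sender, receiver, minute) are invented.
TIME_LIMIT = 20  # minutes; hypothetical

log = [("A", "B", 3), ("B", "A", 5), ("A", "C", 8),
       ("A", "B", 12), ("A", "B", 24)]  # the last message falls outside the limit

def amount_of_communication(log, sender, receiver, limit=TIME_LIMIT):
    return sum(1 for s, r, t in log if s == sender and r == receiver and t <= limit)

print(amount_of_communication(log, "A", "B"))  # -> 2
```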
Within the laboratory walls, however, instruments (variables) of this kind enable the investigator to run precisely designed experiments in which general theoretical relationships can be tested. According to the standards of methodological strategy, these same hypotheses are later to be taken up anew in, less 'precise' but more 'realistic,' complementary field research — in which constructs like 'amount of communication' will be newly operationalized to bring them more in line with real-life situations.
In applied areas, also, construction of the requisite objective instruments has more or less become commonplace: from teacher-made achievement tests to, for instance, the short-term instruments motivation researchers are apt to need. At least some of these instruments, however ephemeral their nature, can stand the tests of objectivity and relevance-by-agreement.
In many European countries, too, this technical, 'instrumental' evolution in the behavioral sciences is in full swing — although the lag behind the United States is still substantial. Psychology generally seems to lead the way, followed at some distance by education, sociology, political and other social sciences — all of which are in process of becoming more and more empirical. It should be noted that Europe's leeway might have, apart from obvious disadvantages, the compensatory advantage that approaches that lead nowhere, and other mistakes made in America, can be avoided.
6; 3 OBJECTIVE SELECTION OF EXPERIMENTAL (TESTING) MATERIALS
6; 3; 1 Universe and sample
In addition to the instrumental realization of constructs, the selection of the data to be generated and analyzed is a recurring basic research activity, in which problems of objectivity are likely to arise. After specifications — usually particularizations — of the constructs in the given hypothesis have produced the operational form in which it is to be tested, one more preparatory step is required, namely the selection of the data in relation to which the operational hypothesis will be converted into a prediction. In other words, while hypotheses, even in their most highly specified operational form, relate to universes, predictions relate to samples, which are to be drawn from the corresponding universes in a specific manner (cp. 3; 4).
Let us return to the illustration cited in Chapter 5, Barendregt's investigation. Here the universe for the experimental group consisted initially, i.e., in the hypothesis as derived from the theory, of 'all asthmatics.' Only a subset was studied, however: hospitalized, male asthmatics who conformed to certain statistical requirements with regard to age, intelligence, and occupational level. This means that the specific, experimental hypothesis-as-tested relates to this subset or sub-population only: those asthmatics who conform to the criteria and restrictions just mentioned. Barendregt's experimental group of twenty patients thus was a sample from this 'experimental' or operational population.

If we take a closer look at the content of the operational hypothesis and of the actual prediction respectively, we must go even one step further. Strictly speaking, both related, not to patients as human beings, but to their Rorschach responses, specifically their scores on Elizur's hostility index. Once we take this step, we shift from the population of human beings to the universe of obtainable scores.1 This makes an important difference. In the latter formulation, the extent to which the scores are to be considered adequate as attributes of the person is obviously posed as a separate problem; the hypothesis in its operational form now refers exclusively to the scores as they are obtained. We know that both experimenter E and judge J can influence the outcomes (cp. 5; 2; 4 and 5; 3; 2); we also know that Rorschach indices are, generally speaking, none too reliable; but all this is irrelevant to the content of the hypothesis if we take a strictly operational view. The hypothesis then relates to the specific universe of all the Elizur indices obtainable through Rorschach experimenter E and judge J for all the asthmatics who conform to the above specifications of the operational population. We call this universe of obtainable scores the operational universe. The actual hypothesis testing amounts to examining a sample from this operational universe. Correspondingly, the purely statistical generalization does not go beyond confirmation of the (operational) hypothesis regarding this (operational) universe.
1 The terms 'universe' and 'population' are often considered interchangeable. In the present text, the term 'population' is preferably used to denote a set of individuals, whereas the set of measurable attributes under scrutiny — the Elizur scores of the asthmatic individuals — is generally referred to as a 'universe.' In other areas, too, the availability of two terms may be useful: one for the collections of the objects or systems whose attributes or properties we are studying, and one for the collections of determined or determinable (measured) attributes or properties themselves, belonging to those objects or systems (e.g., a 'population' of nails, but a 'universe' of measured or about-to-be-measured nail-thicknesses).
This example shows how important it is to analyze the planning of a hypothesis testing investigation in terms of universes (or populations) and subuniverses — with special attention to the effects of particularizations of the problem (5; 2; 2), of empirical specifications (operationalization) of constructs (5; 2; 3), and of the specific conditions of the experimental design (5; 2; 4). Only after and by means of such an analysis can the generalization steps of the confirmation and evaluation process — i.e., of 'the way back' — be clearly distinguished.

Now what precisely does 'drawing a sample from a universe' mean? Although the phrase may sometimes carry the connotation of random selection, this is by no means an indispensable feature of its scientific usage. A sample may also be composed on the basis of classifications by variables that are considered of interest for the investigation (stratified sample). Nor is a sample necessarily 'representative' of the universe from which it is drawn; a 'biased' sample is nevertheless a sample. If the sample concept is used in the context of scientific or at least generalization-oriented research, however, the idea — the implicit intention — is generally that the sample is 'representative' for the purposes envisaged. We thus arrive at this minimally delimitative definition: 'Sampling a universe' means selecting and earmarking a subset of elements from the universe for a closer examination, aimed at drawing conclusions not only regarding the subset itself but also with respect to the universe.

Not all cases where research data are selected fall within this definition. Suppose a research project encompasses a universe of a few dozen or hundred cases, say the population of Roman Emperors, and suppose that in a study of the limitations on their power four or five emperors are left out because they ruled too briefly or because they are for other reasons considered inappropriate cases; then there is indeed a 'selection of research data,' but no sampling. There is no intention of expanding the conclusions to the entire population; this is a case where the population itself is being restricted. Objectivity issues may also occur in this case; see further 6; 3; 4.
6; 3; 2 Diversity of universes
Many different kinds of universes may be distinguished. First, there is a quantitative distinction, according to the number of variables or characteristics for which each element of the population is considered to have a certain quantitative or qualitative 'value.' The number of variables is sometimes called the number of 'components' — when each element is regarded as a vector — or alternatively the number of 'dimensions' or the 'dimensionality' of the universe. The latter term is somewhat confusing, however, since it is often used in a different sense. In this book we shall refer only to the number of variables. Another formal distinction is that of finite and infinite universes. Empirical universes — e.g., universes consisting of the variable values of an (empirical) population — are of necessity finite; infinite universes have of necessity a hypothetical character. They are often used as a conceptual model, for instance in dealing with variables determined in experiments that can 'in principle' be infinitely repeated. The operational universe can then be envisioned, for instance, as the collection of all the outcomes of 'identically' designed experiments with other samples from the same population. Even when there are practical limitations on the recurrent drawing of fresh samples of 'objects' (or individuals) from the population — which is hardly ever the case in the physical sciences, but a regular feature in the behavioral sciences — the hypothesis testing procedure can be treated statistically as if it involved an experimental sample from an infinite universe of possible samples.1
1 Actually, the question whether repetition 'ad libitum' is practically feasible is hardly ever considered, not even in experiments which require, for instance, rare and slow-breeding animals, or very special experimental subjects (e.g., male hospitalized asthmatics, see above) whose usefulness is restricted to one experiment. In some cases the infinite universe is almost entirely fictitious, as for instance the universe of parallel test scores in test theory (see e.g., GULLIKSEN 1950) hypothesized to define a person's 'true score' (cp. 8; 3; 2).

From these empirically based but fictitiously infinite universes of events (e.g., experimental), instruments (parallel tests), variations in conditions, experimental subjects (human or animal), and outcomes (a subject's scores on replicate tests), it is but a short step to the theoretical universes of numbers or other abstract symbols, which are the statistician's domain. Theoretical universes, in particular theoretical distributions of the variables in such universes, are used as models for empirical universes, i.e., as models of what the distributions and derived parameters in empirical
universes would be like if certain theoretical suppositions (e.g., a null hypothesis and the fiction of unlimited repeatability) were strictly applicable (cp. 7; 2; 3). The use of such models makes possible the statistical treatment of hypotheses and sampling outcomes. Of course, a theoretical universe need not be infinite; the numbers 1 to 10 also form a collection of elements with a variable characteristic.

Finite empirical populations and universes can be subdivided into closed and open ones. For a closed universe the bounds are set; in principle, if not always in reality, the elements can be enumerated and counted. With an open universe, on the other hand, the definition of an element does not preclude the possibility of new cases that qualify as elements. The distinction between open and closed universes is not restricted to populations which are apt to be statistically treated as infinite (e.g., 'all asthmatics,' cp. 6; 3; 3, p. 190 ff.). It is also of importance with regard to finite empirical universes (or populations) that are expressly to be treated as finite — such as occur mainly in, non-experimental, investigations of existing materials. An example of a closed universe is the population of Roman Emperors mentioned in 6; 3; 1; another, the universe of (characteristics of) recorded medieval St. Nicholas legends in the western tradition (cp. 9; 2). If, however, the investigator's objective is to construct and test a political science theory on the relationship between, on the one hand, the degree of unanimity within the two major American political parties in the months preceding the party conventions and, on the other, the subsequent election results (DAVID 1960), then the universe is expressly open.

The researcher engaged in a subject like this may, for that matter, take a variety of positions, i.e., make a choice as to what is to be his universe, and what is to be his sample. To begin with, he can, if he so wishes, consider the (variables of the) political histories of the 64 earlier American presidential elections as the universe itself and treat this in a purely descriptive manner (cp. 9; 1; 4). However, as soon as his treatment becomes interpretative (9; 1; 6), and most certainly when it becomes exploratory (9; 1; 5), i.e., when there is an explicit search for general regularities and laws governing the phenomenon of 'American presidential elections,' he introduces an inductive element. The 64 cases are no longer a universe. But neither are they a (test) sample; they are the 'substrate materials' for hypothesis construction, implicit in the case of interpretation, explicit in the case of exploration (cp. 2; 2). The next
election outcome can now serve as a test case, and thus constitute a sample of size 1 (cp. 9; 2; 3). Alternatively, the 64 cases themselves can be the sample, notably when specific hypotheses on (American) elections have been obtained as operational specifications of more general hypotheses (e.g., concerning mechanisms operative in canvassing by two powerful rival groupings in democracies in general). These hypotheses will naturally have different empirical underpinnings; e.g., research on nonpolitical associations and clubs, or political science studies conducted in other countries. An intermediate form — which, unfortunately, finds scant application, cp. 9; 2; 5 — would be one in which the investigator deliberately develops his hypotheses on the strength of a section of the American election materials, e.g., a randomly selected half of the 64 presidential election histories (expressed in variables of the American, open, universe), and then proceeds to test these on the hitherto unstudied and hence 'new' other half of his sample.

These considerations, and particularly the last example, show how shifts in postures, views, and methods during an investigation — which often intrude unnoticed, almost at the drop of a word, particularly in an informal discussion of a problem — will constantly bring about changes in the relative positions of sample and universe (and hence of hypothesis formation and hypothesis testing). This is a general phenomenon. In scientific thinking, particularly in the planning of a hypothesis testing investigation, one can and must frequently adjust one's approach in terms of different universes and samples. In test construction, as another example, the items chosen may be regarded as a sample, subject to certain requirements of representativeness, from the universe of all possible items (cp. 8; 2; 3 under content validity). From the viewpoint of reliability, in particular in defining a subject's true score (cp. the footnote in 6; 3; 2), his obtained score may be regarded as a sample 'of size 1' from the fictitious universe of all his possible parallel scores. The subject himself, with his scores and possibly other variables, is an element of the population, the experimental group a sample from that same population; this population, again, can and often must be viewed and defined in a variety of ways: all males, all male asthmatics, all hospitalized male asthmatics approximately forty years old, living in Amsterdam, etc. Every restriction of the population, and likewise every choice of a restrictive experimental condition (e.g., of the experimenter), may again be regarded as one from a universe of all
similar restrictions possible (cp. 5; 3; 3). In statistical testing, the outcomes are regarded as results obtained with a sample from the universe of outcomes that are possible on the assumption that the null hypothesis is valid; finally, the end result (e.g., significance at the 5% level) may again be regarded as an element from the universe of all outcomes of similar (possible) investigations. So, the investigator must be able to make constant adjustments in his approach, in his conception of the universe-sample relationship, as his problem formulation and his experimental design evolve and take on new aspects.

However, if we confine our attention to the deductive line, which, once established, will in the case of a simple hypothesis lead directly from theory to prediction, the procedure can be described in fairly simple terms: as a series of choices (5; 1; 1), as a series of deduction and specification steps (3; 2), as a progressive operationalization of the hypothesis through instrumental realization of constructs (6; 2), as a process which leads from the hypothesis-as-derived (from the theory) to the operational hypothesis-to-be-tested (5; 2). Finally, as we saw in 6; 3; 1 — again using the asthma illustration of Chapter 5 — each of the specifications that lead up to the operational hypothesis can again be described as a specification of the universe. Only the (logically) final step, leading from operational hypothesis to prediction, i.e., the step in which the testing materials are definitively selected, can be described neither in terms of operationalization nor in terms of modifications of the universe. It has its own individual character and therefore poses its own peculiar objectivity problems. These will be briefly discussed below, first as regards sample selection in the statistical sense (6; 3; 3), and subsequently with regard to other problems posed by data selection (6; 3; 4).

6; 3; 3 Objective sample selection
The selection of a sample from a universe (or population) is made for the express purpose of generalizing certain pre-determined findings in the sample materials to the universe. Clearly, this selection ought not to be determined by subjective factors — with the possible consequence that those cases will be selected in which the prediction stands a better (or poorer) chance of being fulfilled. Experience has shown, however, that in somewhat complicated problem situations subjective and other systematically 'disturbing' factors may obtrude in various ways that are often difficult
to foresee (6; 1; 2). Hence, in this area, too, an objective technique is highly desirable, that is, an objective technique of sample selection. This purpose can be achieved by basing one's choice on a principle guaranteed to be entirely unrelated to the problem in question. The only 'principle,' however, for which such a guarantee obtains at all times is that of lot: the choice must be made 'by chance,' completely at random (HEMELRIJK 1961). Therefore, a frequently used technique of choice is that of random sampling. In concrete terms, in drawing a sample of n elements from a universe of N, one adopts a procedure in which every possible combination of n elements from the given N stands an equal chance of being chosen. Any statistical determination of significance — with a view to expanding sample findings to a universe — is in principle always based on the assumption of such a random sampling procedure.

The question of how to choose n from N, i.e., how to draw a sample of a given magnitude from a finite closed universe, is a standard problem in social science research. It occurs regularly in, for instance, public opinion polls or, to take examples from some entirely different areas, in quality control (spot checks of mass produced articles) or in statistical research on idiomatic usage (e.g., GUIRAUD 1954). Besides simple random sampling, the basic principle of which has just been described, textbooks of social research methods (e.g., SELLTIZ, JAHODA, DEUTSCH and COOK 1959; FESTINGER and KATZ 1953) recommend other methods. Statistical or practical advantages may sometimes accrue from dividing the population into 'strata' (e.g., demographically by county, religion, or ethnic grouping) and then drawing a sample from each stratum. According to the research goals, sample sizes may or may not be proportionate to the sizes of the strata in the population; subsequently, the outcomes for all the strata may or may not be combined. Sometimes a problem will call for the use of larger units or sequential procedures, in which for instance schoolgoing children are first grouped by school, or city-dwellers by family, block or district. The next step is then to take a sample of these larger units from the universe, whereupon either all the individuals per unit are incorporated in the sample (cluster sampling) or a sample is again taken from within the unit. Still another method is 'systematic sampling,' i.e., selection according to a principle other than pure chance, but one which, in the present situation, is also 'guaranteed to be entirely unrelated to the problem in question' (see above). Such guarantees will be all the more stringent as the principle in question is more 'gratuitous' with regard to the problem: one chooses, say, every tenth house on a street, every twenty-fifth name in a list, or one picks only people with family names that have an e as their third letter, or something of the kind. For statistical and other details of all these procedures, we can refer to the literature of the subject (e.g., KISH 1965).
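In present-day practice these selection principles translate directly into machine-objective procedures. The following minimal sketch (in Python; the function names, arguments, and sampling fractions are arbitrary illustrations, not taken from any of the studies cited) realizes simple random, systematic, and proportionate stratified selection:

    import random
    from collections import defaultdict

    def simple_random_sample(universe, n, seed=None):
        # Every combination of n elements from the given N stands an
        # equal chance of being chosen (simple random sampling).
        rng = random.Random(seed)
        return rng.sample(universe, n)

    def systematic_sample(universe, step, seed=None):
        # Every step-th element, from a randomly chosen starting point;
        # the principle is 'gratuitous' with regard to the problem.
        rng = random.Random(seed)
        start = rng.randrange(step)
        return universe[start::step]

    def stratified_sample(universe, stratum_of, fraction, seed=None):
        # Divide the population into strata and draw, within each stratum,
        # a simple random sample proportionate to the stratum's size.
        rng = random.Random(seed)
        strata = defaultdict(list)
        for element in universe:
            strata[stratum_of(element)].append(element)
        sample = []
        for members in strata.values():
            sample.extend(rng.sample(members, round(fraction * len(members))))
        return sample

For cluster sampling, the same simple random selection would be applied to the list of larger units (schools, blocks, districts) rather than to individuals, whereupon all individuals per selected unit are incorporated.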
Greater complications may occur in the case of open universes and open populations. To illustrate them, it is instructive to compare the problems met in hypothesis testing in psychology with those in public opinion polling. In the latter case, it is comparatively easy to state the population to which the generalization from sample findings will relate: the fourth criterion mentioned in 3; 1; 5 (stated empirical references) is not difficult to meet. Since what is wanted is a polling of opinion here and now, the population is simply, and literally, the 'population' of a given country, city, county, or at any rate a definite sector of it (e.g., all living adult males in Amsterdam). The question of how to draw a sample, n from N, from such a closed population is merely a technical problem. Psychology, on the other hand, claims to discover general laws that apply either to 'all people' or to specific sub-populations (e.g., 'all asthmatics'). In any case, its populations are not restricted to people living now, so they are 'open.' There have been people in the past and there will be other people in the future to whom, we trust, the laws discovered would have applied or will apply. But we know, also, that the human psyche evolves, that human behavior is dependent on the culture in which the individual lives, and that disorders, too, are apt to change with changes in the culture. Consequently, we are dealing with a population the elements of which can neither be counted at one particular point in time nor even be defined without dependence on cultural factors that may change. But we also assume that these changes can be neglected, at least up to a point, which we cannot sharply define. It is difficult to draw a representative sample from such a population.

There may be other complications, which can again be illustrated from Barendregt's investigation. In substance this was, as we have seen, a comparison between the 'hostility' manifestations of two groups, ulcus patients and asthmatics. In Barendregt's experimental set-up, the possible influence of intelligence was, along with some other factors, eliminated by statistically matching the two samples. If we assume, however, that
ulcus patients are on the average more intelligent than asthmatics — there are in fact some indications pointing that way — then the matching procedure itself has made it impossible for both experimental groups to be samples representative of their populations! In other words, realistic considerations with regard to confirmation (5; 1; 2) and practical considerations (5; 1; 3) led Barendregt to adopt an experimental design which in itself precluded statistical generalization to the populations of 'all' asthmatics and 'all' ulcus patients — supposing these populations could be clearly defined.

Evidently, there is but one solution to this problem. As a consequence of an experimental design chosen on valid theoretical and practical grounds, we must accept the fact that the operational universe to which the operational hypothesis relates has been narrowed down. In general terms, once a sample has been constructed under deliberately introduced and clearly definable conditions and restrictions, an operational universe is implicitly defined for which this sample may be considered a representative subset of randomly chosen individuals. Or, to put it differently again, to solve the problem of objective selection, the universe rather than the sample is adjusted to the requirements of statistical generalization. The essential questions then are, first, whether the restrictions and conditions introduced were really warranted, and secondly, how far the findings for the operational universe possess confirmation value with respect to the hypothesis as originally intended (generalization problem). For a discussion of these questions, however, the reader is referred again to 5; 2 and 5; 3; 3.

In practical problems of prediction and testing, where nothing is to be gained by adapting the (open) universe to the sample, the composition of suitable samples often poses an even more difficult problem. If, for instance, an investigator wants to study to what extent previous school grades or test scores have predictive value for successful study at a given academic institution (validation research in the sense of hypothesis testing), the universe he will have in mind may be roughly: all applicants in, say, the next ten years, on the (generally incorrect) assumption that neither the general run of applicants, nor the curriculum, will change materially. At the time planned for the investigation, random sampling is clearly impossible for this unmaterialized universe. What is possible, of course, is to take as the universe a random sample drawn from one year's entrants, or alternatively to investigate the entire one year universe
— frequently the best method. But then, afterwards, he will have to face difficult, if not really unsolvable, problems of confirmation and generalization.

A little easier to solve are problems like the following. In a clinical research project the objective may be to make a comparative evaluation of two forms of psychotherapy, A and B, for an identical population of, say, neurotics (cp. 6; 2; 2). How is one to compose, randomly or systematically, comparable groups (samples)? How are patients to be assigned to A or to B? In such problems, any form of taking into account, say, the seriousness of the patient's complaints, his social status or age, etc., is likely to introduce a contamination (cp. e.g., the criticism leveled at ROGERS and DYMOND 1954 in EYSENCK 1961). So the selection procedure must be as meaningless as possible, be 'blind' — as justice is blindfolded. Sometimes a practicable solution is to have one of the team members preselect the patients on objective grounds for inclusion or rejection, whereupon a separate randomizing procedure is used to decide for each patient whether he is to be given therapy A or B.
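A minimal sketch of this two-stage procedure (in Python; the eligibility predicate and record format are hypothetical) makes the division of responsibilities explicit: admission is decided on objective grounds, assignment purely by lot, blind to everything else:

    import random

    def assign_therapies(patients, is_eligible, seed=None):
        # Stage 1: preselection for inclusion or rejection on objective
        # grounds (the is_eligible predicate, fixed in advance).
        # Stage 2: a separate randomizing step assigns each admitted
        # patient to therapy 'A' or 'B', blind to severity, status, age.
        rng = random.Random(seed)
        return {patient: rng.choice('AB')
                for patient in patients if is_eligible(patient)}

If equal group sizes are required, one would instead shuffle the list of admitted patients and split it in half; the objectivity requirement remains the same.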
The practical and social problems encountered in sample composition are often the most difficult to solve. If, in the last example, the clinician in charge is privately convinced — rightly or wrongly — that some serious case X can more successfully be treated by A than by B, the chances are that he will not agree to X's inclusion in the B sample, regardless of how it was decided — an instance of a conflict between two ideologies which is not easy to resolve. It must, however, be resolved, in advance; too many studies in the past have been invalidated by lax compromises. As another example, if, in systematic sampling for an opinion poll, some of the inhabitants of, say, 'every tenth house on the street' are not, or pretend not to be, at home, a current practice is to take the house next door. But this involves the risk of the selection process favoring the stay-at-home, or the more accessible, more talkative, or more interested respondents. Even more clearly apparent are such problems in written questionnaires, which one cannot be forced to answer, or in testing sessions where attendance cannot be made compulsory. If participation is on a voluntary basis, problems of objectivity — response bias — are often almost insoluble. The only solution is to try and manipulate the conditions of participation in hypothesis testing investigations in such a way as to reduce refusals or non-attendance to a negligible proportion.
6; 3; 4 Objective elimination
Using elementary classroom settings in the Netherlands and the United States, VAN BUSSCHBACH (1952-1958) conducted guessing experiments aimed at demonstrating the existence of extra-sensory perception (ESP). The teacher, invisible to the children, was the 'sender.' At a speed indicated by the experimenter tapping a stick on the floor, she had to concentrate successively on one of three figures in a prescribed order. The children were made to guess which of the three figures the teacher had in mind: they marked their choice by checking the figure in question on printed forms. The forms contained, in varying order and arranged in columns of twelve, as many groups of three figures as there were taps. In its simplest form, the experimental question was whether the overall number of correct guesses by the children would be significantly in excess of what was to be expected on the basis of pure chance (i.e., one in three).
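In present-day terms the statistical treatment of this question is elementary: under the null hypothesis the number of correct guesses follows a binomial distribution with p = 1/3. A minimal sketch (in Python; the counts are invented for illustration, and the normal approximation used is adequate only for the large numbers of responses typical of these experiments):

    import math

    def excess_over_chance(correct, total, p=1/3):
        # One-sided test of 'more correct guesses than pure chance' under
        # the null hypothesis: correct ~ Binomial(total, p). Returns the
        # z-value and P(Z >= z) by the normal approximation.
        mean = total * p
        sd = math.sqrt(total * p * (1 - p))
        z = (correct - mean) / sd
        return z, 0.5 * math.erfc(z / math.sqrt(2))

    # Invented illustration: 21,000 scorable guesses, of which 7,350
    # (35 percent) are correct; a low positive excess, yet z is about 5.1.
    z, p_value = excess_over_chance(7350, 21000)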
In few fields are there so many objectivity problems and such treacherous possibilities of contamination as in this type of investigation. In many cases — though not in all — it has been found that investigators who believe in ESP (like Rhine, see RHINE and PRATT 1957) are likely to obtain positive results in experiments involving telepathy or clairvoyance, whereas those who are skeptical are not — without it being clear if and where contamination occurred in the former case. In this field the most rigorous experimental conditions must be maintained. The possibility of unwittingly given, and unwittingly but still sensorily perceived, signals or hints must be precluded altogether. Thus, the children must not see the sender (teacher) during the experiment, and neither must they hear her in any way. The signal for the next 'go' was therefore given by the experimenter (Van Busschbach), who, also, could not see or hear the sender during the experiment. Nor must the experimenter have any indication as to the order in which the sender concentrates successively on the different figures; any hypothesis he forms on this point must be futile.1

1 The possibility that the experimenter himself may be 'telepathically sensitive' and unwittingly transmit his guesses to the children — through sensory means even though he may possibly be quite unaware of this — cannot be entirely ruled out in this experimental setting. This, however, appears a considerably more complex supposition (which, anyway, is also based on telepathy or clairvoyance) than that of a direct telepathic contact between the teacher and her pupils.

It will be clear that the solution to this last problem lies again in randomizing the choice of the figures to be looked at. The teacher must
let her choice be determined by following an arbitrarily chosen series — naturally unknown to the experimenter and the subjects — from a table of random numbers. These have no (designed) order; and the experimenter knows that there is no rational way in which he can detect it. This shows that randomization can also be used as an objective technique for stimulus sampling.

More important in the present context, however, is the following problem, which presented itself in the analysis of the ESP data. It has again to do with objective data selection — although it might also be regarded as a question of coding (cp. 7; 1). Since the data from the children in one class, and, for the final processing, those in different classes, were put together, one item, i.e., one marked set of three, may be said to constitute an element in the sample (and in the universe) of responses. Some children, however, had not always followed instructions and sometimes checked two, or three, or none instead of one symbol. Is it a legitimate procedure simply to eliminate these elements from the sample and calculate the percentage of correct guesses from the remaining total? In this form the question appears simple, and indeed is not difficult to answer: there are no objections. But elimination of cases from a sample is at best a tricky business, since it is here that contamination will quite often intrude at the eleventh hour. In this highly controversial area in particular, where the effects, if they are genuine, are at best so slight that hundreds of trials are needed to produce significant scores, experiments call for special caution. Let us therefore take a closer look at the question.

'Three checks' or 'none' are easy enough to deal with. They are tantamount to no answer: the respondent provides no information about his choice. Elimination is therefore the only adequate solution. 'Two checks,' however, does provide a measure of information: the third figure is out. According to the null hypothesis — there is no ESP involved and correct guesses are purely accidental — there is a chance of two in three that this choice happened to be correct; so one could possibly, if the third one is actually false, score a half point. But elimination provides a simpler solution. One could reason that the respondent has not obeyed instructions for this item, and so has for all practical purposes supplied no answer. But — and this is the point — no specific argument is in fact needed: any elimination of any arbitrary number of elements from the sample is permissible — provided the elimination is in no way
experimentally dependent on the series 'sent.' If one so wishes, one could eliminate every fifth answer of a respondent, or all answers in which the third figure is checked, or two or three randomly chosen answers from each protocol. There would be little point in doing so, but what would be left could, under the null hypothesis, still be regarded as a random sample from the corresponding (infinite) universe. The essential condition is that the elimination must take place either according to an absolutely (machine-)objective principle (cp. 6; 2; 1) or, if no absolutely objective principle is available, be made by a person who, provided the null hypothesis is valid, can have no clue as to the series sent.1 In fact, this is the same condition that applied earlier to the team member in charge of selecting patients for inclusion or rejection from the investigation (6; 3; 3). What this requirement amounts to in practice is that the basis for elimination must be decided in advance. In any case, it must not be made the responsibility of the person who performs the scoring; not even if the key supplied to him is absolutely objective — since he may commit errors. It should be obvious that in scoring and computing the number of correct guesses all possible measures to ensure objectivity must be taken — double, independent scoring, preferably by clerks ignorant of the purpose of the experiment, or better still by machines. A person who believes in ESP could easily miscount or miscalculate in favor of the number of correct guesses. We do not know whether these stringent conditions were always fully observed in Van Busschbach's experiments — all of which, incidentally, produced results of low positive values that were, however, highly significant for the large numbers of responses available (see VAN BUSSCHBACH 1952-1958).

1 If he is 'telepathically sensitive' himself, he may of course be contaminated — as was mentioned in the previous footnote with respect to the experimenter.

Unfortunately, not all fields allow such strict objectivity requirements to be set for experimentation and data processing. In general, decisions on case elimination must be worked out and set down in advance; ad hoc decisions must be avoided. If, for instance, one is planning a lecture room questionnaire to study certain aspects of new students' attitudes towards their university, it is essential to determine beforehand which of the respondents are to be eliminated as 'unrepresentative' cases. Elimination of, for instance, the over-thirties, whose motivation is likely to be entirely different, should not be a difficult decision to take — but
what about an eighteen-year-old, who is a freshman at this university, but went to a technological college the year before? Or the twenty-year-old, who is indeed 'straight out of secondary school,' but who has worked in industry for one year? If such difficulties have been foreseen, the questionnaire will also contain items dealing with prior education, training, and employment, with their dates, etc. On the basis of these factors, decisions can be made to eliminate cases according to objective criteria set down in advance.
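Such criteria, set down in advance, amount to a machine-objective predicate that can be applied mechanically to every case. A minimal sketch (in Python; the field names and cut-offs are hypothetical):

    # Elimination criteria fixed in advance, before any answer sheet is read.
    def is_representative(respondent):
        # Machine-objective inclusion rule for the student questionnaire:
        # under thirty, no prior enrollment elsewhere, no year in industry.
        return (respondent['age'] < 30
                and respondent['prior_college_years'] == 0
                and respondent['years_employed'] == 0)

    def eliminate_unrepresentative(respondents):
        # Applied to all cases mechanically; no responses are inspected.
        return [r for r in respondents if is_representative(r)]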
However, such decisions cannot always be programmed in advance. If they are not, the rule is that the investigator who does the eliminating — say, of all over-twenties as unrepresentative cases — has not looked through the answer sheets. If this rule is not strictly maintained, if, for instance, the investigator has seen some of the materials and has read those 'particularly good remarks' (bearing out the hypothesis-being-tested) in the responses of one of these doubtful cases, an objective decision is already virtually impossible. Unfortunately, laxity in this respect has in the past often been a loophole for intrusion of many conscious or unconscious contaminating factors.

Sometimes advance specification of the elimination criteria is impossible, simply because a thorough study of the materials themselves is needed to find out which cases are 'unrepresentative.' This will occur particularly with complex, non-experimental materials, e.g., historical documents — where one must be an expert before one can make an intelligent decision about elimination. Frequently, investigations of this kind will include an entire universe rather than a sample. A fictitious example of this sort of situation has already been mentioned: how does an investigator eliminate the unrepresentative cases among Roman Emperors other than by a thorough historical study, which in itself makes it impossible for him not to be contaminated? Nevertheless, here too it is possible to suggest simple and reasonably objective methods. One may attempt to base the decision on objective data unrelated to the hypothesis or interpretation. The length of the reign may be one such criterion, the volume of available historical evidence another. If two or more criteria must be combined, an objective 'formula' may be used, etc. If even this is impossible, then an alternative procedure, unfortunately seldom used in these fields but not uncommon in clinical psychology (e.g., BENDIEN 1959), is to consult an uncontaminated fellow expert. He must be capable of judging the materials, but ignorant of the special purpose of the investigation. Or, to mention yet another possibility, the expert may
sometimes take the form of a textbook, or of an authoritative study of the field in question. Thus, the present author, engaged on research aimed at producing a psychoanalytic interpretation of medieval St. Nicholas legends (DE GROOT (1949) 1965), was able to decide which legends were to be considered 'characteristic of St. Nicholas' and 'important' in the western tradition, by referring to the opinion of a church historian (MEISEN 1931), whose views were certainly not influenced by any form of psychoanalytic thinking. It is especially these complex interpretative fields — cultural anthropology, psychoanalysis, clinical psychology — which present far more opportunities for applying simple objective methods and checks than is usually realized. We shall revert to this subject in 9; 2.
CHAPTER 7

OBJECTIVITY: B. DATA COLLECTION AND ANALYSIS
7; 1 OBJECTIVE QUESTIONS AND ANSWERS
7; 1; 1 The art of asking questions: precoding
We now turn to a domain of rather technical problems, again having to do with objectivity (and relevance): the art of collecting and processing data. Since in the context of a hypothesis testing investigation all data collection and processing serve to determine the value of a variable corresponding to a concept or construct (6; 1; 3), this art is part of that of objective instrumental realization (6; 2). However, the topic is sufficiently important to deserve a separate chapter. In this chapter, the emphasis will not be on details or techniques, nor on their mathematical formalization, but rather on principles and problems underlying these techniques and inherent in their use.

First, a terminological point: in what follows we shall prefer to speak of 'data collection,' and avoid the term 'observation.' The latter term is better reserved for direct behavioral and situational observation. In the present book it is used especially for the 'free,' creative and not necessarily objective forms of observation that are characteristic of hypothesis formative activities (cp. 1; 4; 2 and Ch. 2). Normative treatment of objective data collecting techniques under such a title as 'objective observation' might easily suggest1 that all observation must necessarily be objective in the technical sense of the word. This would amount to a recommendation to use, even in theory and hypothesis formation, only 'respectable' objective experimental outcomes and to close one's eyes to anything that
observation in a broader sense may produce — a recommendation which, unfortunately, appears to be followed in some circles, but which is not in keeping with the tenor of this book.2

1 E.g., HELEN PEAK 1953. What is more, the author, as appears from the contents of her article, understands by 'objective observation' also: processing, analysis, combination of items, variable construction, criteria for variables ('functional unity,' validity, reliability; see Ch. 8 of this book).
2 Generally, lack of attention to the more informal varieties of observation would appear to reflect a 'scientistically' overdrawn measurement ideology.

Our first topic is the methodology of objective data collection, or the art of asking objective and relevant questions. We shall be primarily concerned with questions asked of experimental subjects or respondents in scientific social research. In accordance with the usage prevailing for tests and questionnaires, we shall call such single questions items, irrespective of the variables involved. It may be that the answer to one item alone, possibly after coding, will produce the value of the variable to be determined; or it may be, as is generally the case with tests, that answers to various items must be combined to produce the variable. While in what follows the terminology and discussion will be chiefly attuned to (written) test or questionnaire variables, where items constitute the smallest elements, the reader should bear in mind that most of the following remarks will apply without modification to other types of questions and responses, for instance in an interview or an experimental situation.

It will be obvious that for items the simplest, and in fact the only radical, solution to the objectivity problem — cp. the 'machine' definition in 6; 2; 1 — is to be found in the precoded or closed question. Here the respondent is given the choice between a number of pre-formulated alternatives. There are no marginal cases, each response falling into a pre-established category, and there are no loopholes allowing the respondent to get around the necessity of choosing. Frequently the alternatives themselves will, if necessary, include an escape clause, for instance the well-known category of 'no opinion' in opinion surveys, or the residual category 'others' (in addition to stated possibilities) or 'none,' e.g., in questionnaires dealing with hobbies, study or reading habits, etc. The advantages of precoded questions from the viewpoints of objectivity and ease of manipulation are evident. But does not a system in which all questions are so rigidly formalized bring about a loss of qualitative information that might be important for many purposes? Cannot the format of the items cause too much loss of relevance?

For a long time many held that the precoded question in itself, compared
with the 'open' question (or situation), must of necessity be a kind of Procrustean bed. The risk that the guest's head would have to be chopped off to cut him down to size — the risk, that is, of a serious loss of relevance — undoubtedly exists, but it has proved less great and less insurmountable than many at first tended to believe. Thus, for instance, the multiple choice question designed for use in achievement tests was found to have its uses not only in testing simple factual knowledge (Who was the founder of psychoanalysis? 1. Adler; 2. Jung; 3. Freud; 4. Lewin) but also for complex questions which demand insight into the subject matter and a thorough analysis of the problem. Take this section from an American chemistry test:1

Ability to interpret cause-and-effect relationships

Questions 30-33. Directions: Each question below consists of an assertion (statement) in the left-hand column and a reason in the right-hand column. Select
A if both assertion and reason are true statements and the reason is a correct explanation of the assertion;
B if both assertion and reason are true statements, but the reason is NOT a correct explanation of the assertion;
C if the assertion is true, but the reason is a false statement;
D if the assertion is false, but the reason is a true statement;
E if both assertion and reason are false statements.
Directions Summarized

     Assertion   Reason
A    True        True     Reason is a correct explanation
B    True        True     Reason is NOT a correct explanation
C    True        False
D    False       True
E    False       False
1 The following sample question is reprinted with permission from the 1968 edition of A Description of the College Board Achievement Tests, published by the College Entrance Examination Board, New York. This booklet, which contains many illustrative examples of the different kinds of questions that are used in the Achievement Tests, is revised annually and is supplied without cost to high schools for distribution to students before they take the test. The booklet may also be obtained on request by writing to College Entrance Examination Board, Publications Order Office, Box 592, Princeton, New Jersey 08540.
30. The electrolysis of a solution of sodium chloride produces chlorine BECAUSE sodium chloride is an unstable compound.

31. A molar solution of sodium chloride is a good conductor of electricity BECAUSE such a solution contains a relatively high concentration of ions.

32. In an equilibrium reaction, if the forward reaction is exothermic, increasing the temperature will result in an increase in quantity of the product BECAUSE when a stress is applied to a reaction at equilibrium, the position of the equilibrium is shifted in such a direction as to oppose the stress.

33. When ammonia gas dissolves in water, the water acts as a Bronsted base BECAUSE the resulting solution turns litmus blue.
In other areas besides achievement testing, skills in transforming (sets of) open, unstructured questions into (sets of) precoded ones without undue loss of relevance have also improved. The principle is always to transform the unknown number of potential responses to an open question into a choice from a restricted number of possibilities. The questionnaire designer's art consists, among other things, in covering categorically the range of cases that may occur — in accordance with the investigative goal. To guard against omission of relevant categories for each question and/or relevant items from a sequence of questions, a pretest is often run with open questions. Thus, for instance, a number of so-called 'free' interviews are conducted prior to the definitive construction of the precoded interview form; or a series of 'open' written curricula vitae is invited before biographical data are elicited in precoded form. This is a good way of getting to know the principal response categories. Furthermore, by working with a relatively large number of items, it is possible, first, to achieve any desired degree of differentiation; second, to realize any form of composite response variables; third, to build all sorts of reliability and consistency controls into the instrument (cp. also 7; 3; 5).

The variety of currently established item types for experiments, mental tests, oral or written surveys, attitude scales, biographical questionnaires,
etc. is so great that it is impossible here to draw a comprehensive picture. One may, for instance, let an experimental subject or respondent choose one out of n (frequently 4 or 5) given alternatives (multiple choice), or pick the two (three) most suitable responses from n possibilities (cp. COOMBS 1953). One may, in addition, ask him to order the selected alternatives, or all given alternatives (n). Or, if two series of say n and m elements are given (n < m), one may ask him to match each element of the n-series with the corresponding one from the m-series (cp. e.g. SPITZ 1953), e.g., to match a series of concepts or qualifications with a series of case descriptions. Or, the subject may be asked to judge whether a given statement is true or false; or whether he is willing to accept it as descriptive of himself or of his opinions, attitudes or personal habits, i.e., whether he agrees with it or not. Most of the objective types of tasks just mentioned can be turned into one or more multiple choice problems, without loss in the information obtained. Given the advantages of format uniformity in tests and other item lists, this is one of the reasons why the multiple choice question is particularly popular. Alternatives need not be totally different answers to the question posed; they may solely differ in the degree to which a statement is or is not endorsed, or one pole or extreme opinion is preferred over the opposite (e.g., OSGOOD et al. 1957). The subject's choice then is between a few positions on a pre-formed scale — of quantity, intensity, relative preference, attitude (positive/negative), or certainty. Furthermore, there are many ways in which these various types of objective questions may be combined; this more than anything else accounts for the wide variety of forms. Rather than give examples of all these possible forms, we may refer the interested reader to the extensive literature on the subject (e.g., LINDQUIST 1959 and GUILFORD 1954 for tests; TORGERSON 1960 for psychophysical experiments and attitude scales; COOMBS 1953 and 1964 for a systematic treatment of question forms in a variety of areas).

In summary, it is not too much to say that the choice principle — precoding of the question material — is nowadays generally preferred, certainly where hypothesis testing is concerned. Apart from fashion and practicality, the victory is based on the experience that objective questions, if handled with sufficient expertise, are much more adaptable and versatile in carrying relevance than had been expected in their early days. Of course, not all problems have been solved through this development.
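What the choice principle does secure is machine-objective scoring: the respondent's choice is simply compared with a pre-established key, without any judgment intervening. A minimal sketch (in Python; the item merely echoes the factual-knowledge example given earlier):

    # A precoded (multiple choice) item: fixed alternatives, fixed key.
    ITEM = {
        'stem': 'Who was the founder of psychoanalysis?',
        'alternatives': ['Adler', 'Jung', 'Freud', 'Lewin'],
        'key': 2,  # index of the keyed alternative ('Freud')
    }

    def score(item, chosen_index):
        # 1 if the keyed alternative was chosen, 0 otherwise; there are
        # no marginal cases and no loopholes.
        return int(chosen_index == item['key'])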
The technical problem of how to devise such questions, frame them, test their utility, and eventually combine them into an adequate instrument, need not occupy us here, since textbooks on the subject abound (see also Chapter 8). But a number of questions remain. What are the limitations of precoding? Can it entirely replace free response techniques? Can it be used for measuring productive or creative skill ? Or, does reduction to a set of choices necessarily destroy the creative moment? Can situational variables also be precoded, both objectively and relevantly? However, these and allied questions can be better dealt with in the context of the following discussion of other methods of data collection and processing. The very fact that there are other methods to some extent provides an answer; precoding, in a wide variety of forms, can often be used successfully, but certainly not for all types of (hypothesis-testing) problems. The art of getting answers: coding
Just as important as the art of asking objective questions is the art of eliciting objective answers, in particular answers from materials already in one's possession; in other words: coding as a processing technique.1 This method is, for instance, indispensable whenever materials must be processed that were produced for purposes other than scientific research. These materials can then be made to answer precoded questions; that is, be treated according to a predesigned objective system in such a manner that for each specimen an objective score is obtained. This is what is meant by coding.

Suppose — as a fictitious example — that someone is interested in testing the hypothesis that artists are more egocentric than scholars or scientists. He plans to use for this purpose the collections of letters from a number of deceased people available to him. Before seeing the letters, he decides that he will use as one operational definition of 'egocentricity' the relative frequency in these letters of the words 'I,' 'me,' 'my' and 'mine.' Whether such a project makes sense does not concern us here; it will be clear, however, that he has thus prepared an objective code. Each word in the letters to be used to test his hypothesis provides an objective 'answer': it is either one of these four words (1) or it is not (0).
1 The actual coding is done, not before the materials are obtained, but when they are already present, in uncoded form; this is the difference from precoding. It is assumed, however, that coding in our sense does take place according to a pre-established system of categorizations and evaluations: it is not evolved on the strength of the sample (cp. 7; 1; 3).
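In present-day terms the fictitious code just described is a few lines of text processing. A minimal sketch (in Python; the tokenization is deliberately crude, and the point is only that the score is fixed entirely by the predesigned code, not by a judge):

    # The objective code prepared in advance: the four first-person forms.
    EGO_WORDS = {'i', 'me', 'my', 'mine'}

    def egocentricity(letter_text):
        # Relative frequency of 'I,' 'me,' 'my' and 'mine' among all words:
        # each word scores 1 if it is one of the four forms, 0 if not.
        words = [w.strip('.,;:!?()"\'') for w in letter_text.lower().split()]
        words = [w for w in words if w]
        return sum(w in EGO_WORDS for w in words) / len(words) if words else 0.0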
Such objective coding methods, albeit usually of a more complex structure, are used, for instance, in the statistical branches of the study of language (see e.g., HERDAN 1958) and, more often, in various fields in the behavioral sciences. Of course, there will be variations in the type of concepts that are instrumentally realized through such coding. The student of literature may be interested in an author's peculiarities of style or vocabulary, as revealed by a comparative study of texts. The philologist may wish to test hypotheses concerning characteristics of different languages, or perhaps — in an applied area — to determine linguistic parameters needed for the construction of translating machines. In the social sciences, particularly in communication research, a whole system of coding techniques has been evolved, which are generally subsumed under the term content analysis (cp. CARTWRIGHT 1953). So far research in this field has been concerned chiefly with verbal materials from the mass-media — newspapers, magazines, radio and television — in order to operationalize concepts like 'interest,' 'attitudes,' and 'prejudices' with regard to certain (e.g., political) issues. This restriction to certain topics and fields, however, is historically accidental rather than systematic. Wider applications of content analysis are possible in many fields and disciplines: history, sociology, the study of literature, and psychology.

The use of coding methods is, of course, not restricted to non-experimental, existing materials (see above). Non-precoded interview, questionnaire, and test materials, and other experimental protocols can be objectively analyzed for manifestations of attitudes, prejudices, or other value orientations, or for peculiarities of idiomatic usage which may be assumed to represent personal characteristics (cp. e.g., VAN LENNEP and HOUWINK 1955). To give another example, in machine-program-oriented studies of cognitive processes a recurrent problem is how to code 'thinking-aloud' protocols, if the search is for the subject's activities, methods, heuristics (see e.g., LAUGHERY and GREGG 1962, DE GROOT 1966, FRIJDA 1967).

One of the oldest examples of partially objective coding of open materials is the systematization of the Rorschach technique (RORSCHACH 1921; KLOPFER and KELLY 1946). One of the reasons why this technique has been phenomenally popular for more than forty years and has sparked so much research was undoubtedly its combination of an experimental, but at the same time 'free,' 'open' and psychologically stimulating
One of the reasons why this technique has been phenomenally popular for more than forty years and has sparked so much research was undoubtedly its combination of an experimental, but at the same time 'free,' 'open' and psychologically stimulating form, with systematic and partially objective response coding. Rorschach's example has been followed by many other designers and adaptors of 'open' tests. True, early expectations that this development might lead to the establishment of workable standardized personality variables have been relinquished. The indices derived from such free, open tests necessarily lack the degree of reliability and/or construct validity required for this purpose (see 8; 2 and 8; 3). This is not to say, however, that in certain cases a particular free test could not be the most suitable tool for constructing an index that adequately covers the construct. Its low reliability need not then be an insurmountable barrier to statistical hypothesis testing on fairly large materials; witness, for instance, Atkinson's T.A.T. studies (ATKINSON 1958) and McClelland's research on the achievement motive in various cultures (MCCLELLAND 1961).

Coding problems of minor import may arise in practically every type of investigation. The inclusion in a questionnaire of one simple non-precoded question, even if it calls for no more than a brief factual reply (e.g., What is your previous education?), immediately presents problems of classification and coding. Aside from responses falling into the anticipated, obvious categories (for Dutch university students, e.g., HBS A or B, or Gymnasium A or B), one must always reckon with a certain number of anomalies or borderline cases; e.g., persons with more than one diploma, or with foreign academic credits or diplomas. Just as in the case of elimination of unrepresentative cases (6; 3; 4) — which might be a solution in the instance cited here — objective coding demands that the system of categories be fully prepared in advance so that each case can be categorized by mechanical-objective methods (but cp. 7; 1; 3).

Although coding of 'free' materials does offer another important possibility for objective instrumental realization of constructs, and although technical skills in this area have improved greatly, it would be wrong to underrate the problems involved. As soon as we are dealing with more abstract, higher order, 'intrinsic' constructs, the polarity of objectivity and relevance again makes itself felt. Frequently it will be extremely difficult or time-consuming, sometimes it will be downright impossible, to design an objective coding (or scoring) system which still yields sufficiently relevant variables. Moreover, even when the effort has been rewarded by success in a particular investigation, the question will often arise whether this is a process ever to go through again. Quite frequently, the answer is (or
should be): No. Since the results are going to be expressed in objective variables, it is mostly preferable, the next time — if there is a 'next time' — to adapt the question form to this end: e.g., a multiple-choice Rorschach; a precoded questionnaire instead of a subsequently coded interview; a choice of four answers in a mathematics test instead of the requirement that the student provide the correct solution himself; etc. Given the great advantage the precoded form offers for most hypothesis testing as well as many applied purposes, one of the best uses of coding often is to prepare for precoding.

But there are exceptions, cases where even a frequently used instrument had better remain open, with subsequent coding. Examples are found, as expected (p. 203), in the area of creativity. Whenever the construct-as-intended depends essentially on what, how much, or in what way a subject produces — whether a thinking-aloud protocol, an essay, a mathematical proof, a story, a drawing, or some technical product — coding may be the only way to realize the construct instrumentally in an objective and relevant manner. Incidentally, the motive for keeping question forms open is mostly that objective coding according to a pre-established system (p. 203 ff.) — and thereby objective instrumental realization of the intended concept — is not considered possible. Actually, many Rorschach indices (such as Elizur's hostility index, cp. 5; 2; 3) and many content-analysis variables are not based on strictly objective coding instructions. There are certainly judgmental factors involved, even though these may be kept to a minimum. The problems posed by this 'semi-objective' type of instrumental realization will be discussed separately (7; 3).

7; 1; 3 Ad hoc coding
So far we have assumed that the objective coding system has been determined independently of the findings in the sample. In the case of precoding, this condition is obviously fulfilled; in the case of coding, however, this is by no means self-evident. It is of importance to observe closely the distinction between the procedures discussed in 7; 1; 2 and the ad hoc coding that will be our concern now. In practice, the condition — independence from the sample findings — will often mean that the person who decides on the objective method of categorization, coding and/or scoring to be used must not have seen the sample materials. The individual we are concerned with is the person who formulates the coding system, not to be confused with the one who
carries it out. We still assume that no judgmental procedures are used (cp. 7; 3) and thus that the system itself is objective, that is, could be transformed into a machine program. In the example of the private letters used to test the egocentricity hypothesis (7; 1; 2), the idea to set up a code involving a count of the words 'I,' 'me,' 'my' and 'mine' must not have taken shape after perusal of the letters. If that was in fact the case, the hypothesis is no longer being tested against 'new materials' (cp. 1; 4; 5); positive outcomes do not provide new information, and the risk that accidental features of this particular collection (sample) were capitalized upon can neither be calculated nor excluded.

In experimental hypothesis testing, it will in general be possible, and indeed necessary, to observe this independence condition with the utmost rigor. This implies that advance formulation in terms of precise, objective instructions must take place for: the sampling procedure (6; 3; 3); the criteria for possible elimination of certain cases (6; 3; 4); the manner in which for each case — i.e., for each protocol or unit of the materials — characteristics are to be determined and their subsequent classification effected (7; 1; 2); and, if there has been no precoding (7; 1; 1), for: the operational definitions of all relevant situational and subject variables (cp. also Ch. 8).

But this ideal state of affairs cannot always be realized. Even in laboratory experiments, it may not be possible to set down the details of the instrumental realization of all the constructs in advance. In field research projects, in questionnaire surveys, and in the analysis of existing, non-experimental materials in particular, there may be even greater difficulties. It may be that certain details could not be arranged, or at any rate have not been arranged, in advance. Most important, there may sometimes arise a necessity to depart from prior arrangements. In such cases one is simply obliged to resort to objective ad hoc coding.

The simplest cases are those of reclassification of the sample materials for a characteristic that had already been 'measured' objectively under the original research plan (see 7; 2; 2). Coding for 'religion' may for instance have provided four categories: Roman Catholic, Reformed, Lutheran, or No religion — but in the course of processing it is now decided to distinguish only 'RC' and 'non-RC' because of the unforeseen small number within the sample of cases in any one of the last three
classes. Or, in a laboratory project, the duration of certain processes has, according to plan, been recorded in minutes; but it is decided, on the strength of the distribution found in the sample, to define and use only 'short' and 'long' periods — a change that may perhaps lead also to a different statistical test being used than the one originally envisaged in the design. These sorts of ad hoc decisions are very common.

Simple and apparently trivial as such decision problems may seem, they nevertheless deserve special attention — as do decisions to eliminate certain cases from the sample (6; 3; 4), which might indeed be numbered among the coding decisions. Just as in the latter case, here too the change of classification introduces a forgotten or unforeseen 'factor,' a modification in an operational definition. And here, too, there is again the risk — even if the modification is unexceptionable from the viewpoint of 'mechanical objectivity' (6; 2; 1) — that in the new ad hoc version of the classification procedure an unidentifiable contamination has been built in. The emphasis is here on 'unidentifiable' (cp. 1; 3). Since distributions of raw data are rarely given in published reports, the influence of ad hoc modifications cannot be retraced if not expressly reported as such. The danger is, naturally, that such a decision will be taken not (only) on the strength of distribution characteristics of the one variable whose operational definition is being modified, but (also) because the researcher has a, perhaps vague, hunch that 'this better suits his book.' If, for instance, the researcher already has the impression that nothing in the data bears out a prediction, likewise contained in the hypothesis, of a difference between Protestants and the No-religion group, while the Catholic group clearly occupies a divergent position, then the combination of the two former groupings into 'non-RC' is suspect and misleading. The same applies to the example of the private letters. The given operational definition of egocentricity ('I,' 'me,' 'my,' 'mine') no longer represents an unbiased choice from among a number of possibilities, nor can it be changed without prejudice (e.g., by taking account only of the frequency of 'I'), once the letters have been read. It is these simple decisions of detail that are particularly vulnerable to contamination, even when the researcher is acting completely in good faith.

The methodological recommendations that can be made for the prevention of this unfortunately too common error — to say nothing, again, of deliberate tampering with data — are basically the same as those applying to objective elimination (cp. 6; 3; 4):
(1) Ad hoc coding should be avoided whenever possible; that is, there should be a deliberate effort to make all decisions in advance.
(2) If this is impossible, the decision can a) be delegated to a non-contaminated assistant or colleague, or b) be based on logical, objective grounds such that contamination can be ruled out — in this order of preference.
(3) If this is not possible, one should at least openly report the ad hoc coding that has taken place and, in addition, process the testing materials once more according to the original coding method, reporting the outcomes.

These recommendations, with the exception of the second half of (3), apply even more strongly in those cases where no coding method has been formulated in advance. Particularly the first recommendation is important: try to avoid it. The best procedure with existing materials is often to set a sample of them apart for use in developing a coding method in a preliminary investigation (cp. 5; 1; 4). If this step is omitted, all sorts of difficulties, apart from contamination problems, are to be expected. One of the most common mistakes made in unquestionably well-intentioned research projects — and this goes also for exploratory investigations (cp. 2; 2; 3 and 2; 2; 4) — is that the researcher decides to collect an interesting, 'rich' set of materials and to leave the 'details' of deciding how they are to be processed to draw conclusions (test hypotheses or, possibly, evolve hypotheses or structure the materials) until after they have been obtained. The often undeniable 'richness' of the materials — introspection protocols, responses to open opinion-survey questions, real-life case histories and the like — will all too frequently contrast rather shrilly with the scantiness of the conclusions that can be drawn from them. This is not to say that such procedures are impossible or of necessity useless. In investigations of a descriptive and/or exploratory nature, they are often necessary, albeit extremely difficult to handle properly (cp. 9; 1; 4 and 9; 1; 5). But in the context of hypothesis testing investigations, ad hoc coding, in any shape or form, should be avoided as much as possible, if not entirely.
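To make the first recommendation concrete: a merging rule of the kind discussed above (religion recoded into 'RC' versus 'non-RC') can itself be fixed as a small, fully objective program before any sample distribution is inspected. The sketch below is our illustration, not a procedure from the text; the category labels are taken from the example:

```python
# Recoding rule fixed in the research plan, before the sample distribution
# is known — not chosen ad hoc after inspecting the data.
MERGE_RULE = {
    "Roman Catholic": "RC",
    "Reformed": "non-RC",
    "Lutheran": "non-RC",
    "No religion": "non-RC",
}

def recode(category: str) -> str:
    """Mechanical-objective reclassification; an unanticipated category
    raises an error instead of being silently forced into a class."""
    try:
        return MERGE_RULE[category]
    except KeyError:
        raise ValueError(f"Category {category!r} was not anticipated in the code")

sample = ["Reformed", "Roman Catholic", "No religion", "Lutheran"]
print([recode(c) for c in sample])  # ['non-RC', 'RC', 'non-RC', 'non-RC']
```

Written down in this form, the rule leaves no latitude for a hunch that 'this better suits the researcher's book': the mapping is decided once, in advance, and every case is then classified mechanically.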
7; 2 QUESTION FORM AND PROCESSING TECHNIQUES

7; 2; 1 Relationships between collection and processing
The separately coded response to a 'question,' asked either of a person or of a case in the materials, is the basic element of many instruments. This is the form in which the response becomes available for further analysis. But usually this will not as yet yield the value of the variable sought. The raw, separate responses must undergo further processing — they must often be combined — and this, too, must be done objectively. In many cases this further processing is quite complex, so that there is a large 'distance' between its final product, the value of some variable, and the raw observational data. In a survey, for instance, individual respondents will be asked to state their opinions on various issues, but the researcher may be interested in establishing certain underlying group characteristics which are not immediately evident from the distribution of the responses (e.g., the 'latent structure' of the set of respondents, LAZARSFELD 1954). Or, experimental subjects are given a set of tests, but the actual objective is to do a factor analysis to determine factor loadings and factor scores, or to measure indirectly certain personality dimensions (e.g., EYSENCK 1952b). Or again, the researcher's real objective in collecting observables in an experiment is to calculate an intervening variable that is a complex function of the primary data; etc.

Technically, the requirement of objectivity does not in itself pose any really new problems here: most of these processing techniques are in the nature of classifying, counting, computing, or some more complex mathematical operations, which are either objective by nature or capable of being objectified without too much difficulty. The 'art of processing,' however, again consists in doing this in such a manner that the outcome — the value of the variable — not only is obtained objectively but has relevance as well. The problem is to establish an objective processing technique that suits both the primary data and the purpose of instrumental realization. But we must widen the scope of the problem a little further still.
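Before doing so, the simplest case of such further processing — an objective combination rule applied to the separately coded answers — may be illustrated. The items and the rule below are invented; in an actual investigation the rule would itself be part of the operational definition of the variable, fixed in advance:

```python
# Each respondent's record: separately coded answers to five items (0/1).
# The processing rule — here an unweighted sum — is part of the operational
# definition of the variable; more complex functions (factor scores, latent
# classes) stand at a greater 'distance' from the raw data but are combined
# just as objectively.
responses = {
    "resp_01": [1, 0, 1, 1, 0],
    "resp_02": [0, 0, 1, 0, 0],
}

def scale_score(item_codes):
    """Objective processing: the variable is the number of positive codes."""
    return sum(item_codes)

for rid, codes in responses.items():
    print(rid, scale_score(codes))  # resp_01 3, resp_02 1
```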
Objective processing presupposes a 'goal' and an 'object' (cp. 6; 1; 1). The goal is: instrumental realization of a construct-as-intended; the object (that which is to be processed) consists of: observational data to be collected on some real-world system, or: the responses to a set of judiciously asked pertinent questions. Now the nature of the goal and the character of the object will together determine, first, what (objective) data collecting technique, what method of asking questions, is the most adequate. Secondly, the instrumental goal, again together with the object system — now in the form of data as generated by the technique chosen — will determine what processing method is the most adequate. Conversely, however, the adequacy of the data collecting technique will depend also on what processing methods are envisaged. In other words: the problem of adequate data analysis cannot be divorced from the problem of adequate data collection. In the instrumental realization of a construct, therefore, they must be solved together, with due allowance for their interaction.

This can be taken care of in a variety of ways. Different instrumental-statistical technologies have been developed for different problem areas — and by different investigators. Some of the observation and/or processing techniques are of a fairly general nature, for instance, the construction of Guttman scales (GUTTMAN 1950), or factor analysis (see e.g., THURSTONE 1947); others were designed for more restricted purposes, for instance the forced-choice technique in personnel screening (SISSON 1948). The variety is virtually unlimited. All the greater, therefore, is the merit of the American investigator Clyde H. Coombs, who arranged all these different models of measurement, all these methods of data collection and analysis, according to their basic logical structures and unified them under a general comprehensive system (COOMBS 1953 and 1964). Such a system allows a general survey to be made of existing techniques (cp. also TORGERSON 1960). Through the stress it lays on the basic logical structures and the assumptions implied in each collecting and processing method, it leads to the detection of frequently surprising cross-ties between techniques that had previously evolved independently of each other. In addition, the logical elaboration of the system has led to the development of new question forms and methods of analysis.
7; 2; 2 Measurement and measurement scales

Here we shall briefly discuss only one aspect of the problem, namely the extension and differentiation of the concept of 'measurement,'
resulting from the development exemplified by the work of Coombs (cp. also e.g., STEVENS 1946 and 1951; TORGERSON 1960, and many others). Previously, measurement was generally conceived as a process in which the magnitude of something was empirically determined as precisely as possible and expressed in a numerical figure subject to the common operations of arithmetic. In the behavioral sciences, too, the efforts of those intent on using exact procedures were directed mainly toward evolving measurement techniques and variables of this kind, that is, variables for which the values are magnitudes, quantities, numbers of units, extensive measures, metric distances, rates, and the like. With the growing need for objectification and the growing arsenal of techniques of objective instrumental realization, however, the notion of measurement has been broadened. 'Measuring' has become equivalent to mapping into an objective scale, but not necessarily a metric 'scale.'¹ Stated otherwise, 'measuring' is assigning numbers to objects on the strength of certain objective empirical operations. What operations of arithmetic then apply to these numbers depends on the scale in which they are to be read. Four main types of measurement scales may be distinguished (STEVENS 1946, COOMBS 1953):

1. The nominal scale. Insofar as numbers are used — which is not necessary here but often convenient — each object to be measured is assigned a number, which, however, is no more than a code number. Nothing essential will change — i.e., no information will be lost nor spuriously generated — if the numbers used are, one by one, replaced by a set of other arbitrarily chosen code numbers (one-to-one transformation). Different objects may have the same code number, which indicates that they fall within the same category. 'Measuring' in the nominal scale is objective classification into qualitatively different categories. In other words, identifying plants or classifying occupations is 'measuring,' if it is done with complete objectivity; dividing experimental subjects into men and women (and assigning them the respective numbers 0 and 1, or 1 and 0) is likewise measuring.
¹ Torgerson does not go quite so far. He reserves 'measuring' for those cases where a 'scale' can be said to be involved which runs from 'low' to 'high.' To him, the nominal 'scale' is not a measuring scale; 'measuring' starts at the ordinal scale, with or without a fixed numerical 'zero' value (TORGERSON 1960, Ch. 2; 3).
2. The ordinal scale. Here the numbers indicate rank order. If the set of numbers used is replaced by any set of others in such a way that the sequence remains the same (monotonic transformation), nothing essential changes. One may choose whether or not to permit different objects to be given the same rank order ('ties'). If one permits this, special rules must be adopted. Rank ordering objects according to the magnitude of an observed attribute and assigning ascending or descending but otherwise arbitrary numbers to them is 'measuring' in the ordinal scale. This is done, indirectly, if only the rank order of metric measurement outcomes — the body heights of a regiment of recruits, the lengths of time needed by each of ten experimental subjects to perform an action — is taken into consideration. On the other hand, experimental subjects and judges called in for research purposes (7; 3), or practical purposes, are often asked to give their ratings directly on ordinal scales. Thus, students in a class can be ranked on their achievement; industrial jobs on the level of their overall requirements; beauty queens on their looks; various types of diplomas on the difficulty of obtaining them; etc.

3. The interval scale. This is the first 'metric' scale. Here the numbers are metric to the extent that intervals between measuring points can be compared, and therefore 'measurement' in a stricter sense can take place. Means of scale values now make sense: the mean of 4 and 8 is 6 only, it should be noted, if it can be assumed that 8 − 6 = 6 − 4, that is, that equal scale differences represent equal intervals of attribute magnitude. But that is not yet to say that '8' represents twice as large an attribute magnitude as '4.' Continental school grades are a case in point. Traditionally, they can be and often are averaged, but there is no fixed 'zero' point; in particular, zero cannot be construed to mean no achievement at all. Nothing essential changes, therefore, if a constant number is added to or subtracted from all the figures used and/or if all figures are multiplied by some fixed number (linear transformation). Intelligence quotients, for instance, are indeed averaged, but IQ 140 does not mean 'twice as intelligent' as IQ 70. If one should so wish, one could divide them all by 100 and/or take, say, 0 as the average instead of 100.

4. The ratio scale. Here the numbers are full-fledged metric measures with the aid of which the magnitudes of the attributes of the objects of measurement can be compared directly. This obtains with most physical measures: length, volume, duration, speed, pressure, energy, etc. A ratio scale is an interval scale with a fixed zero value. The only transformation
allowed without changing anything essential is to multiply all scale values by the same number; nothing is changed in that case except the unit of measurement (scalar transformation). In the behavioral sciences, typical categories are, for example, amount of time needed for a task or process, and amount of output (measured in identical units).

The order of the scales presented here was from 'weak' to 'strong.' From the viewpoint of measurement precision and mathematical data processing, less can be done with weak scales than with strong ones. On the other hand, when strong scales are used, more assumptions are imposed on the observational data, and the crucial question is whether this is justified (see 7; 2; 4). In any event, the existence and deliberate use of weaker scales in the social and behavioral sciences has already proved extremely useful, especially since in the last two or three decades a large number of new statistical tools have been developed which allow weaker data to be treated with exactness. In contrast to the older parametric techniques of hypothesis testing, which assumed not only an interval scale but for the most part also a normal distribution of the variable within the population, the nonparametric methods of hypothesis testing (SIEGEL 1956) are not based on such assumptions. With these methods, objective and adequate statistical treatment is possible for qualitative (systematically categorized) or rank-ordered data. As a result, the relevance of a variable as the operational representative of a construct is no longer, as often used to be the case, necessarily questionable because of an undue quantity of gratuitous assumptions implied in its operational definition. The number of possibilities for adequate instrumental realization and processing has again been enlarged.
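The practical force of the weak-strong distinction can be shown with a small numerical experiment: under a monotonic (order-preserving) transformation — permissible on an ordinal scale — the median of a set of scores is carried along intact, while the mean is not, which is one reason why averaging presupposes at least an interval scale. A sketch, with invented scores:

```python
import statistics

scores = [1, 2, 3, 4, 10]

def monotonic(x):
    # An order-preserving but non-linear transformation (allowed on an
    # ordinal scale, not on an interval scale).
    return x ** 2

transformed = [monotonic(x) for x in scores]

# The median commutes with monotonic transformations...
print(statistics.median(scores), statistics.median(transformed))  # 3 9
print(monotonic(statistics.median(scores)))                       # 9

# ...the mean does not: mean(f(x)) differs from f(mean(x)) in general.
print(statistics.mean(scores), statistics.mean(transformed))      # 4 26
print(monotonic(statistics.mean(scores)))                         # 16
```

In other words, a statistic is only meaningful for a given scale if it survives all transformations that the scale permits.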
7; 2; 3 Scale construction and measurement as analogue representation

Measurement in this extended sense is clearly a very basic activity in all scientific enterprises, as well as in everyday life. It is a device, or rather the pre-eminent device, to bring the relationships, situations, and processes in the phenomenal world within our grasp, and, notably in applied areas, to control the given natural and cultural phenomena. It seems useful, therefore, to consider what precisely it is we do when we construct a scale for a particular purpose and make measurements with it.
As usual, the corresponding problems will be discussed in a general manner without attempts at mathematical formalization. Whenever, for the measurement of certain phenomena — in the form of a variable — a scale is adopted, this implies the choice of a particular mathematical model, with corresponding axioms and rules of arithmetic. For a first approximation, the implications of choosing one of the various models can be elucidated in terms of a spatial interpretation of the measurement scales. The phenomena of the outside world, the real object systems which are observed and recorded, are represented in analogue in the model of the scale chosen. The scale, with the corresponding results of measurements, may be regarded as a 'map' of the real object system; hence the term 'mapping' is sometimes used for constructing or choosing a scale and implementing it with data. This map must correspond as closely as possible to the real object system; hence the method of mapping must be judiciously selected. From this viewpoint we shall take another look at the four scales.

The nominal scale corresponds mathematically to a variable defined by the partitioning of either a finite or an infinite set; spatially, to the representation of an open or closed space divided into regions. The implications and possible complications of the model are described mathematically in set theory and can be spatially represented in Venn diagrams (for an introduction to the subject, see e.g., KEMENY, SNELL and THOMPSON 1957, Chapters 2 and 3). Phenomenally, this mode of representation, with its appropriate statistical techniques of analysis, is adequate in cases where objective categorization and counting according to qualitative characteristics is warranted and possible. Such cases are extremely common; our earlier examples (sex, species, occupation) can be readily supplemented with a host of others: nationality, religion, political party, school subjects, 'types' in a typology, etc. The nominal scale model does not put any restraint upon the number of classes nor on the number of elements within one class. For instance, classification according to name or telephone number — where each class comprises but a single element — also constitutes nominal categorization.

The ordinal scale corresponds mathematically to a variable defined by a series of ascending or descending numbers to which any monotonic transformation can be applied, including the one to the series of natural numbers: 1, 2, 3, etc. Spatially, the simplest analogue representation is that of discrete points on a straight line, which can be individually
shifted along at will but which must remain discrete and are not allowed to pass each other (like beads scattered on a long string). In most cases the model implies that new points may be inserted between any two given points, but this is not necessarily so. In a system of military ranks or (Hindu) castes it is expressly understood that there are no intermediate points; a system of social classes, however (e.g., the usual American range from upper upper to lower lower), is usually conceived as a continuous dimension.

Phenomenally, the ordinal analogue is applicable whenever any two elements, ai and aj, of a set of objects, A, can be appropriately compared on an attribute, X, in such a way that the outcome is either: ai is more X than aj, or: ai is less X than aj. If in the ordinal model 'ties' are allowed, there is room for a third possibility, namely that ai and aj are indistinguishable as regards their degree of X. In that case, they are assigned the same numerical value. Omitting X, the relationship can be symbolized as follows: ai > aj (ai is more X than aj); ai < aj (ai is less X than aj); ai = aj (ai and aj are indistinguishable on X). The first two are asymmetric, the last is symmetric, i.e., ai = aj is equivalent to aj = ai. For both types of relationships transitivity must be assumed. The transitivity condition states that if ai > aj and aj > ak, then necessarily ai > ak, and respectively that if ai = aj and aj = ak, then ai = ak.

The situations where these conditions obtain are also of frequent occurrence. Whenever phenomena exhibiting an increase of a particular quality or property ('Steigerungsphänomene,' SELZ 1941) can be observed or abstracted, such as intensity, momentum, magnitude, size, intricacy, masculinity, spatial position (e.g., from left to right — provided the arrangement is not circular), beauty, radicalism, tone pitch, etc., there is a continuum on which we can in principle build an ordinal scale. Some of these phenomena can also be treated metrically, but some cannot, e.g., in physics the continuum of solids (rocks, crystals) according to hardness. A standard problem in the social sciences is whether for a given continuum an ordinal scale can be transformed into an interval scale, and an interval scale into a ratio scale (cp. e.g., STEVENS 1951; TORGERSON 1960). Discrete rank order systems occur mainly where humans have constructed them; for instance, in organizational structures and hierarchies.
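Because transitivity is a precondition for mapping comparison judgments onto an ordinal scale, it can be checked mechanically before a rank order is constructed. A minimal sketch (the judgments and element names are invented for illustration):

```python
from itertools import permutations

# Pairwise comparison judgments: (x, y) means 'x is more X than y'.
judgments = {("a", "b"), ("b", "c"), ("a", "c")}

def transitivity_violations(prefs):
    """Return all triples (x, y, z) with x > y and y > z but not x > z."""
    items = {e for pair in prefs for e in pair}
    return [
        (x, y, z)
        for x, y, z in permutations(items, 3)
        if (x, y) in prefs and (y, z) in prefs and (x, z) not in prefs
    ]

print(transitivity_violations(judgments))  # [] — an ordinal scale is possible
# A cyclic set of judgments yields three violations (listing order may vary):
print(transitivity_violations({("a", "b"), ("b", "c"), ("c", "a")}))
```

Only when the list of violations is empty can the comparison judgments be represented, without loss, as points on a single ordered line.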
In the study of such systems, the same problems of transforming the given ordinal scale into an interval scale are often of importance, since the question is apt to arise as to how the distances between the various levels compare.

The interval scale corresponds mathematically to a variable (or set of numbers) of which only those properties are considered that are invariant under linear transformation: x' = ax + b (with a ≠ 0); spatially, to points on a line, the total pattern of which can be shifted, or proportionally stretched or contracted, at will. In other words, there is no fixed 'zero' point on the line, and the unit of measurement is arbitrary. Here, an underlying continuous dimension is assumed, since operations like 'taking the mean' — which is permissible here (7; 2; 2) — make it possible in principle to arrive at any intermediate (rational) point on the line. Phenomenally, the most important condition for applicability of this analogue is that it must be possible to speak meaningfully of 'equal intervals' between any two pairs of elements or observation points. In the temperature scale, for instance, the difference between 40° and 30° centigrade is, by definition and measurement devices, guaranteed to be as large as that between 30° and 20°; but a temperature of 40° is not 'twice as high' as 20°. Apart from this, other examples of obviously justified interval scales are not easy to find in the phenomenal world. If they are not ratio scales in disguise, they are often constructed or postulated for continua which primarily lend themselves only to ordinal rating.¹ A professor marking test papers on some numerical scale may make a deliberate effort to maintain equal differences in quality between papers marked by consecutive scale numbers. This is in fact his only justification if he is in the — widespread — habit of calculating averages. The scale he uses is then assumed to be an interval scale. But the materials themselves by no means make it clear that the judgmental intervals introduced between consecutive scale points are in fact equal. They may of course be based on some systematic calculating procedure, for instance, '.5 off for each mistake' — likewise in test psychology: 'one point for each item.' But such a system in turn assumes a, basically arbitrary, equivalence of all the mistakes or all the items — or, in psychophysics, of all the 'smallest perceptible differences' (FECHNER 1860; THURSTONE 1927). While there are numerous experimental methods of data collection and techniques of data analysis which can do much to diminish the arbitrary nature of such assumptions of equivalence (cp. e.g., TORGERSON 1960), the fact remains that these techniques in turn introduce fresh assumptions. The model of the interval scale (distances on a line with no fixed 'zero' value) is seldom applicable to the phenomenal world without rather strong further assumptions.²

The ratio scale, finally, corresponds mathematically to a variable (or set of numbers) for which only those properties are considered that are invariant under scalar transformation: x' = ax (with a > 0); spatially, to points on a line with a fixed zero value, where only the unit of measurement is arbitrary.³ Phenomenally, this analogue corresponds to all cases where the question 'how much?' or 'how many?' can be perfectly answered. This is the case with all so-called 'extensive properties' (cp. COHEN and NAGEL 1934, Ch. 15), where one can count from 0 or measure a quantity (metrically) — encompassing all variables that can be expressed in numbers of elements, size, magnitude, or duration of time. In the physical sciences almost every observable is measured in this fundamental metric scale. In the social and behavioral sciences, too, it is easy to find obvious examples: the frequency of a phenomenon, the time taken to accomplish a task, reaction speed, the number of units produced per time unit, the quantity of saliva secreted by a dog in a Pavlov-type experiment, etc.

¹ Of course, the time scale we use, and also our spatial scales, are, in the abstract, interval scales — without a fixed 'zero' value. If we consider, however, that it is generally a 'property' of 'a system' we want to measure — e.g., the duration of a process, the position of an object, the distance from 'zero' — then we can indeed maintain that interval scales are rare, at least in laws and hypotheses about nature and culture. Incidentally, the temperature scale is not such a simple and convincing example either, among other things because of the existence of an absolute zero temperature. But this is not an aspect we can go into here (cp. CRONBACH and MEEHL 1955).

² In many practical applications the real crux of the matter is neither in the nature of the data nor in the sophistication of the psychometric techniques of analysis and scale construction; the justification for the equivalence of qualitatively different errors or items springs rather from a silent social contract between tester and testee: such scoring is 'equitable' (see DE GROOT 1969).

³ Alternatively, one could keep the unit constant and proportionally stretch or contract the pattern of the points, according to the formulation chosen for the interval scale (p. 217). In the same way, the operation of shifting the zero point along the line — forbidden in the case of the ratio scale, but allowed in an interval scale — is of course equivalent to shifting the total pattern of points in the reverse direction.

7; 2; 4 Problems of isomorphism

By no means all phenomena of the real world lend themselves to suitable analogue representation by the
mere application of one of these four scales. The question whether an analogue is appropriate is usually called the problem of isomorphism. All measurement is based on the assumption that the phenomena of reality and the model or scale in which they are measured are isomorphous. Is this assumption justified?

With the nominal scale, there may be borderline cases which escape categorization. This difficulty may occur in many fields, for instance, in typologies based on discrete patterns or 'types.' It cannot always be solved simply, by the elimination of cases (from the universe or the sample) or by the creation of a new category for 'other cases.'

With the ordinal scale, it may happen that the relative positions can be determined for some pairs of data but not for others. In rank ordering school diplomas according to their level of difficulty, for instance, it may be easy to decide on pairs involving only more and less advanced programs in the same fields, but impossible to reach intersubjective agreement (see 7; 3) where different fields are concerned (e.g., Latin versus mathematics). A case in point is, in the Dutch system, the question of the relative position on a difficulty scale of the Gymnasium A and the HBS-B diplomas (mainly humanities and mainly sciences, respectively); according to the prevalent opinion they are 'incomparable.' From the viewpoint of scale construction, a solution is provided by introducing a 'partially ordered' scale as an intermediate form between nominal and ordinal (COOMBS 1953). But if one is interested in further analyzing such data, one usually adopts a different procedure: a number of assumptions are introduced and the scale is transformed into an ordinal one, for instance by regarding cases of uncertain relative positions as equivalent. Another intermediate form, this time between ordinal and interval scale, may be found in the 'ordered metric' scale (op. cit.), which may serve as an analogue representation whenever the 'intervals' can be determined for some pairs of elements but not for others. But, again, in practice this scale is not easy to manipulate. A solution is often found by introducing certain assumptions, and thus transforming it into an interval scale. Likewise, interval scales are sometimes transformed into ratio scales by introducing a fixed zero value — through a procedure based on fresh assumptions.

Some problems of analogue representation can be solved only by distinguishing more than one dimension. Whenever description and discrimination of 'structures' or 'types' is involved, none of the major unidimensional scales is adequate — with the possible exception of a nominal
typology (see above). On the other hand, there is a possibility of working with vectors (e.g., multi-dimensional temperamental and/or physical types: HEYMANS 1932, KRETSCHMER 1921, SHELDON and STEVENS 1942). To gain the maximum benefit from the spatial analogue and the corresponding mathematics, it will often be good policy not to regard each of the distinguishable variables as a separate dimension, but to take account of the empirical relations among the variables. This is often done by spatially representing non-correlated variables (r = 0) as right-angled vectors — they have no 'causes,' i.e., spatially, no components, in common — and correlated ones (|r| > 0) as vectors with a relative angle that becomes smaller as the correlation becomes bigger. In factor analysis (FRUCHTER 1954; THURSTONE 1947) and other techniques of (metric) multivariate analysis, as well as in the unfolding techniques and related methods for weaker scales developed by Coombs and his associates (COOMBS and KAO 1954), multidimensional models of various kinds are used for mapping real object systems.

Each of these models is based on a set of assumptions, and these are apt to grow in number as the analogue representations become more complex and the scales stronger. As a result, in using complex, multidimensional models the crucial issue of isomorphism often becomes difficult to handle properly; that is, it may become increasingly difficult to keep track of what one is 'really' doing to his data, and, correspondingly, to interpret what the ensuing outcomes mean. While we shall not discuss details of these elaborate procedures, we want to point out expressly a new form of 'dilemma of the social scientist' (COOMBS 1953, p. 485), which has been brought out repeatedly in the foregoing discussion. The issue here is not objectivity versus relevance, but a closely related dilemma: the choice between true correspondence of the selected abstract system to the phenomena (or fidelity to the real object of study) and manipulability of the data in their mapped form. Admittedly, it will hardly ever be possible to collect (or process) data and map them into a suitable abstract system without making some prior assumptions. But the number of assumptions one makes can be greater or smaller; and one can strive, first of all, to hold down the assumptions to a minimum, and secondly, to be clearly aware, in introducing them, where, when, and why they are made. In Coombs' terminology: in selecting methods of data collection and analysis — always in the context of the determination of a variable (the instrumental realization of a
construct) — it is essential to distinguish sharply between the information contained in the data and the (pseudo-)information imposed on them by the measurement system. What is to be avoided specifically is that 'fidelity' be unduly sacrificed to 'manipulability.'
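The geometric convention mentioned above rests on the identity r = cos θ: uncorrelated variables stand perpendicular, and the angle between the variable-vectors shrinks as the correlation grows. A small numerical illustration (the correlation values are invented):

```python
import math

# Under the vector model of variables, the angle between two variable-
# vectors is the arc cosine of their correlation: theta = arccos(r).
for r in (0.0, 0.5, 0.87, 1.0):
    theta = math.degrees(math.acos(r))
    print(f"r = {r:5.2f}  ->  angle = {theta:6.1f} degrees")
# r =  0.00  ->  angle =   90.0 degrees   (no common components)
# r =  0.50  ->  angle =   60.0 degrees
# r =  0.87  ->  angle =   29.5 degrees
# r =  1.00  ->  angle =    0.0 degrees   (identical directions)
```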
7; 3 JUDGMENTAL PROCEDURES: INTERSUBJECTIVITY
7; 3; 1 Judges as measuring instruments
Some qualitative data, and some real-life situations, are so complex that it does not appear possible to find an objective measure for the construct-to-be-defined or the factor under investigation that is sufficiently relevant. In situations of this kind it is fairly common, even in rigorously organized hypothesis testing investigations, to resort to the use of a judge as a measuring instrument. In other words, the wine of objectivity is watered down: the judge performs a task which could not, or could only with great difficulty, be taken over by a machine. Usually, this method is resorted to because no better solution is available and/or because in the field in question the prevailing and socially accepted practice is to be guided by the judgment of experts (e.g., the medical specialist who diagnoses 'asthma' or 'ulcus,' cp. 5; 3; 2).

The essential requirement is, of course, that there is sufficient confidence in the 'degree of objectivity' with which the judge operates. This implies, in connection with our machine definition of objectivity (6; 2; 1), that a systematic analysis of his method of judging, if actually carried out, would go a long way toward construction of a satisfactory formula (machine program) which could replace the judge. What is required, therefore, is a satisfactory degree of objective specifiability. This can be held to exist if there are sufficiently valid grounds to assume that the judge is guided, whether consciously or intuitively, by a system of reasonably constant, if undefined, criteria. Empirically, this may be ascertained from the consistency of his ratings — e.g., transitivity (see 7; 2; 3) within a set of independent comparison judgments — and especially from the reliability of his judgments in independent replications of the procedure. As long, however, as his system remains unspecified — and explicit
(machine) objectivity is therefore unattainable — there must be guarantees that the system itself is not unduly subjective. The major empirical criterion that can be applied here, and hence also to what is generally understood by the 'degree of objectivity' of a judge, is the extent to which the ratings of one judge (expert) correspond with those of other judges (experts). In research settings where a number of judges are employed, this degree of intersubjective agreement or inter-judge reliability can be empirically assessed and can serve as an objectivity control — provided rigorous safeguards are set up against mutual contamination of the judges.

In judgmental procedures, it is mostly this criterion of intersubjectivity which takes the place of the objectivity requirement. This is not to say that the two notions are strictly equivalent in meaning: complete intersubjectivity between judges is still not synonymous with objectivity, for the system is (as yet) unspecified. But in their general purport the terms are closely related. The social significance of the objectivity requirement lies mainly in the circumstance that, whenever there is objectivity, complete intersubjectivity is guaranteed; that is, there is complete agreement on scale values and thus on the practical (operational) meaning of the variable in question. Hence it is frequently sufficient to require no more than the agreement itself, or even only 'a reasonable degree of intersubjective agreement' among individuals considered capable of expert judgment. This, again, widens the range of methods by which we can get hold of relevant factors, though admittedly with some, it is hoped not undue, loss of objectivity.

Judgmental procedures are fairly frequently used, in particular, in constructing so-called criterion variables. 'Criteria' or 'criterion variables' are typically used in two kinds of research projects. One type is evaluation research (6; 2; 2): the effect measure is often called the criterion. Examples have already been given in the foregoing, such as the 'insight gained' (as measured by an achievement test) to determine the effectiveness of a particular method of geometry instruction (6; 2; 3), or the 'lessening of inner discord' (as measured by the Q-sort technique) to assess the effects of therapeutic treatment (6; 2; 4). In validation research, criteria are the variables against which the validity of a particular method of prediction is measured. Examples are: operationally defined measures for 'scholastic achievement' or 'job proficiency' as used to validate
(assess the validity of), for instance, a given aptitude test. They are needed to answer the question: To what extent are the test predictions fulfilled? The criterion is, then, the outcome variable with which the predictor variable is correlated in order to determine its (predictive) validity¹ (see further 8; 2). The criterion here represents operationally the prediction goal, that which is to be predicted (for an individual or case), just as in evaluation research it represents the effect goal.

In applied areas, a common characteristic of the two types of research is that the goal is, to a considerable degree, determined by the requirements of social conditions, so that the criteria must either be directly derived from these conditions or at least be constructed in close rapport with them. As a result, it is often very difficult for the effect goal or the prediction goal (success, aptitude, adjustment, health, possibly even 'happiness') to be given an instrumental realization which is both objective and relevant. Hence the frequent practice in applied areas of using incompletely objective criteria based on judgments. Examples are: school grades (given by teachers) as measures of acquired knowledge or course achievement; personnel ratings by supervisors as measures of proven job proficiency; assessment by a clinical psychologist of 'improved adjustment' as a result of therapy; etc.

For some criterion constructs, or for some aspects of them, there is no other possibility than to use judges. This is clearly always the case whenever the research object is to determine, not what a person or thing is like, but how he (it) is appreciated or judged. The degree to which an individual is 'socially adjusted' in his day-to-day life, for instance, must be considered by definition to depend in part on the extent to which he finds acceptance and appreciation with individuals in his environment. In addition, then, to criteria derived from his own attitude towards life (self-report criteria, e.g., 'inner discord,' see above) and objective behavior criteria (e.g., quality of work, absenteeism, frequency of medical treatment, objectively observable symptoms), judgments by others are needed as criteria (cp. e.g., FIEDLER, DODGE, JONES and HUTCHINS 1958; FIEDLER, HUTCHINS and DODGE 1959). Evidently, then, judgmental procedures are indispensable. But, given the risks involved (6; 1; 2), what can one do to control for reliability and intersubjectivity?

¹ The terminology of (predictive) 'validity' and 'criteria' has its origins in, and is still in the main confined to, test psychology. As we shall see in 8; 2, however, the underlying principles are applicable to many other fields. 'Diagnostic-predictive procedures,' for instance, are found in numerous applied areas; whenever study of an individual, a group or a situation, by a given method, leads to a statement of a predictive nature, it is reasonable to inquire into the validity of the method. The validity issue, like the evaluation problem, is of fundamental methodological importance, since prediction, together with the ability to control or influence phenomena, is among the prime objectives of (applied) science (cp. 1; 3; 1).
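In the simplest case, the (predictive) validity spoken of here is just the correlation between predictor and criterion. A minimal sketch with invented figures (eight fictitious persons; the product-moment coefficient is computed directly from its definition):

```python
import math

def pearson_r(x, y):
    """Product-moment correlation between two equally long lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Invented data: aptitude-test scores (predictor) and later job-proficiency
# ratings (criterion) for the same eight persons.
predictor = [12, 15, 9, 20, 14, 17, 8, 16]
criterion = [3.0, 3.5, 2.0, 4.5, 3.0, 4.0, 2.5, 3.5]
print(round(pearson_r(predictor, criterion), 2))  # validity coefficient: 0.96
```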
7; 3; 2 Specific problems in judging

Admittedly, after our initial disavowal of the judge (6; 1; 2), we have now reinstated him, if only for certain problems of instrumental realization that defy a strictly objective solution. Consequently, a discussion is in order of the precautions and controls (7; 3; 3) that can aid in keeping the disturbing influences of such necessary subjectivity within reasonable bounds. Before we can do this, we must first form an idea of the specific objectivity problems that may occur when judgmental procedures are adopted.

These problems may be of many kinds. They can best be illustrated by a practical example, such as the grading of answers to a given question in a history examination. We shall assume that a 'reasoned answer' in the form of an essay is required; furthermore, that there are N candidates, numbered 1, 2 ... i, j ... N, and that the judgment must be expressed as a numerical grade, on the descending scale 10 to 1, for the 'insight shown' into the subject matter. We assume there are two judges: the history instructor (H), who has taught the candidates, and the state supervisor (S), who knows neither the instructor nor the candidates. H first judges the work and marks his grades on the essay papers. These are then passed on to S, who in turn expresses his judgment by the grade he writes on each paper. Both will make a sincere effort, we assume, to be as objective as possible in judging each paper entirely on its own merits. What are the weaknesses inherent in this common procedure? What are its sources of error?

From our viewpoint, the problem is one of instrumental realization, in this case of the concept 'insight shown' into history, specifically into the historical developments of the specified period. The judges H and S must isolate this aspect (a) from the essays and judge it independently of all other aspects (b, c ... etc.). The first thing that strikes us is that there are so many other aspects and facets. Each essay bears a name; it goes with an individual (known
to the instructor, H); its legibility may be good or poor (handwriting has a 'character' of its own); the spelling and grammar may be faulty; there may be a greater or smaller number of digressions, which in turn may or may not be pleasant to read. Some essays are well written, perhaps even stylish or witty; others may be boring or clumsily phrased. Some are long, others short; some circumstantial, others concise; etc. All these aspects have little or nothing to do with the question whether the essay shows the required historical 'insight' — and the crucial problem therefore is whether the judges will be able to suppress the disturbing effects resulting from them. As for the instructor H, it will be clear that — if one goes by the standards applicable to hypothesis testing and, strictly speaking, also to examinations — he knows too much to be able to form an objective judgment. He knows, for instance, that candidate 3, while not a 'brain,' is a very pleasant, outgoing personality, whose frequent contributions to class discussion were remarkable for sound common sense rather than historical expertise: 'this candidate will no doubt make his way.' On the other hand, he knows (or thinks he knows) that candidate 7 has not applied himself and is moreover 'a bit sneaky' in class. It will be difficult for H to free his mind from these personality characteristics; no matter how hard he tries, he cannot help reading the essays in a special way, associating them with the mental images he has formed of his students. But S, too, 'knows too much': he sees the handwriting — that of number 3 is perhaps 'mature,' 'controlled,' and highly legible, whereas that of number 7 is 'cramped' and hard to decipher. He sees the mistakes, reads the digressions, the peculiarities of style, etc. His judgment, too, will inevitably be influenced by these aspects (b, c ... etc.), which are irrelevant to a proper, objective, appraisal of aspect a. He, too, is subject to the so-called halo effect, the disturbing influence of obtrusive characteristics other than the a-variable1 to be judged. Besides, it is impossible to keep his judgment independent of that of H — the first thing he sees being the grade assigned by H. 1
The term 'halo effect* is mostly used in the context of personality ratings, e.g., when because of outstanding — positive or negative — social qualities, an employee is also rated too high or too low on other factors, such as initiative shown in the work situation or the quality of his work. In the eyes of his supervisor he can either 'do nothing wrong' or else 'do nothing right.' The same 'halo' or 'blinding' effect may occur also in rating non-human 'multi-dimensional' objects. Methodologically,
7; 3; 2
225
7.
OBJECTIVITY:
B. D A T A C O L L E C T I O N
AND
ANALYSIS
Moreover, H is an interested party — but so is S. H is anxious to see his students successful, to 'get good examination results,' because he regards this as a criterion for the quality of his teaching. S is a less interested party, but nevertheless he will prefer to 'steer clear of conflicts,' which would be certain to arise, for instance, if he downgraded all papers in comparison to H. In any case, it is almost inevitable that he will, to some extent, adjust his judgment to the average achievement level of this class. That will also be the case with H, or rather he will long since have adjusted his teaching methods and grading practices to the average class level. Further, the grading practices of both H and S will be subject to the effects of their respective personal equations — a term coined originally for the individual differences found to occur in astronomical observations at the Greenwich Observatory in 1796, and later adopted in a more general sense for all types of individual variations in judgments. For instance, the central tendency and dispersion in H's grading practice may be such that he will, as a rule, give no more than 5 % failing grades, very rarely a high grade (9), and never the top grade (10). S, on the other hand, may not be averse to 'getting out on a limb' by giving extremely high, or extremely low, grades when he personally feels such grades are deserved. The results in his case will be, say, 20 percent failing grades, including 3's and 4's, and an average of 10 percent grades over 8. Anyone familiar with the daily practices in education will know that such differences are not uncommon. Another highly important obstacle to intersubjective agreement is that H and S may hold different views about their task in judging aspect a. What really is 'insight shown' into history and how is it to be evidenced? H may set much store by an intelligent reproduction of the ideas he himself has put forward in his teaching. S, one the other hand, has in all likelihood not been informed about H's way of teaching. He may value other aspects; the yardstick he applies may be, first and foremost, whether or not the candidates have produced 'nonsense,' for example. He reads and judges what is down in black and white and may be less
inclined to give credit for 'good intentions'; also, of course, he has less data at his disposal to interpret such 'good intentions'.
The vagueness of the given task — to judge the 'historical insight shown' — is likely to affect not only the intersubjective agreement between H and S but also the judgmental reliability of each of them. The notion of what constitutes 'insight shown' and how this is to be evidenced may well shift in the course of the judging process. For one thing, this may occur as an influence of sequence effects: later ratings will not be independent of the foregoing ones. After a run of, for instance, three particularly poor efforts both H and S will tend to breathe a sigh of relief when the next one is fairly acceptable and accordingly be inclined to mark it 8 rather than 7 or 6.
To summarize, apart from the supposed objective quality of aj, the rating (grading) of aj is affected by:
1) the conception each judge has of his task (semantic effect);
2) the 'blinding' effect of other aspects (b, c ... etc.) that interfere with the grading of aj (halo effect);
3) the persistent influence of preceding essays (ai and further back) on the aj grade (sequence effect);
4) the universal and personal factors which the latitude in the use of the grading scale allows (distribution effects, such as adjustment to the average level of the group: norm shifts, and personal equations);
5) factors of personal interest — of motives other than open-minded, unbiased judgment — which the latitude of the entire judgmental procedure allows to interfere, whether consciously or unconsciously (contamination effects in a stricter sense).
While each of these five categories of rating errors may be reflected in reductions of reliability and intersubjective agreement, this is not necessarily so. For instance, the extra datum available to S — H's grade — is a variable whose undesirable, contaminating influence is likely to cause an increase in agreement between the grades given by H and S. The same thing applies to other, less evident contaminations. Only insofar as the disturbing factors lead to fluctuations in one judge's ratings over time or to deviations between judges, will their effect be reflected in a decrease of reliability or intersubjective agreement. Relatively constant and/or common peculiarities, prejudices, or interests of different judges cannot be detected, even less eliminated, by controls operated afterwards.
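The point that contamination may even raise apparent agreement can be made concrete. What follows is a minimal sketch in Python — with wholly hypothetical grades and helper names, not taken from the text — of the two empirical controls at issue, intra-judge reliability and intersubjective agreement, and of how a contaminated second judge inflates the latter.

```python
from statistics import mean

def pearson_r(x, y):
    """Product-moment correlation between two equally long lists of grades."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

# Hypothetical grades (scale 1-10) for the same ten essays.
h_first  = [6, 7, 5, 8, 4, 7, 6, 9, 5, 6]   # judge H, first reading
h_second = [6, 8, 5, 7, 4, 7, 5, 9, 6, 6]   # judge H, re-grading later
s_blind  = [5, 8, 4, 7, 3, 8, 6, 8, 4, 7]   # judge S, essays anonymized

# Contaminated variant: S saw H's grade first and moved halfway toward it.
s_seen = [round((s + h) / 2) for s, h in zip(s_blind, h_first)]

print("intra-judge reliability of H:   %.2f" % pearson_r(h_first, h_second))
print("agreement H-S, independent:     %.2f" % pearson_r(h_first, s_blind))
print("agreement H-S, contaminated:    %.2f" % pearson_r(h_first, s_seen))
# The contaminated coefficient comes out higher than the independent one --
# spuriously so: the extra agreement reflects the contamination, not quality a.
```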
So it is not enough to build empirical controls into the judging procedures. In addition, we must search for precautions to prevent contaminations of all kinds.
7; 3; 3 Controls and precautions
From the viewpoint of objectivity, the instrumental realization of quality a, 'insight shown,' clearly leaves much to be desired. There is a wide variety of possible disturbing subjective factors and it is quite clear that they may have a strong and confusing effect. What remedies can be proposed against this multitude of evils, supposing that such a quality, based on ratings by judges, were to be used in a hypothesis testing investigation? The five points (sources of error) mentioned in 7; 3; 2 correspond roughly with the following remedies:
1) reduction, simplification or more explicit description and definition of the judge's task (task a);
2a) maximum elimination of other irrelevant aspects (b, c ... etc.): what the judge need not know for an objective judgment of a, he must not know;
2b) to the extent that such elimination is impossible, for instance when individuals, test responses, essays, newspaper articles, works of art, or other complex entities are to be judged for one abstracted aspect, concentration on this aspect in such a manner that the procedure promotes abstraction from other, irrelevant, aspects (b, c ... etc.);
3) variation of sequence in the presentation of ai in the judging procedure, with built-in replications (thereby allowing consistency and reliability controls);
4) restriction of freedom in the distribution of gradings over the scale;
5a) the use of judges whose sole interest is to produce serious, expert, objective judgments;
5b) the use of a number of judges, working in complete independence, whose judgments can be compared and combined (thus allowing intersubjectivity controls).
Remedies 1 through 4 relate to the design of the judging or rating procedure. When applied to our illustration, for instance, they can be detailed as follows.
Sub 1) Reduction: This amounts to sharper delineation of the aspect-to-be-judged (a: 'insight shown') by more explicit specification of the judges' instructions. By definition, the instructions for a judgmental
procedure cannot be completely objectified, but they can be developed in the direction of an operational definition through some form of coding. To this end, empirical materials will be needed to develop and try out the method — with controls for practicability, reliability, and intersubjectivity. This means that pilot investigations will have to be carried out (cp. 5; 1; 4), a step in the design of judgmental procedures the importance of which can hardly be overemphasized. From these pilot studies, there will result instructions specifying what the judge must be on the lookout for, and how he is to evaluate and weigh each factor and aspect of a ('insight shown'). Sometimes a set of standard examples will be evolved for the judge's guidance. Simple illustrations of such 'semiobjective' coding methods may be found in the literature on mental testing, for instance in WECHSLER 1958, for the appraisal of responses to some subtests of the Wechsler Adult Intelligence Scale.
Where a complex aspect like 'insight shown' is involved, the instructions for the rating procedure are likely to require a listing of those specific aspects which the judge must include in his appraisal. For instance, have the essential facts (specified as, say, f1, f2, ... f5) been dealt with? Have the two major connections (c1 and c2) among related facts been clearly stated? Is the argument as a whole logically coherent? Or, are there non sequiturs and other errors of logic ('nonsense,' cp. p. 226)? That is to say, the aspect a will be first subdivided into 1a (facts), 2a (connections), 3a (logic) ... etc., each of which will be described and clarified as precisely and concretely as possible. Then a standard method of weighting and combining the ratings for 1a, 2a, 3a, etc. will be established to produce the final rating — either with or without any liberty for the judge to depart from the formula outcome on the strength of unspecified characteristics. In this manner the judge's appraisal is, at least in part, made subject to objective instructions. What remains 'free' has become simpler, more specified, and more clearly outlined.
Sub 2) Elimination and concentration: One effect of the steps described under 1) is undoubtedly that aspect a, having been specified, is now more clearly distinguished from b, c, ... etc. On the other hand, aspect a itself was subdivided into 1a, 2a, 3a ... etc.; and each of these subaspects has in turn become vulnerable to halo effects. Elimination of b, c ... etc., therefore, no matter how successfully carried out, cannot
provide more than a partial solution. Even so, the judge's task can be considerably refined by eliminating irrelevant data: the names may be left out; the answers may be re-typed in a uniform manner; and, possibly, spelling mistakes might be corrected, before the judge receives the materials. But one may not wish to go much further: faulty sentence structures, digressions, style, and length can be corrected, but these aspects may be inextricably interwoven with what is relevant to the rating of, for instance, 3a (logic). Judge H, in any case, cannot help recognizing at least some individuals from these peculiarities (cp. sub 5). A simple method of concentration is provided if the judge rates or compares all the essays for one aspect at a time (see 7; 3; 5). If he first reads and judges all of them for aspect 1a, and subsequently for 2a etc., this will at least diminish the risk of mutual interference (halo effect).
Sub 3) Variation of sequence, in replications: This is a comparatively simple matter, so long as independent replication itself presents no difficulties. The chief problem is that the judge has a memory; consequently, the second time around he may still know what he did the first time and simply act 'consistently.' The replication, then, provides no new information; the first rating (and the first sequence) has already been decisive: reliability controls are of no use. Safeguards against this — again, none of them perfect — may be found: (1) by allowing a certain amount of time to elapse between the two series; or (2) by having the judge do so many ratings that he may be presumed to have forgotten what he did last time; or (3) by applying not only repetitions but built-in indirect controls as well, for instance, by checking for intransitivities. These various methods will, at least, make it less easy for the judge to be deliberately consistent rather than to make a fresh, unbiased appraisal of each new case (cp. 7; 3; 5).
Sub 4) Restriction of freedom in the distribution of ratings: This is an obvious device to guard against the effects of all those personal idiosyncrasies in judging which make for interjudge differences in the distribution of gradings. It provides a considerably simpler solution than the ideal: to determine by a genuinely empirical method the 'personal equation' for each judge, and correct accordingly. Such more general phenomena as the 'error of central tendency' — i.e., individual differences in the tendency to keep down the spread in the ratings given, cp.
PATERSON 1950, p. 153 — can likewise be counteracted by prescribing the distribution of ratings in the scale. The simplest and most common method is to specify a 'forced distribution,' to which each judge has to conform in his own sample; he is to give, for instance, a rating 8 or higher to 10% of the objects or subjects, the rating 7 to 20%, etc. The prescribed total distribution is often designed so as to roughly approximate the normal distribution; e.g., in the case of five classes: 10%, 20%, 40%, 20%, 10% respectively (cp. e.g., BELLOWS 1956, p. 379).
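By way of illustration, here is a minimal sketch in Python (hypothetical ratings and function names; the quota values are the five-class example just cited) of how such a forced distribution might be imposed mechanically on each judge's raw ratings.

```python
def force_distribution(raw_scores, quotas=(0.10, 0.20, 0.40, 0.20, 0.10)):
    """Return one class label (1 = lowest .. 5 = highest) per object,
    conforming to the prescribed distribution within this sample."""
    n = len(raw_scores)
    # Rank objects from lowest to highest raw rating (ties broken by order).
    order = sorted(range(n), key=lambda i: raw_scores[i])
    # Cumulative class boundaries, e.g. 10%, 30%, 70%, 90%, 100% of n.
    bounds, cum = [], 0.0
    for q in quotas:
        cum += q
        bounds.append(round(cum * n))
    labels = [0] * n
    for rank, obj in enumerate(order):
        for cls, b in enumerate(bounds, start=1):
            if rank < b:
                labels[obj] = cls
                break
    return labels

# A lenient judge (mostly 7s and 8s) and a severe one: after forcing,
# their distributions -- though not their orderings -- become identical.
lenient = [8, 7, 7, 8, 9, 7, 8, 6, 7, 8]
severe  = [4, 3, 5, 6, 7, 4, 5, 2, 4, 6]
print(force_distribution(lenient))
print(force_distribution(severe))
```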
However, such methods have their disadvantages. Both the average (or median) and the spread become fixed — with the result that the level of, and degree of variation within, one sample cannot be compared with those of another sample. Further, the judge is forced to introduce distinctions and dividing lines between subgroups in places where he may not want them, and conversely, to ignore differences that he might particularly have wanted to bring out. The dilemma is patent: if the judge is put under too much restraint, there will be a loss of information in some places, or unreliable pseudo-information will be introduced in others (cp. 7; 2; 4); but if he is given a completely free hand, his judgment will be more likely to reflect irrelevant idiosyncrasies. According to the nature of the problem, either the one or the other disadvantage may appear to be the least evil.
Intelligent compromises between a forced and a free distribution are also possible, for instance, by requiring that each of the two extreme classes of the scale must be used at least once. Alternatively, it is possible to require strict adherence to a particular form of forced distribution, but to allow the judge to express, preferably in a coded form, the degree of confidence or certainty with which he makes each judgment. Particularly in the case of comparison judgments — e.g., paired comparison (7; 3; 5) — this method will often provide an acceptable solution. The judge is then forced to state which of two objects he feels to be 'more X' (cp. p. 216), but also to qualify his judgment by adding his degree of certainty. The judge may thus find his task more acceptable, while the researcher will get more information, which he might use to refine his rating scale.
Finally, the loss of information as to the level of the entire sample as a result of the forced distribution method can be compensated for by asking for a separate rating of the group's over-all level. The effect is that 'relative' judgment and 'absolute' judgment are presented and solved
as two separate problems — another decomposition which is generally justified and realistic.1
1 In the case of our illustration, the major question, for 'absolute' judgment, concerns the dividing line between a passing grade and a failing one. This can be regarded as a separate problem, to be solved by providing the judge with separate pertinent instructions. These will contain a detailed explanation and specification — again, in the direction of operationalization — of where exactly the dividing line is to be drawn for 'insight shown.'
7; 3; 4 'Disinterested' judges
The remedies against the contamination effect mentioned in 7; 3; 2 sub 5 are to be found in the choice of judges. This means, in essence, that the judges should have no extraneous interests (5a) and that several, independent judges should be employed (5b, see 7; 3; 3, p. 228). These last two conditions are in fact the crucial ones. All precautions and controls can be rendered practically worthless if the judge is an interested party and if there is any possibility for him to use, consciously or unconsciously, his freedom to further his interests; or alternatively, to resist and possibly overcompensate for that 'temptation,' whether consciously or unconsciously. The possibility of misusing judgmental freedom almost invariably exists. The only really effective safeguard is, therefore, to employ disinterested judges, as many as possible, so as to counter any still remaining subjectivities of whatever nature and origin. It hardly needs saying that a judge like H, who has so many personal interests at stake and who, moreover, so clearly 'knows too much,' should be eliminated in a hypothesis testing investigation. Further, the different judges must naturally work independently. That means, not only that H's grades must not be marked on the essays passed on to S, but also that there must not have been any form of contact or consultation, either direct or through a third party. In our illustration, the grading should be done by experts (historians, sufficiently familiar with the subject matter taught and provided with a complete set of instructions), each working entirely on his own, without any intermediate contacts.
For each of the first four points enumerated in 7; 3; 3 ideal conditions for judging are difficult to realize. This applies even more strongly to the fifth point. Judges are people, and people inevitably have some sort of interest at stake in nearly everything they do — over and above their desire to perform the judging assignment to the best of their ability.
Differences of viewpoint, prejudices, private theories one would like to see confirmed, a disinclination to show one's 'true colors,' or a desire to prove some point: these are universal human propensities which are liable to interfere with the seemingly most neutral judging assignment. Nevertheless, where the judging concerns products of behavior in one form or another — an observer's protocols, recorded test responses, tape-recorded conversations, cinematographic or musical compositions, newspaper articles or reports — it will frequently be possible to work out valid and reasonably objective rating-variables.1 In that case, the judgmental procedure can be repeated ad libitum, and there is the guarantee that at least the materials being judged, the concrete, factual basis, will remain constant.
These conditions are not fulfilled in the judging of individuals or of directly observed situations and events that cannot be repeated. Here, there is no alternative except to rely on the judgment of those who have direct knowledge of the individuals or first-hand experience of the situations and events. But these persons will nearly always be interested parties, while, moreover, their 'factual materials' — their experience of the individuals, situations, or events — are never identical. This is the reason why, as one example, personnel ratings, and as another, the testimony of eye witnesses, whether used in court or as the historian's materials, are so difficult to handle as variables (for the latter aspect cp. e.g., GOMPERZ 1939, 14, on 'Authorities'). The best solution is to have a number of independent ratings performed by different people with different interests: in personnel ratings, for instance, by the direct supervisor, the personnel officer, and preferably also by colleagues ('peer ratings,' cp. e.g., TUPES 1957, quoted in CRONBACH 1960, p. 523) — if not by subordinates. However, the possibility of conscious or subconscious
'conspiracies,' in the sense of mutual contaminations, can never be ruled out altogether; the same applies even to the, usually reassuring, occasions when defense and prosecution witnesses agree in their testimony.
1 In principle, the criteria for judgmental variables do not differ from those for objective instruments and variables, to be discussed in Chapter 8. The reliability criterion (8; 3), however, is here of a dual nature, requiring intra-judge reliability as well as intersubjective agreement among judges. The consistency requirement is entirely analogous (cp. 8; 4). The (construct) validity problem (cp. 8; 2; 3) amounts to the question: To what extent does the judge judge what he is supposed to judge? An important sub-problem, then, is the question as to whether his ratings are contaminated by other factors that are irrelevant to the intended judgment. The problem of contamination requires separate consideration in judging procedures, since it can be solved only through advance design measures: there is no possibility, as there is with objective variables, of later checks, empirical controls, or a detailed analysis of the manner in which the variable was arrived at.
Judgmental problems will always remain precarious, however skillfully they are tackled; judgmental instruments can at best produce dubious variables. But they cannot be dispensed with, if the scope of research in the behavioral sciences is not to be needlessly limited. If they are used, however, the judging procedure is at least as important as any other aspect of the research design (cp. 5; 1), and therefore should be paid scrupulous attention.
7; 3; 5 The judge as subject; paired comparisons
Judges and judgmental procedures have been introduced in this chapter as a substitute for objective methods of measuring variables. In our discussion of the criterion variable (7; 3; 1), however, we already remarked that the researcher may be interested in genuinely subjective variables; that is, not in 'what a thing is like,' but in 'how it is experienced by subjects.' The researcher's concern, then, is not how best to approximate some supposed objective attribute or quality of the object judged, but how best to establish objectively the subjects' judgments themselves (their opinions, perceptions, feelings or preferences). Intersubjective agreement is, then, no longer a requirement: while judges should agree, subjects may of course differ. Intersubjectivity can now be treated as a problem in its own right, which has no bearing on the adequacy of the instrumental realization of the construct. In other words, the problems discussed in 7; 3; 4 (7; 3; 2 sub 5) become no less important, but the corresponding requirements and methods of instrument construction are not applicable as such. The other requirements (7; 3; 3), however, are equally essential for the proper and objective determination of subjective variables. This means that much of what has been said about judgmental (rating) procedures applies in precisely the same way to the determination of subjective variables of perception, judgment, feeling, attitude, opinion, and preference. Since such variables are intensively studied and frequently used in many branches of the behavioral sciences, the controls and precautions discussed in 7; 3; 3 are of even greater importance than originally stated. It may be useful, therefore, to discuss somewhat further the technical side of the problem, by describing briefly one important method by way of illustration.
The method chosen here is that of paired comparisons. Actually, it is only one out of a large number of techniques by which data on judging behavior can be collected. But, it is certainly an important method, which, moreover, can conveniently be discussed with the help of our example: how to judge 'insight shown' from the essay-type answers to an examination question.
For problems of comparative judgment — perception, preferences, etc. — the method of paired comparisons frequently provides a satisfactory solution. It breaks down the judge's task into simple units (see 7; 3; 3 sub 1); it can be combined with methods of elimination and concentration such as rating per-aspect (e.g., 1a, 2a, 3a, etc.; see 7; 3; 3 sub 2), as well as with variations of sequence and/or consistency controls of similar purport (see sub 3). Moreover, it embodies a frequently acceptable compromise between free and forced distribution of ratings (see sub 4). Finally, the constancy (stability) of the ratings per judge and the intersubjective agreement among raters (inter-judge reliability) can be studied very satisfactorily by a number of empirical methods.
In the per-aspect form, the smallest unit of the judge's task is the assignment to state which of two subjects or objects (in our example: essays) he considers 'more X' than the other (cp. 7; 2; 3). The symbol X represents the factor or aspect to be judged; in our example, 'more X' might be: better in representing the essential facts, 1a. Mostly, the instructions will require the subject to make a definite choice for each pair. Without this constraint — which can also be viewed as a mild restriction of freedom in the distribution (7; 3; 3, sub 4) — individual differences might again intrude, now through the relative frequencies of the rubrics 'no opinion' or 'no difference' (ties). But this is, in fact, the only compulsory element in an otherwise natural, psychologically sound, and simple procedure, which yields for each pair i, j information of the type 1ai > 1aj — or, omitting the subdivision of attribute a (7; 3; 3, sub 1): ai > aj. It may be supplemented by a statement of the degree of confidence with which the judge has made his choice.
In this way a great deal of data can be obtained. In fact, it is often the sheer size of such a program that will present an obstacle. If each judge has to rate all the pairs, there are no fewer than 1/2 N(N-1) ratings to be made; for n judges this number becomes 1/2 nN(N-1); for f factors: 1/2 fnN(N-1); for h replications: 1/2 hfnN(N-1) — a number which may easily exceed the limits of practical feasibility. However, it will often be
possible to take some shortcuts, particularly in the number of replications (h = 1 will frequently suffice), as well as the number of pairs presented for rating (cp. e.g., TORGERSON 1960, Ch. 9, 7; GULLIKSEN 1956; GULLIKSEN and TUCKER 1961). The problem of sequence effects (see 7; 3; 3 sub 3) can usually be solved fairly adequately, even within one series, by distributing the presentations of every element ai optimally over the series — either by a systematic or a randomizing procedure.
A major advantage of paired comparisons and similar procedures is that consistency, stability, and intersubjectivity can be empirically tested and controlled. Inconsistencies will become manifest whenever a judge is intransitive: ai > aj; aj > ak; ak > ai. Shortcomings of stability and of intersubjectivity will show up in detail by inversions: for instance, first time: ai > aj; second time: ai < aj — in replications with the same judge or in judgments by two or more judges, respectively. The statistical and inferential analysis of such data can be pursued in various directions (cp. e.g., COOMBS 1964).
In the case of our example, the set of the subjects' per-aspect and per-pair responses does not automatically produce the ultimate scores. The separate data must be combined in some way to produce a final score for the variable ('insight shown'). What kind of combination formula must be used? If it is linear, what weight must be given to each response, each judge, each sub-attribute or factor (1a, 2a, 3a, etc.)? How are intra-judge inconsistencies and inter-judge differences to be allowed for? If the judges have expressed their degree of confidence, how can these additional data be taken into account? Technically, these may not be particularly difficult problems. It will be clear, however, that they can be solved only by making a number of, let us hope, judicious but still somewhat arbitrary assumptions (7; 2; 4), about allowance for error, compensation rules, combination formulas, weights to be assigned, and the like. One important advantage remains, however, namely that these assumptions can and must be made explicit. Assumptions must inevitably be made if overall scores are to be produced; in a differentiated design like that of paired comparisons they must show up and can be considered one by one. It is thus possible to distinguish the information contained in the data from that 'imposed' on them.1
1 The combination problem here touched upon will occupy us again, e.g., in 8; 4 and 9; 3.
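To fix ideas, the following minimal sketch in Python (hypothetical preference data and function names, not from the text) illustrates three points from the foregoing: the 1/2 hfnN(N-1) bookkeeping, the detection of intransitive triads as a consistency control, and one deliberately simple — and, as conceded above, somewhat arbitrary — combination rule: scoring each essay by the proportion of comparisons it wins.

```python
from itertools import combinations

def n_judgments(N, n_judges=1, f_factors=1, h_replications=1):
    """1/2 * h * f * n * N * (N - 1), as in the text."""
    return h_replications * f_factors * n_judges * N * (N - 1) // 2

# Hypothetical choices of one judge over essays 0..3, for one aspect:
# (i, j) means 'essay i was judged more X than essay j'.
prefs = {(0, 1), (0, 2), (0, 3), (1, 2), (3, 1), (2, 3)}

def intransitive_triads(prefs, N):
    """Return all cycles i > j > k > i; each one signals an inconsistency."""
    bad = []
    for i, j, k in combinations(range(N), 3):
        for a, b, c in ((i, j, k), (i, k, j)):   # the two possible orientations
            if (a, b) in prefs and (b, c) in prefs and (c, a) in prefs:
                bad.append((a, b, c))
    return bad

def win_scores(prefs, N):
    """Combination rule: proportion of the N - 1 comparisons won per essay."""
    wins = [0] * N
    for i, _ in prefs:
        wins[i] += 1
    return [w / (N - 1) for w in wins]

print(n_judgments(N=10))                # 45 pairs for one judge, one aspect
print(intransitive_triads(prefs, N=4))  # here: the cycle 1 > 2 > 3 > 1
print(win_scores(prefs, N=4))
```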
7; 3; 6 From expert to formula
Let us return now to the judge as a (substitute) instrument — that is, expressly without generalization to subjective variables (7; 3; 5). The foregoing will have made it clear that efforts to minimize subjectivity in judging procedures often go in the direction of coding and processing by formula; that is, in the direction of an approximation to the 'machine ideal' of objectivity (6; 2; 1). To many, however, far from being an ideal, this calls up bleak visions of a dreaded future. For one thing, it is the technical-mathematical, 'dehumanized' character of this objective approach to science which fills them with chill foreboding. For another, it is the apparent implication that the expert, the man of learning, with his insight and wise judgment in his particular field, is on the way out. To these opponents, the formula is in the nature of a threat. This appears quite clearly, for instance, in the emotional discussions about 'clinical versus statistical prediction' in psychology (SARBIN 1944; MEEHL 1954; HOLT 1958; DE GROOT 1961); to say nothing of Sorokin's above-mentioned affect-laden diatribe against anything that smacks of 'social physics' (SOROKIN 1956).1
From this viewpoint, the substitution of a 'formula' — i.e., an objective procedure for the collection and processing of data — for the expert's judgment is unwelcome. Such opposition is often supported by, and rationalized through, the conviction that no 'dead' mechanical formula will ever succeed in replacing the 'living' expert judgment, born as it is from understanding (Verstehen). But this reasoning stems from a misunderstanding. When judgmental procedures are objectified, the expert's creativity is not suppressed; it is only used in a different way. The expert's non-explicit weighting of aspects and factors, his interpretations, the intuitive hypotheses that are implied in his mode of judging are now transformed, at the earliest opportunity and in an approximative, frequently more or less ad hoc fashion, into a 'machine program,' a formula. This formula will no doubt lack many of the nicer discriminations and subtleties of the expert judgment — and to that extent complete substitution is in fact impossible — but this is offset by very real advantages. The formula is indeed 'dead,' but because it is, it is also reliable and stable, not subject to fluctuations, sequence effects,
involuntarily shifting weightings, contaminations of various kinds, which continually threaten the scientific as well as the practical utility of a 'living' judgmental variable.
1 Continental instances of opposition to objectivity, and even more to a 'machine ideal,' are numerous and widely scattered through philosophical, psychological, and educational writings. A single example, of the phenomenological-existentialist variety, may suffice: SARTRE 1939, p. 3 ff.
In many areas, most dramatically perhaps in that of prediction in psychology (cp. in addition to the literature cited: WILLEMS 1959; VAN DER GIESSEN 1957; DE GROOT 1960; BARENDREGT 1961), it has already been found that these advantages of the formula, for purposes of direct use, are very considerable, both in hypothesis testing and in applied areas. The expert has already been found to be replaceable, in a good many fields. But, that releases him and his valuable time for other purposes, particularly for what SARBIN (op. cit. 1944) defined as his real task, hypothesis formation (see also DE GROOT 1961). Constructing and refining formulas to replace judgmental variables — whether predictors, criteria, or situational variables — must be regarded as important hypothesis-formative activities. They require not only the outcomes of completed research projects and statistical analysis of the judging-variables involved, but also the expert himself and his ideas. Through a systematic, possibly introspective analysis of his judging process, combined with an empirical-statistical analysis of its results — cooperation of expert (e.g. clinician) and statistician — an attempt can be made to translate the expert's standards and distinctions into a formula. In a number of fields it has been found that this can be done successfully, for example, with the rating processes of a committee of experts (c.o.p. 1959, Ch. 3). Formulas evolved in this manner may bear the marks of their tentative origins, and they may look 'awkward' from a theoretical viewpoint. But as far as instrumental realization of a construct is concerned, they have, in addition to the advantages of objectivity and greater reliability, the important quality that they are transparent: it is possible to establish and analyse precisely what is done with the primary data. If these formulas are theoretically awkward, they can be publicly criticized, and on the basis of further information improvements can be suggested — which is impossible in the case of the much less transparent judgment of experts. Developments in many fields of social and behavioral science are moving, and should be moving, from expert to formula. The expert is not 'dethroned' as a result of this evolution. On the contrary, when his judgmental and interpretative processes are objectified, his contribution — ideas, primarily — is utilized to the best advantage.
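What such a translation can look like may be shown in a minimal sketch (wholly hypothetical data, not an example from the text): the expert's holistic grades are regressed, by ordinary least squares, on the objective sub-aspect scores 1a, 2a, 3a of 7; 3; 3; the recovered weights constitute a transparent formula that can afterwards be applied — and publicly criticized — without the expert.

```python
import numpy as np

# Hypothetical objective sub-scores per essay: columns 1a, 2a, 3a.
aspects = np.array([
    [4.0, 3.0, 5.0],
    [2.0, 4.0, 3.0],
    [5.0, 5.0, 4.0],
    [3.0, 2.0, 2.0],
    [4.0, 4.0, 4.0],
    [1.0, 3.0, 2.0],
])
expert_grade = np.array([7.0, 5.0, 9.0, 4.0, 7.0, 3.0])  # holistic ratings

# Append an intercept column and recover the expert's implicit weights.
X = np.column_stack([aspects, np.ones(len(aspects))])
weights, *_ = np.linalg.lstsq(X, expert_grade, rcond=None)
print("implicit weights (1a, 2a, 3a, constant):", np.round(weights, 2))

# The 'formula' now grades a new essay -- reliably and repeatably.
new_essay = np.array([3.0, 5.0, 4.0, 1.0])   # 1a, 2a, 3a, intercept term
print("formula grade:", round(float(new_essay @ weights), 1))
```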
CHAPTER 8
CRITERIA FOR EMPIRICAL VARIABLES AND INSTRUMENTS
8; 1 INSTRUMENTAL UTILITY OF A VARIABLE
8; 1; 1 Relations among basic concepts: a recapitulation
Before we engage in a discussion of the subject proper of this chapter, we shall briefly recapitulate some of our earlier definitions and interpretations in order to cut down terminological uncertainties. We have seen that a construct (or an attribute, concept, or factor) can be represented by a variable; and that any variable can be regarded as representing a construct. In the social and behavioral sciences the relation between construct and variable is, in most cases, not one of complete coverage; often some surplus meaning in the construct can and must be distinguished, i.e., a non-covered area, which may be smaller or greater, and more or less vaguely delimited. If there is any surplus meaning, then the variable embodies 'an' operational definition of the construct — in addition to which there may be other operational definitions. The reverse is also true: an operationally defined variable can, in different contexts, represent different, if naturally related, constructs.
In many statements on scientific research, the term 'construct' ('concept') or 'factor' on the one hand, and 'variable' on the other, are interchangeable. For instance, in the same sentence it is often possible to speak at will of the concept of age, the age factor, or the age variable. However, in the present book the terminological convention has been adopted of referring to a 'variable' only if there is agreement on the type of operational definition to be employed. The manner in which, in the empirical manipulation of the construct, the distinction will be
made between those cases to which it is or is not, or to which it is less or more, applicable must in principle have been settled. Or to put it another way: we speak of a 'variable' only if agreement has been reached about the types of empirical operations by means of which we are going to discriminate between cases in which the variable will assume different specific 'values.' In this connection, 'value' may also be defined as the inclusion in a qualitatively distinguished category or class in a nominal scale. If, for instance, with regard to the religion factor in an investigation, we know in principle how the value (say, RC, Protestant, or No religion) is to be determined case by case, then we can refer to the religion variable. Or to give another example: if, with regard to the construct 'leadership climate,' we know in principle in what way this is to be employed as a varying experimental condition in a series of experiments — if we know, for instance, how 'authoritarian' or 'democratic' leadership will be operationally defined — then we can refer to the 'variable: leadership climate.'1 Or, finally, when we have settled that intelligence and hostility are respectively to be measured by a (particular) test and a (particular) Rorschach index, then we can refer to 'the variables intelligence and hostility.'
However, the operational definition of a construct is not complete until there are exact and complete specifications of the set of instruments to be used, as well as instructions for the operations to be carried out to determine the value which the variable will assume in each concrete case. These must include instructions for the manner (the scale) in which the outcome is to be read (cp. 6; 2; 3). We have termed this complete set of instructions and devices the instrument in the broader sense. In its narrow sense, the term 'instrument' has roughly the usual meaning of a concrete measuring device, a test form, a questionnaire, a set of criteria, and possibly a judge (7; 3; 1). In its broader sense, an instrument always defines one variable2 and hence also one construct. So it may be said that 'the instrument completely defines the corresponding variable' or that 'the construct is operationally defined by the instrument, as a variable.'
1 Naturally, it is here assumed that the construct will be used in the sense of a variable — that is, of a varying entity. If a psychologist wants to run experiments with rats to test a hypothesis on the behavior of rats, then it is indeed of some importance to distinguish 'rat' from 'non-rat,' but the concept 'rat' is not further manipulated as a variable.
2 Some tests (instruments in the narrower sense) yield more than one score (variable). In this chapter, however, we shall not be concerned with such composite instruments, or rather, we shall for instance regard a test which yields n scores as a set of n instruments (cp. also 9; 3). In this chapter 'instrument' will always be intended to mean 'instrument in the broader sense' (corresponding with one empirical variable), unless otherwise stated.
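To make the convention tangible, here is a minimal sketch (the coding rule is entirely hypothetical, not a proposal of the text): the religion construct becomes a variable only because a complete, objective rule fixes, case by case, which of the three nominal values is to be assigned.

```python
def religion_variable(answer: str) -> str:
    """A nominal scale with the three classes used in the text."""
    answer = answer.strip().lower()
    if answer in {"roman catholic", "rc", "catholic"}:
        return "RC"
    if answer in {"protestant", "calvinist", "lutheran"}:
        return "Protestant"
    # Residual class; a real instrument would need explicit rules
    # for every doubtful or borderline answer as well.
    return "No religion"

print([religion_variable(a) for a in ("RC", "Lutheran", "none")])
```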
Again, the terms 'instrument' (in the broader sense) and 'variable' approximate each other very closely. In many statements these two terms are directly interchangeable. This is particularly true for the utility criteria to be discussed in this chapter. It makes no difference, for instance, whether one speaks of the validity (8; 2) of a variable or of the corresponding instrument. There is of course a difference in meaning between the two terms (cp. the footnote on p. 144), which is in fact suggested by the words themselves. 'Variable' suggests primarily the varying empirical, numerical or otherwise coded, measurement results, in a universe, with its distribution and other empirical properties, etc. 'Instrument' primarily suggests the structure of the instrument-in-the-narrower-sense as well as the instructions, and the operations required to determine the value of the variable. Properties of the internal structure (8; 4) are by preference attributed to the instrument, relations to other variables by preference to the variable, although usage is certainly not very consistent. At all events, the scientist does not 'construct' a variable, but an instrument, and in doing so, he must observe certain construction requirements (rules, recommendations), whose content and purport in turn correspond to those of the criteria by which the instrumental utility of a variable is determined (see 8; 1; 2 and following).
Finally, we call a variable 'objective' — an instrument a 'measuring device,' and determining the value of a variable, 'measuring' — if, from a certain point onward,1 all the operations required to determine that value have been objectively arranged, i.e., can in principle be
performed by a clerk or a machine program of, generally single-valued,2 transformations.
1 The addition 'from a certain point onward' is necessary because what we regard as the data, as 'primary' outcomes of observation or registration, may differ from case to case. One can, for instance, base one's research on observation protocols recorded by an experimenter or observer. If these protocols are regarded as the materials from which the variable is to be extracted, then the variable can be 'objective' — from there onward — in spite of the fact that the protocols themselves may be contaminated through the influence of the observer and affected by (systematic) 'distortions' and (accidental) 'noise' phenomena (see 8; 3). But if they are regarded from an earlier point, the variable is not objective owing to the presence of an observer-cum-judge. Similar considerations apply when the data are obtained through recording instruments such as a camera. Here, too, contaminations may have been built into the recording method (e.g., lighting or camera-angle effects), while transmission distortions and noise effects are also liable to occur. In this case, the question whether the variable is 'objective' depends on what one takes as his starting-point, that is, on what are regarded as the primary data.
2 The addition 'generally' is made to take into account the cases in which, e.g., a — likewise objective — randomization device (see 6; 3; 3 and 6; 3; 4) has been built into the measuring procedure. Thus, the instruction to the 'clerk' who is to measure the 'egocentricity' of a writer from the texts of personal letters (cp. 7; 1; 2) might read: do not include all the letters or pages of letters but take only one fifth of them, and pick these by using a table of random numbers. For the term 'single-valued transformation' we refer to 6; 2; 1 and ASHBY 1957.
The discussion in this chapter of the criteria for the instrumental utility of variables will mainly concern objective variables. As stated earlier (in 7; 3; 4, footnote, p. 233), however, these criteria are in principle the same for all types of variables, whether objective or non-objective. In general, the reader should be warned not to assume that the following discussion applies only to test variables. While it is true that the idea of keeping the instrumental qualities of variables under control by means of empirical criteria has been developed and given technical implementation chiefly in test theory, this by no means implies that its importance is confined to this area. Such concepts as validity, reliability, etc., are of general importance for the evaluation of empirical variables, irrespective of their content, origin, function, or form.3
3 If this subject is to be treated with a view to application to a wide range of variables, certain generalizations — and even in some cases modifications of the meaning of certain established notions — are needed. The reader who is familiar with test theory will observe that, for instance, a concept like 'construct validity' will here be given a definition somewhat different from those found in most textbooks of mental testing (see 8; 2).
8; 1; 2 Instrumental utility: definition
The question as to what a variable is 'worth' has so far been discussed mainly in terms of 'relevance' versus objectivity. Thus far, the relative vagueness of the former concept has proved no obstacle to our discussion. But if one wishes to define this notion more precisely, a specification will be needed of the purpose or problem with respect to which something — a variable, a prediction (4; 1; 3), or (an answer to) a question (7; 1; 1) — is considered relevant or irrelevant.
With regard to what goal is the 'relevance' of empirical variables to be discussed here? To begin with, we shall not be concerned with the importance of the construct or concept represented by the variable. There are, of course, more and less valuable, more and less central concepts, both from the social viewpoint (application) and from the viewpoint of theoretical status. This, however, is a matter of content and meaning, which, for one thing, can be judged only within the field in question (cp. the restriction formulated in 1; 3; 2, p. 23) and which, for another, has received some attention elsewhere in this book. So we shall here confine the analysis to the properties of a variable as a representative of a construct-as-intended.
This means, in effect, that we must now introduce an element of comparison into the evaluation question. If the legitimacy of the research goal, as embodied in a construct-as-intended, is not queried, then the issue resolves itself into the question as to how much an instrument (variable) is worth in comparison with other instruments (variables) that purport to represent the same construct and/or have been constructed for the same purpose. In practice, this question will arise, for instance, when a choice has to be made from an available arsenal of tests. In the American 'consumer society' (RIESMAN 1950), this may present a difficult choice problem, which, correspondingly, has received a good deal of attention in the test literature (CRONBACH 1960, p. 96 ff.). But the assessment of the instrumental qualities of a variable is at least equally important for choice decisions in the construction of instruments (in the broader sense) — whether experimental or non-experimental. Generally, since all empirical research demands decisions about setting up, choosing, and evaluating variables, the problems that concern us now are crucial to a host of widely divergent fields.
This leads us to search for viewpoints, criteria, evaluation methods, by means of which a comparative assessment can be made of what an operationally defined, empirical variable is worth as an instrumental realization of a construct-as-intended. Correspondingly, we must search for viewpoints, methods, and controls to aid in the construction of an optimally valuable instrument — 'valuable' again in connection with the construct-as-intended. The answer to the question concerning the usefulness of a variable as an instrumental realization of a construct is, by definition, termed the instrumental utility of the variable.
The word 'utility' has been chosen here because terms like 'value'
(or 'relevance') may easily lead to all sorts of misunderstandings. The 'value of a variable,' for instance, often means the 'value' a variable may assume in a particular case. Admittedly, the term 'utility' is here used rather loosely, in particular without reference to methods of utility measurement.1 Actual utility assessments in the usual sense are, to be sure, possible only if much more is known about the purpose and the conditions of the application of the instrument (cp. e.g., CRONBACH and GLESER (1957) 1965). As regards our concept of instrumental utility, however, it can at least be said to incorporate some crucial and indispensable ingredients of any operational measure of the comparative utility of an instrument. In particular, the functional expression for its calculation must obviously contain parameters representing, respectively, the validity (8; 2), the accuracy (8; 3), and the internal efficiency of the variable (instrument) in question (8; 4). The notion of utility comprises at least these three aspects; the term 'instrumental utility' therefore appears reasonably compendious and adequate.
1 Applying the convention stipulated in 8; 1; 1, this means that we are not yet in a position to speak of 'the utility variable.'
It is to be noted that the above term, construct-as-intended, always requires the mental addition: in a given research context. Cronbach observes (op. cit., 1960, p. 96 ff.) that it is quite meaningless to ask, for instance, what is 'the best intelligence test.' That the concept 'intelligence' is given is not enough: which intelligence test is best in a certain context will depend on the purpose and design of the particular investigation. To some extent, this is a matter of practical considerations (availability of experimental subjects, budget, more generally: practicability), which we must pass by now. However, there are also more basic and theoretical distinctions affecting the viewpoints and methods for assessing instrumental utility, such as the question whether the variable is to serve as a predictor of something else, or as the entity to be measured (e.g., criterion). We shall see later that this distinction between measure and predictor is reflected in different validity concepts (8; 2).
The reader may have noticed that there is a marked analogy between these problems and those encountered in the evaluation of the effectiveness of methods of influencing human behavior, discussed in 6; 2; 2. In the type of 'evaluation' that concerns us now — assessment of the
value of variables — it is also of vital importance to keep the goal (construct-as-intended) closely in view and to define operationally the effects to be obtained, in order to develop empirical measures for instrumental utility (effectiveness).
What an instrument is worth as a representative of a construct-as-intended (in a given research context) will naturally depend on the manner in which it is constructed. In the case of instruments like tests and questionnaires — but also with, say, job evaluation, composite economic indices, average school grades, etc. — elements (components, subscores, separate primary data, or items) typically are combined in a specific manner to produce a 'final score.' What the instrument as a whole, or the corresponding variable, is worth, will depend then (a) on the choice of proper (relevant) items (7; 1; 1) and (b) on the manner in which the responses are ordered, weighted, added up, averaged, or otherwise combined. Not all instruments in the social sciences allow this kind of breakdown; some are, in particular, of a much simpler structure (one trivial example: age, as written on a form by the subject). Generally, however, more than one response or other elementary measure is needed to determine the value (final score) of the variable. In what follows we shall confine the discussion to this basic composite form: items, themselves not yet variables, have been constructed and/or selected; each item is to be scored objectively; item scores are combined according to an objective formula to produce a final score; this final score is the value assumed by the objective variable. The entire operation whereby items are selected or constructed and objective arrangements are made for scoring and combining them, then, constitutes 'the construction of the instrument.' If the items are questions to be answered by experimental subjects or respondents, construction also comprises whatever instructions are needed for the proper administration (cp. 6; 2; 3) of the instrument.
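The basic composite form can be pictured in a minimal sketch (hypothetical items, scoring key, and weights, all fixed in advance): objectively scored items are combined by an objective formula into the final score — the value assumed by the variable.

```python
# Each item is scored by a fixed key (no judge involved from this point on).
answer_key = {"q1": "b", "q2": "d", "q3": "a", "q4": "c"}
weights    = {"q1": 1.0, "q2": 1.0, "q3": 2.0, "q4": 1.5}  # fixed in advance

def final_score(responses: dict) -> float:
    """Objective combination rule: weighted sum of correctly answered items."""
    return sum(weights[item] for item, answer in responses.items()
               if answer_key.get(item) == answer)

subject = {"q1": "b", "q2": "a", "q3": "a", "q4": "c"}
print(final_score(subject))   # 1.0 + 2.0 + 1.5 = 4.5
```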
8; 1; 3 Three construction requirements; three criteria
What requirements of instrumental utility must now be fulfilled by this construction, that is, by the instrument? For the special case of test construction there is an extensive literature on this subject (e.g., GULLIKSEN 1950; LINDQUIST 1959; A.P.A. (1952) 1954, 1955, 1966; CRONBACH 1960). The essential viewpoints set out there may be summarized as
follows: the construction must take place in such a manner that (1) the resulting variable may be considered an acceptable, adequate (valid) representative of the construct-as-intended; (2) the instrument performs the measurement with reasonable accuracy, and (3) the instrument is efficiently organized.
In what follows, however, we shall but rarely discuss the precise manner in which these three requirements govern the process of instrument construction, since this would carry us too far into technical details. Instead, they will be treated chiefly as criteria by means of which, on the strength of empirically obtainable data, the value (utility) of instruments already constructed, and of the corresponding variables, can be measured. It should hardly need stressing that these criteria can of course be applied also to preliminary versions of the instrument, that is, at earlier stages of the construction process.
The first requirement — that the variable must adequately represent the construct-as-intended — is a special form of the question as to how the construct relates to the (operationally defined) variable, a topic which already received some attention in 3; 5; 5 and 6; 2. The question that occupies us now is whether the variable has 'validity,' whether it may be considered a 'valid' representative of the concept. The relation between construct-as-intended (in a given research context) and variable is now determined from a quantitative and an empirical aspect: to what extent is the variable found to be an adequate representative of what was intended by the construct and its instrumental realization?
The second requirement — that the instrument must measure with reasonable accuracy — requires little comment. Here a basic problem is that, empirically, one can determine the accuracy of a measurement only by repeating it a number of times; but if this is done, the stability of what is measured will itself affect the result. These two factors are often difficult to separate, particularly for behavioral variables; both often play a part in what is commonly called the (measurement) reliability of an instrument. They can be distinguished, however, as accuracy and stability (8; 3).
The last requirement — that the internal structure of an instrument must be efficient — can be elucidated simply by pointing out that two equally valid and equally reliable instruments, while yielding equivalent results, may do so at different costs. Compared with the other, one instrument may take more administration time; it may contain
superfluous or unsuitable items (questions) that contribute nothing to the result; it may combine in one score two (or more) distinct, and separable, components; its scoring may be inefficient; etc. Fundamental problems arise when these questions are worked out in more detail. They will be dealt with in 8; 4 under the heading: Internal efficiency and scoring.
At first glance, it may seem that validity, by the above definition, is not only a determinant of the instrumental utility, but virtually identical with that utility. Indeed, what a variable 'is worth as an instrumental realization, etc.' (8; 1; 2) is to a very large extent identical with the degree to which it is 'an adequate representative' of the construct-as-intended. A variable with guaranteed validity for a given purpose will also have guaranteed, albeit not maximal, instrumental utility; an instrument of inadequate validity is indeed worthless — for that purpose. This implies that questions of reliability and internal efficiency (8; 3 and 8; 4) can be only of secondary importance beside the validity question (8; 2).
Still, the identity is not complete; there is room for the two other aspects. As regards internal efficiency, this is obvious: if it can be increased while the validity remains stable, then, by definition (p. 244), the instrumental utility will be increased. For the accuracy (or reliability) of the instrument it is better to adopt another line of reasoning. If it is improved while the validity remains the same, this admittedly has little significance from the point of view of instrumental utility. However, improving the accuracy of an instrument is likely, in general, to enhance the probability of improved empirical validity findings (see 8; 3). Precision in measurement does not make what is measured more important, more adequate, or more valid — an oft forgotten fact, for that matter. But, if what is measured is in principle of importance, then positive validity is more likely to appear with high than with low accuracy instruments.
8; 2 VALIDITY
8; 2; 1 Criterion validity as a simple operational concept
The simplest and most clearcut variant of the validity concept is that of criterion validity. This is involved whenever a variable is expressly intended to predict something else, the latter here termed a criterion variable (cp. 7; 3; 1). What such a 'predictor' variable, itself, represents is then of secondary importance. The better the predictor is found to relate to the criterion, the higher its criterion validity. The correlation between predictor and criterion is therefore of decisive importance and may serve as an operational definition of criterion validity (cp. e.g., KOUWER (1952) 1957, p. 49).
All the same, it is of importance not to lose sight of the difference between the validity value computed for a given sample and the supposed value of the validity coefficient in the universe, which, as a rule, cannot be measured but can only be estimated on the strength of the observed validity outcome. Terminological usage, as well as conceptual discriminations, is often loose on this point; both are frequently called 'the validity or validity coefficient of a predictor.'
In the determination and the analysis of criterion validity all sorts of complications may arise. Sometimes the researcher is interested in eliminating reliability deficiencies of both predictor and criterion in order to discover the strength of the (causal) interdependence of the underlying factors. He may then attempt to estimate the validity of an ideally reliable predictor with respect to an ideally reliable criterion (the 'correction for attenuation,' see e.g., GULLIKSEN 1950, Ch. 9; 8). Or, he may want to have an estimate of the validity remaining after the elimination of the effects of one or more other variables, and thus calculate a 'partial correlation' (cp. e.g., GULLIKSEN, op. cit., Ch. 12). Or again, when he has available a validity coefficient for a sample which itself has been selected in part on factors related to the predictor, he will try to correct for the effect of this selection. For instance, if for his empirical validity research he has available only the group of candidates admitted after selection, he may use these data to estimate the validity coefficient that would have been found if the rejected candidates had also been included in the sample (correction for 'restriction of range,' op. cit.,
Ch. 11; see also e.g., THORNDIKE 1949, Ch. 6). Or, he may be interested, not in the validity of one variable, but in that of a combination of predictors, possibly with optimum weighting (e.g., 'multiple correlation,' cp. CRONBACH 1960, p. 399 ff.). Finally, it is of vital importance, notably in cases where the original validity calculation was of an exploratory nature, to strengthen the basis for an estimate of the validity in the universe by doing a cross validation on a fresh, independent sample.1

All these complications and many others, however, cannot alter the fact that the basic idea of criterion validity is simple and enlightening. It should be borne in mind, however, that the operational conception of criterion validity provides a completely satisfactory answer to the question whether a variable is valid only if: the variable is intended as a predictor (1), notably in a particular research context for a specific prediction purpose (2), which itself is adequately (validly) measurable, i.e., completely covered by the criterion variable employed (3).
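Stripped of the refinements, the underlying computation is elementary. The following minimal sketch is not part of the original apparatus: the data, the reliability figures, and all names are invented for illustration, and Python with NumPy is assumed.

```python
import numpy as np

# Invented validation data: a predictor x (e.g., an aptitude test) and a
# criterion y (e.g., rated job skill), both reflecting a shared factor t.
rng = np.random.default_rng(0)
n = 200
t = rng.normal(size=n)                 # shared underlying factor
x = t + rng.normal(scale=0.8, size=n)  # predictor, with random error
y = t + rng.normal(scale=1.0, size=n)  # criterion, with random error

# Criterion validity, operationally: the predictor-criterion correlation.
r_xy = np.corrcoef(x, y)[0, 1]

# Correction for attenuation: estimated validity of an ideally reliable
# predictor against an ideally reliable criterion, given reliability
# estimates r_xx and r_yy (assumed values here).
r_xx, r_yy = 0.80, 0.70
r_corrected = r_xy / np.sqrt(r_xx * r_yy)
print(round(r_xy, 2), round(r_corrected, 2))
```

The cross validation urged above would simply repeat the first computation on a fresh, independent sample and compare the outcome with the expectation stated in advance.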
8; 2; 2 Criterion problems

Sometimes — albeit seldom — these conditions are indeed fulfilled. Standard examples may be found in textbooks on industrial psychology (e.g., TIFFIN and MCCORMICK 1958, Ch. 5: Aptitude Tests). Suppose that in a workshop certain jobs require a specific skill, which many job entrants clearly have not sufficiently mastered after a few months' training. If it is attempted to solve this problem by advance personnel selection, a valid predictor must be found, e.g., an aptitude test. Two conditions of 8; 2; 1 are thus fulfilled: there is a predictor (variable) and a specific purpose. As for the criterion, if we assume
1 The question may be asked whether validation research is a form of hypothesis testing. The assumption tested, namely that a certain variable is a valid predictor of a certain criterion, can in fact be called a hypothesis — provided the variable was included in advance for this purpose. If this condition is fulfilled, the hypothesis tested would amount to the assertion that there is 'some linear interdependence' or possibly 'some causal relationship' between predictor and criterion. As regards the strength of this relationship, it is true that the attitude of the validation researcher is often exploratory (cp. 2; 2; 3): 'Let us see what we can find'; but then, it is the cross validation which is the real hypothesis testing — provided explicit expectations are stated in advance on the strength of the relationship (e.g., the size of the correlation coefficient).
that an empirical standard can be designed for the (degree of) skill in this particular field, the third condition is also fulfilled. Through determination of the, operationally defined, criterion validity, the question of the validity of the variable — and virtually also that of its instrumental utility — has in principle been solved (but cp. the footnote on p. 253).

In the following example the third condition is not fulfilled: the researcher knows what he wants (to predict), but the criterion is questionable. The illustration has been taken from an entirely different field, so as to prevent a possible fixation on test applications. Suppose a researcher wants to test the dependability of the C-14 method to determine the age of (pre-)historic objects by means of the radioactivity of carbon. This problem can be formulated and investigated in terms of criterion validity.1 The variable (predictor) is then: age according to radioactivity; the criterion: age according to historical experts. The sample is composed of objects of varying and well-established (criterion) age; and the question is whether the radioactivity variable can adequately predict this criterion. This, too, is 'prediction' in our sense (3; 4; 1): predicted are the outcomes of a scientific investigation. The fact that the investigation (establishing criterion ages) has already been performed — or, the fact that we have to do with postdiction — is no obstacle, so long as this does not affect the predictor. So we have a predictor (1), for a specific purpose (2): determination (prediction) of age. But is the criterion satisfactory?

It is possible, of course, to assert that the third condition is indeed fulfilled: the testing was performed on a sample — and thus refers to a universe — for each element of which the age is assumed to be 'well-established.' Although the instrument, upon its proven validity, will be used especially in cases where there is doubt as to the dating — i.e., with respect to a different universe — the procedure is legitimate, in principle (see below, p. 251). However, the assumption that the opinion of historical experts is correct is open to question, even in those cases where the age of the object is said to be 'well-established.' In other words, the question may be asked whether the criterion variable has validity, now with regard to a theoretical, essential criterion: the real age. True, this criterion validity cannot be determined so long as the 'real age' is unknown, but it is possible to question its accuracy and to designate the opinion of historical experts as
1 It can also be formulated in a different way, as will be seen later.
a substitute criterion for this essential criterion. Further, the essential criterion may materialize some day; that is, it is not impossible that a method will be developed that may be regarded as providing a better approximation to the real age than the historians' opinion — for instance the C-14 method! If this method finds acceptance as such — which is by and large the case these days — then the tables are turned in the process of determining the validity of pronouncements on historical age: what was formerly predictor is now criterion; what used to be criterion is now predictor.

Such a reversal of the functions of criterion and predictor is a common phenomenon. It may be of crucial importance in the development of both constructs and instruments. Thus, intelligence tests (or the 'neuroticism' variable) used to be validated from intelligence (neuroticism) ratings by teachers (psychiatrists), whereas nowadays the reverse procedure has become possible. There is something paradoxical in this development; it reminds one of the man who pulls himself out of a bog by his own bootstraps ('bootstraps effect,' CRONBACH and MEEHL 1955; or: 'Münchhausen effect,' WIEGERSMA 1959, p. 119). But the procedure is entirely legitimate, as appears from the C-14 illustration. First, it is possible to choose 'well-established' cases for the testing, and thus to strengthen the original criterion basis, while, secondly, the nature of the new instrument guarantees improved reliability — which, moreover, can be checked empirically (8; 3).

In applied prediction problems, in particular in the area of selection in psychology, the time dimension often plays a role in the criterion issue. The ultimate criterion is often taken to be the final outcome — e.g., professional success, social adjustment — that should really be measured after many years, whereas actually an intermediate (substitute) criterion is employed — e.g., success after one year of training. Whenever this is done (cp. e.g., VAN DER GIESSEN 1957), the usefulness of the validity outcome depends on the ultimate validity of the substitute criterion itself. In some cases the latter validity can be empirically determined at a later stage — provided the data, or even the selection problem itself, does not become obsolete. In research practice, however, intermediate criteria, if validated at all, are mostly supported by correlating them, not with the, generally unattainable, ultimate criterion, but with other, less provisional, criteria. This may lead to what has been called the 'infinite frustration' (GAYLORD, quoted in CRONBACH and MEEHL 1955) of an incessantly repeated search
for a relation to a 'more essential' criterion — if the researcher adheres exclusively to a predictive validity conception.1

So far it has been assumed that, while the prediction purpose could not be converted into a measurable essential criterion, there was at least no uncertainty about what this essential criterion should be (cp. above: real age). Frequently, however, the researcher has not a readily operationalizable conception of what he ('essentially') wants, so that the essential criterion remains vague: pluri-interpretable and often implicitly multidimensional. If, for instance, he attempts to validate predictors for academic success — in fact a relatively simple case — the distinction between 'good' and 'poor' students, between those that are 'fit' and 'unfit' for the chosen study program, is hard to pin down in a satisfactory way. For instance, in the continental situation: Is a student who completes his studies within the officially set time, but with moderate results, a 'better' student than one who, with a retardation of a year or two, is awarded A grades? It has been shown that validities computed on these two criterion types tend to differ greatly (SPITZ 1955). The criterion concept — success, 'proven' capabilities or potential2 — is often interpretable in terms of different objectives and interests: professional success in later life; study achievements proper; adjustment to the peer group, both in school and professional life; or, a concept of success (or capabilities) defined from the individual's viewpoint (TECHN. INST. DELFT 1959, Ch. 9). Multidimensionality of the objective, if resulting in more than one criterion, is not in itself an insurmountable problem; a number of appro-
1 In some cases the validation researcher does not wait at all. He may use a 'concurrent' (synchronic) criterion instead of an intermediate one. In the above selection example, for instance, it would not be the fresh job entrants but already hired employees who were tested and whose test scores would be compared with their present job skills. This method of concurrent validation of an instrument — as distinct from determining its predictive validity — is based on rather strong additional assumptions that are rarely satisfactorily fulfilled. There are cases, however, where the determination of a concurrent validity coefficient does provide useful information (cp. e.g., CRONBACH 1960, p. 104 ff.).
According to our definition of prediction (3; 4; 1), concurrent validation is no less 'predictive' than is predictive validation. Both are subtypes of the general case of criterion validation, in the process of which the validity of a variable is defined operationally as the degree to which it correctly predicts another variable, which is called the criterion.
2 While success can be considered to prove that the person has (had) the pertinent capabilities, failure can not be said to disprove it. For this reason, success criteria hardly ever cover what one would 'essentially' like to have.
priate procedures are nowadays available (see e.g., CATTELL 1964, p. 166 ff.).1 The crucial problem, however, is not multidimensionality but multi-interpretability. Most criterion constructs — success, professional achievement, adjustment, mental health, productivity, creativity — are merely vague, i.e., they are apt to carry extensive and ambiguous surplus-meanings, if they are not thoroughly analyzed in advance. Hence, where problems of criterion validity are concerned — as in the analogous case of evaluation studies (cp. 6; 2; 2) — validation research must start with criterion analysis and choice, i.e., with an operationally directed but thorough analysis of the research goal. The 'criterion problem' (for a discussion see e.g., KELLY and FISKE 1951, and VAN DER GIESSEN 1957) is basically a question of objectives, which must be solved by rational thinking. Common shortcomings are, on the one hand, premature operationalism — i.e., avoidance of thinking, possibly for fear of 'mentalism' — and, on the other, irrational thinking, in the form of quasi-profound but fruitless debates.

The total picture of the validity of a variable becomes even more complex when this variable is actually used in different contexts, i.e., as a predictor of different criteria, in different investigations. If the predictive philosophy implicit in the criterion validity concept is adhered to in that case, one can only enumerate which correlations, in which kind of investigations, in which samples from which universes have been found. A combination formula, a comprehensive assessment of 'the' validity of the variable in predictive terms, is out of the question. In test psychology, certain well-known instruments such as intelligence and personality

1 As a matter of course, these multivariate methods for combining criteria — in fact: for combining vastly differing points of view — into a formula hinge, again, on fresh assumptions that may or may not be fulfilled. In general, the combination problem can best be tackled by making use of the utility concept as a common denominator — 'utility' now in the technical sense, i.e., with proper calculations of costs and gains involved in various possible strategies for selection or placement (CRONBACH and GLESER (1957) 1965). This approach is also useful when there is only one criterion, for that matter. It is even quite useful when outcome values cannot be adequately determined and gains, therefore, not calculated — its main merit being that it forces the researcher to specify his objectives. Even so, we cannot follow CRONBACH (1960) when he makes the validity concept the handmaid of decision-making: 'Validity is high if a test measures the right thing' — so far a fine, elliptical definition of validity, but he continues: 'i.e., if it gives the information the decision maker needs.' In a practical handbook on mental testing such a pragmatic conception may be tenable; however, it does not cover the meaning aspect of (criterion) validity data: they can also be highly important in determining the theoretical significance of a variable regardless of possible decisions.
tests are often used as, and also called, 'predictors'; but they cannot be said to meet the first condition of p. 249: they are not exclusively intended as predictors. This is even more true of a large number of other attributes of persons: yardsticks for scholastic achievement (whether diplomas, grades, or achievement test scores) and for social adjustment, as well as environmental factors (e.g., position among other children in the family; see e.g., SCHACHTER 1959); and of most other attributes, e.g., the effectiveness of a group (FIEDLER 1958), the readability of a text, the difficulty level of a job, the gross national income of a nation — to cite a few rather random examples. All variables of this kind require a different approach to the validity question.

8; 2; 3 Construct validity: measurement versus prediction
Whenever the predictive power of a variable is to be determined, prediction of something else is the main concern in the instrumental realization of the construct. Its substantive content is relegated to the background, to such an extent in fact that it is sometimes said that the predictor actually 'measures' the criterion. For instance, in administering an aptitude test, one 'measures' the subject's aptitude, although it is quite clear that his real aptitude could only be established at some future time. In fact, the difference may be extremely subtle, as in the case of the determination (measurement or prediction?) of the age of prehistorical objects by the C-14 method (8; 2; 2). Apparently, the research design and the underlying, possibly implicit, conception of the problem may determine whether measurement or prediction is involved.

Nevertheless, the distinction becomes quite clear when this viewpoint is adopted: prediction requires, in addition to the variable involved (the predictor), the presence of another variable (the criterion); whereas in the case of measurement the crucial question is how a variable relates to a construct, the construct as intended to embody the attribute or property in question. The analysis of the validity of a variable intended to measure, or nonobjectively to 'determine,' a construct cannot be reduced to a matter of prediction of one or more variables. The issue here is construct validity, that is, the validity of the variable with regard to the construct it represents.

The term construct validity was first introduced in the test literature by CRONBACH and MEEHL (1955). It was coined in connection with
technical problems concerning validation of psychological tests, following discussions within the Committee on Psychological Tests of the American Psychological Association. These discussions concerned the requirements for publication of a new test and its accompanying manual of instructions (cp. A.P.A. (1952) 1954, 1955, 1966). By then the Committee's members, dissatisfied with then current views, had become convinced that the three generally accepted types of test validation and validity — predictive, concurrent, and content (see p. 257) — were insufficient. The term 'construct validity,' and the idea of validating a test with regard to the meaning and the nomological network of the construct (cp. 3; 3; 2), were proposed by the Committee, but the theoretical elaboration of the idea was first presented in the article cited above (CRONBACH and MEEHL 1955), with special reference to 'psychological tests and diagnostic techniques.'1

In line with this specific purpose, Cronbach and Meehl defined a 'construct' as: 'some postulated attribute of people, assumed to be reflected in test performance.' In the article they mention various examples: 'amnesia' as a qualitative attribute, which may or may not apply to an individual, 'cheerfulness' as an attribute which a person may possess to a greater or less degree, and many others. For us, however, there is little reason to restrict the use of 'construct' to personal attributes or personality variables. Construct validity is of essential importance for any empirical variable regarded as the instrumental realization of a construct, a concept, or some otherwise formulated measurement goal, in whatever field of behavioral research.

In test psychology, the introduction of construct validity marked deliverance from a narrowly restricted operationism. Suppose that four tests are available for (compulsive) rigidity as well as psychiatrists' ratings for rigidity, and that the correlations among these five variables are found to be positive. According to the classic view of (criterion) validity, the tests are viewed as predictors of a criterion, i.e., (for instance) the psychiatric opinion. It is of course possible to change the relative positions of predictor and criterion; but the pattern must always be asymmetric. In a case like this, however, such an asymmetric argument of the 'test-should-predict-criterion' type (op. cit., p. 285) is artificial. Neither the
1 In the meantime a considerable volume of literature has appeared, supporting (e.g., LOEVINGER 1957) or attacking 'construct validity' (notably BECHTOLDT 1959); see also CAMPBELL 1960.
psychiatrists' judgments, nor any one test score is 'the' rigidity; all five instruments represent attempts to determine a common factor; all five variables represent, let us hope, some aspects of the construct 'rigidity.'

According to Cronbach and Meehl, it is particularly in the area of clinical diagnosis that validation in terms of specific criteria — that is, criterion validation, in our terminology (8; 2; 2) — is frequently inadequate. The psychologist engaged on work in this field may use a test to provide an estimate of a hypothetical internal process, a hypothetical factor, structure or condition, for which no clear-cut behavioral criterion is obtainable. 'An attempt to identify any one criterion measure or any composite as the criterion aimed at is, however, usually unwarranted' (Technical Recommendations, as quoted in CRONBACH and MEEHL, op. cit.). In other words, the construct unavoidably has a surplus meaning with respect to any empirical criterion (cp. 2; 3; 6).

There is, again, no reason why this line of reasoning — for it is as such, rather than as a specific method, that the notion of construct validity is presented (cp. op. cit., p. 300) — should be confined to test psychology. Cronbach and Meehl themselves cite the example of 'hunger' in experiments of animal psychology; the investigator who endeavors to give a theoretical description of the behavior of 'hungry' rats will almost certainly attribute a wider meaning to this term than that provided by the usual operational definition 'elapsed-time-since-feeding' (op. cit., p. 284). Precisely the same situation obtains with respect to constructs like 'group effectiveness' (FIEDLER 1958) or the 'amount of communication' within a group (BAVELAS 1950) — or with respect to the operational definitions (indices) employed by economists when, for instance, they seek to compare the 'living standards' of various nations. These examples can readily be multiplied: the 'difficulty level' of an assignment, 'sickness' versus 'health,' 'democratic' versus 'non-democratic' procedures, the 'social status' of a profession, etc. The social sciences abound with constructs that have a surplus meaning with respect to whatever operational definition is proposed. It is utterly futile to adhere to an operationism that throws overboard the intended meaning, the explanatory or descriptive idea.
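The symmetric treatment of the rigidity example can be given a small numerical sketch (entirely invented data; Python with NumPy assumed; this illustrates the line of reasoning, not a procedure prescribed by Cronbach and Meehl): instead of casting one variable as criterion, one inspects the intercorrelations of all five variables and their loadings on a common factor.

```python
import numpy as np

# Invented scores: four rigidity tests plus psychiatrists' ratings, each
# assumed to reflect one common factor plus specific random error.
rng = np.random.default_rng(1)
n = 300
factor = rng.normal(size=n)
variables = np.column_stack(
    [factor + rng.normal(scale=s, size=n) for s in (0.6, 0.8, 0.9, 1.0, 1.2)]
)

r = np.corrcoef(variables, rowvar=False)   # 5 x 5 intercorrelation matrix

# Loadings on the first principal component of the correlation matrix:
# each loading indicates how well one variable represents the common factor.
eigvals, eigvecs = np.linalg.eigh(r)       # eigenvalues in ascending order
loadings = eigvecs[:, -1] * np.sqrt(eigvals[-1])
loadings *= np.sign(loadings.sum())        # fix the arbitrary sign
print(np.round(loadings, 2))
```

On this view no variable is privileged as 'the' criterion; each loading merely expresses how well one instrument represents the construct all five are hoped to share.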
8; 2; 4 Contributions to construct validity
There are two crucial questions in the construct validation of a variable: first, what sort of empirical data can furnish contributions to construct validity; second, how can these data be combined to arrive at a statement concerning the (degree of) construct validity of the variable? In other words, how can, for a particular variable, 'evidence from many different sources' be integrated? (Technical Recommendations, quoted from CRONBACH and MEEHL 1955; see also LOEVINGER 1957). What are these sources and how are their contributions integrated? Obviously, it will be harder to give direct and simple answers to these questions if the construct in question has a more hypothetical character (2; 3; 6).

Let us start with a simple case, a construct (concept) like 'arithmetic skill' or 'reading skill.' While it is true that this is also an attribute of a person (child or adult), little if any theory is needed for its operational definition: the instrument, for instance, a simple test of reading rate, is constructed solely to measure the subject's skill in this particular field. How can one determine the validity of such an instrument? Invoking the concept of criterion validity makes little sense here, since no attempt is made to predict something else. But it does make sense to ask: 'Is this a real test of reading (or arithmetic) skill?' — that is, do the test items, and the instrument as a whole, adequately represent the construct, concept, or educational goal in question? That is the validity problem. For instance, are all the main aspects of arithmetic skill, at the educational level in question, represented in it (different types of sums, different operations)? And do they carry their 'proper' weights?

In such a case the validity question is usually approached via a stepwise operationalization process: from educational goal — if that is the starting-point — to representative concept ('arithmetic skill'); next, to a set of subtypes of arithmetic skills; finally, to a theoretical collection of arithmetic problems (items) representative of each of these skills. At this point, the main validity question reads as follows: Can the given series of test items be regarded as a, sufficiently large and adequately differentiated, representative sample from this collection of all possible items? Thus formulated, the problem permits an empirical or statistical approach, and is commonly termed one of content validity. This notion is considerably older than that of construct validity, which Cronbach and Meehl are careful to distinguish from it. It will be clear, however, that according to our definition we have here a simple case of construct validity.
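The stepwise operationalization admits a compact sketch (the blueprint, its weights, and the item pool are all invented for illustration; Python assumed): a test is drawn as a stratified sample from the collection of possible items, and its composition is checked against the intended weights.

```python
import random

# Hypothetical blueprint: subtypes of arithmetic skill and the 'proper'
# weight each is meant to carry in a 40-item test.
blueprint = {"addition": 0.25, "subtraction": 0.25,
             "multiplication": 0.30, "division": 0.20}
pool = {sub: [f"{sub}-item-{j}" for j in range(100)] for sub in blueprint}

random.seed(2)
test_length = 40
test = []
for subtype, weight in blueprint.items():
    test += random.sample(pool[subtype], round(weight * test_length))

# Content validity, in this simple sense: does the composition of the
# drawn sample match the intended weights?
for subtype in blueprint:
    share = sum(item.startswith(subtype) for item in test) / len(test)
    print(subtype, round(share, 2))
```

Whether the weights themselves are the 'proper' ones is, of course, exactly the part of the content validity question that no sampling procedure can settle; it must be answered by analysis of the educational goal.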
Also in the case of instruments where more is required than that they comprise a well-chosen sample of items covering a specific area of behavior, content validity will frequently play an important role, but then as a contribution to construct validity. An intelligence test, for instance, must not contain items based on mere recall, that is, not on that type of memory ALFRED BINET once called 'la grande simulatrice de l'intelligence' ('the great simulator of intelligence'). This is, at least in part, a matter of content. True, it is often difficult to answer the question whether a test item 'really tests intelligence' but, in trying to answer it, one must certainly not forget to consider item content. In fact, whenever an instrument for a newly developed construct is designed, content is the first — albeit rough — criterion applied in the selection of items. The question whether the instrument, qua content, qua coverage of what is intended, answers to the construct, i.e., the content validity question, is of fundamental importance for any instrument that is more than a specific predictor (cp. 8; 2; 2). In test theory there is an unfortunate tendency to ignore this aspect (cp. GUTTMAN's criticism (1953) of GULLIKSEN 1950) because of its resistance to quantification.

Meanwhile our example of the intelligence test provides a good opportunity to illustrate other types of empirical contributions to construct validity. When a new test is constructed which claims to measure intelligence, an obvious requirement is that it shall exhibit a high correlation (say, about r = .80) with other intelligence tests that have already found acceptance as such. In fact, the same applies to, for instance, a reading skill test — in addition to its evaluation in terms of content (as well as reliability and consistency: 8; 3 and 8; 4). This correlation with a fellow-instrument which is supposed to measure the same general construct is sometimes called 'congruent validity.' In our terminology, it is a special form of criterion validity, of importance chiefly for its contribution to construct validity.

Criterion validity will also make important contributions to construct validity other than with respect to congruent variables. We know, theoretically, that intelligence-as-intended is an important factor for 'intellectual' achievements, say in school or college; and we know that intelligence tests usually exhibit a positive correlation with scholastic achievements, both concurrently and with respect to future achievements. Consequently, a new intelligence test is likewise expected to do these things.1 Or, to
1 In this context, the research goal of constructing a new test for (general) intelligence is taken for granted. If a different goal prevails — e.g., to factor-analyze intelligence, or to develop new constructs in the field of cognitive abilities — less 'conservative' validation strategies are called for.
put it more generally: if there are other instruments that embody accepted, albeit different, operational definitions of a construct, then a new instrument is expected to exhibit in all its empirical connections largely the same patterns of relationships as its fellows. If we know, for instance, that the chances for a favorable effect of psychotherapeutic treatment depend, apart from the degree of neuroticism and the nature of the conflict, also on the patient's intelligence — measured according to test A, B or C — then this composite relationship should be demonstrable also for D, if test D is to be given high marks on 'construct validity.'
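Such a pattern check can be sketched as follows (invented data; the tests A through D and the external variable are hypothetical; Python with NumPy assumed):

```python
import numpy as np

# Invented data: three accepted intelligence tests (A, B, C), a new test D,
# and an external variable (scholastic achievement) tied to the construct.
rng = np.random.default_rng(3)
n = 250
g = rng.normal(size=n)                      # underlying ability
tests = {name: g + rng.normal(scale=0.7, size=n) for name in "ABCD"}
achievement = 0.6 * g + rng.normal(scale=1.0, size=n)

# (1) Congruent validity: correlations of D with its fellow instruments.
for name in "ABC":
    r = np.corrcoef(tests[name], tests["D"])[0, 1]
    print(name, "with D:", round(r, 2))

# (2) Pattern of relationships: D should show roughly the same external
# correlation as the accepted tests do.
for name in "ABCD":
    r = np.corrcoef(tests[name], achievement)[0, 1]
    print(name, "with achievement:", round(r, 2))
```

A marked departure of D's row of correlations from those of A, B and C would count against its construct validity, or else against the established pattern itself.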
8; 2; 5 How to assess construct validity: a theoretical problem

If there are no other instruments that represent the construct in question nor any sufficiently established patterns of empirical relationships which the construct must meet, then content validity will often offer the only possible starting-point — perhaps supplemented by criterion validity with respect to a criterion based on ratings. Even then, however, there will usually be a background of theory of greater or smaller complexity and degree of scientific pretension.1 For, constructs are invented to introduce distinctions that are meaningful, i.e., which allow the formulation of expected relationships (hypotheses) that can be tested. This means that in a case like this further empirical elaboration of the nomological network surrounding the construct will make available other criteria by which the construct validity of the variable can be assessed. The empirical findings for the variable must then continue to be in line with expectations based on the theoretical relationships of the construct. If they are not, the probability is that there is something
1 The reader will have noticed that in this book the term 'theory' itself is expressly used in a non-pretentious, non-esoteric, non-exclusive way — in line with the development of terminological usage in applied fields, e.g., educational psychology. In addition, the foregoing discussions have deprived the term 'construct' of its only-for-pure-scientists halo. In the expression 'construct validity,' at least, the meaning of the term construct has been expressly extended to cover 'concept,' 'notion,' and even 'intention' or '(educational) goal.' There are, of course, concepts of higher and lower scientific status, but it does not appear useful to distinguish them by different terms. The validity issue, in particular, must not be restricted to high-status constructs; it is of primary methodological importance all over the range from pure to applied social science.
wrong, either with the theory, or with the construct validity of the variable representing the construct. This may — but need not necessarily — lead the researcher to discard the instrument. It is also quite possible that he does find important and consistent empirical relationships, but that these are different from what was envisaged in the original construct and the original theory. This possibility has been pointed out before (3; 3; 5 and 4; 2; 4). In the research process the construct-as-intended is not a constant entity to which the variable must invariably conform. Empirical findings with certain variables will often occasion the corresponding constructs to shift their meanings, become more sharply focused, be restructured — and sometimes rechristened as well; cp. e.g., the use of a 'lie score' as a measure of 'rigidity' (BARENDREGT 1961, Ch. 12). Changes in the construct itself must, of course, be accompanied by changes in the criteria for the construct validity of the variable. The nomological network then assumes a different structure in its empirical elaboration from that originally envisaged. But, on the basis of this restructured nomological network, it is again possible to arrive at an assessment of the variable's construct validity.

It will be clear by now that the problem of how to make a quantitative appraisal of the construct validity of a variable cannot be solved by means of a simple formula. Apart from the theoretical intentions and implications — which may shift — the assessment will depend necessarily on such highly heterogeneous 'contributions' to construct validity as: evaluations of content validity, congruent validity, predictive and concurrent validity in different populations and with respect to frequently widely divergent criteria (cp. also 8; 2; 2, p. 253), and in general on 'patterns' of empirical findings concerning the variable in question. It is impossible to combine all these into a universally valid formula. At best, a comparative evaluation of these contributions can, in certain cases, produce reasonably well-founded statements of the type: instrument A has better construct validity with respect to construct X than instrument B.

In view of the fact that criterion validity outcomes are frequently used to assess the construct validity of a variable, the question may arise whether the entire criterion validity concept cannot be subsumed under construct validity. Indeed, those cases where criterion validity in itself provides an adequate validity measure could be considered special cases of construct validity, in which the predictive character rather than other
determinants of the meaning of the construct is emphasized. In any event, this view is less sterile than the opposite one, which, as pointed out in 8; 2; 3, long impeded progress in theoretical discussions of the validity problem.1 As stated earlier, predicting a criterion can generally be conceived of as 'measuring,' e.g., aptitude, potential, or promise. Moreover, even in the case of seemingly simple problems of criterion validity, it does frequently make sense to go beyond the analysis and investigation of the predictor-criterion correlation(s). Even problems of reliability (8; 3) and internal consistency (8; 4) may be viewed from the aspect of construct validity, notably in the sense that these, too, are expressly based on what the investigator has in mind, on an analysis of his research goal(s), on the 'construct-as-intended.'2

Construct validation, in fact, takes us back to the heart of the problems of theoretical evaluation discussed in 4; 2. Within that area, however, it is a problem that can be clearly delimited from others and studied separately for each variable and for each instrument. As pointed out in 8; 1; 3, it is useful that the answer be sought both in an expressly empirical and in a quantitative sense. As a viewpoint, it is undoubtedly important, even though — just as in the evaluation of theories — one will often have to be content with empirically based, but not quite cogent, comparative evaluations that must await acceptance by the forum.
1 This is not to say that this opposite view is not tenable. In cases where no measurable 'essential criterion' for a construct is available — i.e., a criterion embodying exactly what we want to measure — it is still possible to maintain that the validity problem could be solved in the predictive sense by computing correlation coefficients (8; 2; 2), if a criterion were available. Both ideas, criterion validity and construct validity, can be maintained philosophically as polar validity conceptions, each of which encompasses the whole range of possible validity concepts. Correspondingly, the somewhat artificial dichotomy of construct versus criterion validity can be developed into a continuous series representing degrees of operationality of the validation procedure. 'X-validity' then would denote: validity (of a variable or instrument) to be discussed, analyzed, or assessed with regard to X — where the term X might represent a series somewhat like this: hunch, notion, intention, rationale, objective, goal, effect, criterion variable (roughly in an order of increasing operationality).
2 Jane Loevinger goes even further and claims that 'a method of test-construction based on construct validation,' if worked out systematically, 'can dispense with test-retest and parallel form reliability' (LOEVINGER 1957, p. 689).
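To make a comparative statement of this type at least reproducible, one crude possibility (a purely illustrative sketch; in practice the expected correlations would be derived from the nomological network, and all figures below are invented) is an index of agreement with the expected pattern:

```python
import numpy as np

# Hypothesized correlations of the construct with three variables in its
# nomological network, and observed values for two rival instruments.
expected   = np.array([0.50, 0.30, -0.20])
observed_a = np.array([0.45, 0.35, -0.15])
observed_b = np.array([0.20, 0.05, -0.45])

# Mean absolute deviation from the expected pattern: a crude comparative
# index of agreement with the network (one contribution, not a proof).
for name, obs in (("A", observed_a), ("B", observed_b)):
    print(name, round(float(np.mean(np.abs(obs - expected))), 3))
```

Such an index supports, at most, statements of the comparative kind described above; it does not yield 'the' construct validity of either instrument.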
8; 3 ACCURACY AND STABILITY; RELIABILITY

8; 3; 1 Differentiation of the measurement scale
The question as to the degree of precision with which an instrument measures may be considered from different viewpoints. A transparent and modern general definition is: An instrument measures the more precisely, the more relevant information, on the average, one outcome provides with respect to the value of the corresponding variable. Further specifications will then be needed, first, of what we shall understand by (quantity of) 'information' and, secondly, of what the term 'relevant' means in this context. For the moment (8; 3; 1), we shall confine our attention to the first problem and to the empirical criteria that can be formulated for accuracy irrespective-of-relevance.

Obviously, this is entirely dependent upon the properties of the scale in which the variable in question is measured. Apart from the type of measurement scale (7; 2; 2), the crucial property is the degree of differentiation between measurement values which the scale permits. 'Accurate measurement' in this sense means that many distinctions between measurement values can be made. If, for instance, within a given body of measurement values, the degree of differentiation is diminished by condensing classes — say by measuring decimeters instead of centimeters, or by allowing ties in an ordinal scale, or by reducing a nominal categorization comprising five classes of religion to a dichotomy (RC versus non-RC) — there will be some 'loss of information'; the measurement scale becomes less precise, less accurate.

An obvious measure for the degree of differentiation provided by the scale1 is the number of categories or classes that are, or can be, distinguished (K). As is well-known, this number is nowadays usually replaced by its logarithm to the base 2. The variety (V) of the scale (cp. e.g., ASHBY 1957, p. 126) — or, its 'nominal uncertainty' — is defined as V = log2 K. Whenever K is a power of 2, V has a very concrete, simple

1 In the area of reliability it is hardly possible to discuss the main ideas and constructs fruitfully without providing at least some information on current ways of operationalizing these constructs. By way of exception, therefore, the present section (8; 3) contains a few indispensable formulas.
meaning, viz. the number of questions that must be answered 'Yes' or 'No' — or in the terminology of Shannon's information theory (SHANNON and WEAVER 1949): the number of binary digits or 'bits' needed — to identify the class within which a measurement result falls. If there are eight numbered classes, log2 8 = 3 questions in fact suffice; e.g.: Does the element belong to one of the first four classes (Yes), or to the second four (No)? Within its set of four, does it belong to the first two (Yes), or not (No)? Within its set of two, is it in the first class (Yes), or in the second (No)? All these questions being answered, the class has been determined.

However, the question of how much information — whether relevant or not — 'one outcome provides on the average' does not depend solely on the number of classes available. It depends also on how often, relatively, each of these classes is used; or, on how often, relatively, each of the possible separate values of the empirical variable actually occurs. In many cases these relative frequencies, pi (with Σpi = 1), for each of the classes i (e.g., with i = 1, 2, 3 ... 8), are also known. In these cases, the measure of differentiation can be refined. One measurement outcome can be viewed as an element drawn from a universe with a known frequency distribution. Obviously, the knowledge that an element belongs to a high frequency class is not very surprising, not very 'informative,' while the knowledge that an element belongs to a low frequency class is much more informative. In line with the definition of V, the amount of information contained in the knowledge that an element belongs to class i can be defined as equal to: log2 (1/pi) = −log2 pi. (Example: for a relative frequency of 1 out of 8, pi = 1/8, so the outcome again equals 3 — as it should.) Now, if we remember that the search was for the average amount of information provided by one outcome, it is clear we must weight the amount of information for each possible outcome (i) by its relative frequency, pi, before we add. These considerations lead directly to Shannon's well-known formula:
H = Σi pi (−log2 pi) = −Σi pi log2 pi,   with the sum taken over i = 1, 2 ... K   (1)

The formula was originally devised for other purposes, and H has been variously called the 'uncertainty' or the 'entropy' of a 'system.' In the present context, H is the average amount of information — irrespective of relevance — provided by one measurement outcome, on a variable measured in K classes with relative frequencies pi (i = 1, 2 ... K).
The H-formula does not provide the only possible, but certainly a very adequate, measure for the differentiation of a given scale-and-distribution. In the exceptional case where the chances are equal for all classes (that is, pi = 1/K for every i), H becomes identical to V. This is also the condition for the pi at which H is maximized. Generally, the more uneven is the relative frequency distribution, the lower is the degree of differentiation of the scale. That this decrease is real can easily be seen from an extreme instance. Suppose that p1 = .9, so that for p2 through pK, together, only .1 is left (Σpi = 1); then in 9 cases out of 10 the outcome will be 'uninteresting.' The average quantity of information provided by one measurement result is considerably smaller, therefore, than when the chances are distributed more evenly over the classes: the scale differentiates rather poorly. These few remarks may be sufficient to demonstrate roughly both the usefulness of the formula in controlling the accuracy of a variable — irrespective of relevance — and the importance of the underlying principles of information theory (for further study, see e.g., SHANNON and WEAVER 1949, ATTNEAVE 1959 and GARNER 1962).

Since, for the determination of H, only the number of classes and the relative frequencies within them are used, it makes no difference whether the scale in question is a nominal, ordinal, interval, or ratio scale. The H-formula can thus be used in each of the four cases. It is only nominal scales, however, for which this operationalization of 'differentiation' can be said to cover fully the concept-as-intended. With metric scales in particular — where it is meaningful to speak of 'distances' between scale values — it seems more logical to use this additional information in determining the degree of differentiation. Hence the usual procedure to define the differentiation or dispersion of a metric scale on the basis of the (universe) mean, μx, and the magnitude of the deviations from it (Xi − μx). As is well-known, the most frequently used measure is the variance of the distribution:

σx² = Σi pi (Xi − μx)²,   with the sum taken over i = 1, 2 ... K   (2)

or its square root, σx, the standard deviation. It should be noted, however, that the variance and the standard deviation can be considered measures of the accuracy of the scale only if the class interval is set equal to unity. Only then is σx a pure measure for the degree of differentiation between classes.
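A brief computation of these measures for the distributions just discussed (Python with NumPy assumed; the eight-class figures are those of the text):

```python
import numpy as np

def entropy_bits(p):
    """H = -sum(p_i * log2 p_i): average information per outcome, in bits."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                       # empty classes contribute nothing
    return float(-(p * np.log2(p)).sum())

uniform = [1 / 8] * 8                  # p_i = 1/K: H equals V = log2 8 = 3
skewed = [0.9] + [0.1 / 7] * 7         # p_1 = .9: mostly 'uninteresting' outcomes
print(entropy_bits(uniform))           # 3.0 bits
print(round(entropy_bits(skewed), 2))  # roughly 0.75 bits: poor differentiation

# Variance measure of formula (2), for class values X_i with interval 1:
values = np.arange(1, 9, dtype=float)
p = np.array(skewed)
mu = (p * values).sum()
sigma = float(np.sqrt((p * (values - mu) ** 2).sum()))
print(round(sigma, 2))                 # sigma_x for the same distribution
```

Note that H and σx need not order two scales in the same way; as the text goes on to argue, they measure different things.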
The two differentiation measures, H and σx, are based on different lines of reasoning; they measure different things.1 In general, the variance or standard-deviation measure for differentiation is preferable whenever the concept of distance has a real meaning and the variations in distance are to be taken into account. However, no simple rules can be given here. Apparently, even the seemingly simple question of scale differentiation — accuracy of measurement regardless of relevance — does not permit one single, simple solution. Or to put it another way, the concept 'differentiation within the scale' or 'accuracy of measurement' apparently has a surplus meaning with respect to any of its operational definitions — even if it is considered with the restriction 'irrespective of relevance.' This will become even more apparent once we drop the restriction and attempt to find out what is the 'accuracy of a given measurement' (8; 3; 2), or 'of a measuring instrument' (8; 3; 3).

8; 3; 2 True value and chance error
In general, we cannot assume that all the information supplied by a measurement outcome is relevant to the purpose for which the measurement is made. In trying to grasp what is meant by 'relevant' in this context, it may again be helpful to conceive of a score value for a variable as a message which is transmitted by means of the corresponding instrument, from 'sender' to 'receiver.' The question to what extent the information received is relevant will then correspond with the question to what extent the message received conforms to the 'true' message, i.e., the message sent.

In information theory, two kinds of transmission errors are distinguished: distortions and noise. Whenever a systematic error is contained in the transmission procedure, that is, when the messages sent and received are not identical but do exhibit a certain systematic, functional relationship, then the term 'distortion' is applied. 'Noise' is the term used when no such functional relationship is assumed: the errors in transmission are considered random errors. This very formulation shows that the distinction between distortion and noise in information theory corresponds with the one between systematic and random errors in statistics.
1 For the mathematical relations between H and σx we refer to SHANNON and WEAVER 1949, pp. 54-56 and ATTNEAVE 1959, pp. 95-96. Apart from V, H and σx, other measures of differentiation exist and are preferable for some types of variables, scales and/or distributions, but these cannot be dealt with here.
We shall not here be concerned with distortions. By definition, they are due to a systematic error which may lie either in the wider complex of the experimental design — outside the responsibility of the particular instrument, so to say — or in deficient (construct) validity of the instrument. If, for instance, there is reason to assume that scores on a questionnaire intended to measure 'authoritarianism' (ADORNO et al. 1950) are influenced by a tendency of respondents to assent to rather than to deny printed statements ('set to acquiescence,' cp. e.g., BASS 1955), then this is a shortcoming of the instrument, which impairs realization of the intended construct (authoritarianism): validity rather than accuracy of measurement is at fault. On the other hand, random errors, such as will arise, for instance, when the respondent answers 'yes' to a certain item when he might as well have said 'no,' or when he makes a slip in writing down or checking the answer, are to be considered 'noise.' They may be regarded as resulting from deficiencies in the accuracy of the instrument.

In terms of measurement, the question is to what extent the obtained value of the variable, in the given scale, differs from the true value, as a result of random errors. The 'true value,' or in quantitative (metric) measurement, the 'true score,' corresponds with the message sent; the 'obtained value' (or 'obtained score') with the message received. These notions are clearly meaningful when applied to methods (instruments) that measure, estimate, or approximate attributes of objects where there is no reason to doubt that they 'have' a true value. This condition is fulfilled, for instance, in the case of distances, dimensions, numbers of objects or events, measures of time or volume, or, to give a nominal example, identification of species in zoology. In these cases the operational definition of the variable as embodied in the instrument is regarded as a method of approximating a true value which can, if perhaps not actually, at least in principle be determined by a more direct, more accurate method, and which 'exists' in any event.

What can be done to establish and counter the effect of random errors if some more precise method of measurement cannot be used? The answer is known from measurement in physics: the effect of random errors — not of systematic distortions — can be decreased statistically by repeating the measurement a number of times. Repeated measurements serve, first, to check whether the observed deviations are in fact random. If so, the distribution of the obtained outcomes must fit — or rather, not be significantly at variance with
— the theoretical distribution which best describes error fluctuations in the given case. There must be one central tendency — determined by the true value — and the scatter around it must conform to the pertinent error model. With measurement in interval scales, the Gaussian model is often adequate; i.e., the outcomes must be grouped according to a normal distribution around the — unknown — true score. Admittedly, the reverse conclusion, that from normality of distribution to chance as 'cause,' is not always warranted; but there is in any event some form of control. If it is then assumed that the deviations are in fact due to random errors, there is, secondly, the possibility of obtaining a better approximation to the true value by taking a large number of measurements. For the interval scale, as is well-known, the mean of obtained outcomes generally provides the best approximation. The degree of approximation to the true value can be enhanced by enlarging the number of repeated measurements to be averaged.

If one attempts to apply this approach to measurement in the social sciences, two peculiar difficulties are encountered. First, the 'true value' can often neither be determined without the instrument in question nor even properly defined independently of the instrument. Or, to put it another way, not only can the 'message sent' be approximated only by means of the message received through one particular mode of transmission, but the very assumption that there is a 'true' signal sent can frequently not be justified. Thus, leaving the stability problem (8; 3; 4) aside for the moment, it seems hardly meaningful, at first sight, to ascribe to a subject a 'true' Wechsler intelligence at the time of testing, different from his score on the test. On the other hand, we cannot but assume that in this type of measurement, too, the end result, the obtained value of the variable, is in part determined by chance elements. So, in a case like this (with an interval scale score), we nevertheless assume that the:

obtained score = 'true score' + error score.

The question, now, is how to define this 'true score' — or, generally, the 'true value' of a variable for a given object of measurement. Mostly, it is defined through repetitions of the measurements themselves. In working out this idea, let us again restrict ourselves to metric scales and the normal distribution model. On the assumption of random errors only, here, as in the measurement of distances etc. (cp. p. 266), the best
approximation must be provided by the average of the largest possible number of measurement results. The assumption of 'random deviations from some true score' only, then, implies that with an increasing number of repetitions this average must converge to a limit. So we can say that the 'true score' is by definition the limit of the mean score. In a formula, if Xim is the m-th measurement result for object i:

True score, Ti = lim (M → ∞) (1/M) Σm Xim,   with the sum taken over m = 1, 2 ... M   (3)

Once the 'true score' has been defined, the error score, Eim, is:

Eim = Xim − Ti   (4)

The degree of unreliability or the standard error of the measurement of object i can now be defined as the standard deviation σEi of the Eim scores.

In weaker scales — nominal, ordinal — the operation of taking the mean is unwarranted, so that this definition cannot be used. In principle, however, there too the extra information yielded by repeated measurements can be used: first, to check whether there are in fact random errors; secondly, to obtain greater certainty in determining, respectively defining, a 'true score'; thirdly, to formulate a probabilistic criterion for the unreliability of an obtained measure.

The reader will no doubt have wondered how the second characteristic difficulty of measurement in the behavioral sciences — thus far not mentioned — is to be solved. We refer of course to the difficulty that measurements of object i can in fact hardly ever be equivalently repeated (M times). The main problem is not that M cannot actually be made to approach infinity — that can be taken care of by approximation methods (for Ti and σEi) — but that M often does not get off the ground at all. First of all, many instruments are organized to determine a particular condition of an individual, a situation, or of, say, public opinion here and now. If one attempts to measure progress in a learning process, group tensions, or political opinions and attitudes, one does not assume that these are stable characteristics. Repeated measurements of such behavior variables are impossible, for the simple reason that the objects to be measured change in and of themselves with time. Secondly, the process of measurement itself often causes an irreversible change in the objects (experimental subjects, respondents, groups): they 'know the
test,' are on guard now, take a different view, or do not 'really have to try' any more. Because of all these difficulties, empirical approximations to Ti and σEi according to the above formulas will but seldom be possible. Frequently, measurement by a particular instrument can at most be repeated once (M = 2); sometimes even that is impossible. Nevertheless, as we shall see in the next section, the above line of reasoning makes it possible to arrive at practicable measures to determine the reliability of instruments.
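The definitional role of repetition in formula (3) can at least be mimicked in a simulation (illustrative only, and possible precisely because a simulated object, unlike a respondent, can be measured as often as one likes; Python with NumPy assumed):

```python
import numpy as np

# One object with a fixed true score T_i; each measurement adds a random
# error, X_im = T_i + E_im. The mean converges to T_i as M grows.
rng = np.random.default_rng(4)
true_score = 100.0
for m in (2, 10, 100, 10_000):
    obtained = true_score + rng.normal(scale=5.0, size=m)
    print(m, round(float(obtained.mean()), 2))
```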
8; 3; 3 Measures for the reliability of an instrument

In 8; 3; 1 we have seen that an instrument measures the more accurately, 'the more relevant information, on the average, one outcome provides with respect to the value of the corresponding variable.' In the same section we have studied briefly, from the viewpoint of information theory, the degree of differentiation within the scale used — irrespective of its 'relevance.' In 8; 3; 2 we have made clear, for the case of one object of measurement, what is to be understood by 'relevance' of information in this chapter. Our concern is emphatically not with the importance or meaning of the variable but with the isolation and possible elimination of pseudo-information caused by chance fluctuations. In measuring an attitude of a single object (subject) in the behavioral sciences such isolation is seldom found to be possible, it is true. But this fact need not disturb us, since we are searching for a criterion for the reliability of the instrument; i.e., for the reliability (relevance) of measurement results obtained with this instrument 'on the average' (see the above definition). The problem is how to conceptualize, operationalize, and estimate the relative amounts of true (relevant) and pseudo-(error) information which the instrument in general provides while at work in its proper universe.

Let us assume, first of all, that M = 2 is possible. That is, we shall suppose that the measuring procedure can be repeated once, and that the objects have neither in the meantime changed in and of themselves nor are still affected by the first measurement. In research practice this may be the case with a multiple choice test or questionnaire designed to measure a (metric) attitude or interest variable from a set of questions which must be answered rapidly. If the number of questions is large enough and if they were answered at a sufficiently rapid pace, the subject may be assumed on a retest after, say, a month to remember little or
nothing of what he did the first time. If so, his second set of responses should again be spontaneous. The two measurements can be regarded as experimentally independent. Secondly, it is assumed that the subject's attitude or interest — the attribute to be measured — has undergone no true change in the relatively short period between test and retest. On these assumptions, the true score (Ti) of any subject i must remain the same. As a result, the difference — if any — between his two obtained scores (Xi1 and Xi2) must be due to error fluctuations. In the hypothetical case of error-free measurement, the two must be the same; in the normal case, they may more or less differ. Our concern is now with how much they are in general at variance with each other, say for N subjects (i = 1, 2 ... N), relative to the score variations due to what the instrument is supposed to measure, namely true score differences among subjects.

If, at this point, we turn to the variance definition of the total amount of differential information (formula (2) in 8; 3; 1), the 'relative amount of true (not pseudo-) information' sought for in the X-scores, generally, is identical with the relative amount of 'true variance' in the total variance of variable X. Now, under certain assumptions (see below), this relative amount is given by the coefficient of correlation between the two sets of scores, Xi1 and Xi2 (i = 1, 2 ... N). In a formula1:
Oj /
CTJ
(5)
Correlation coefficients of this type are accordingly called reliability coefficients. In the hypothetical case of errorless measurement (and differing true scores) the reliability coefficient obviously equals unity. Generally, it will be larger (closer to +1) insofar as the effect of random fluctuations is on the whole smaller. The reliability coefficient of an instrument can, on the above assumptions, be defined as the magnitude of this correlation in its proper universe.

An estimate of this universe correlation can be obtained by calculating it for a sample of, say, N subjects. It should be obvious that the reliability of this estimate depends, just as in the case of validity coefficients, first, on the degree to which the sample is representative and secondly on its size (N). And again (cp. 8; 2; 1, p. 248), both the computation and the estimation of the proper universe value of a reliability coefficient, rXX, may present difficulties that call for corrections.

1 In the present discussion of the rationale of reliability measures the symbols r and σ are used throughout, i.e., for both universe and sample values.
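As an operational sketch of the test-retest estimate (invented scores; the assumptions of this section are taken as fulfilled; Python with NumPy assumed), formula (5) is estimated by a plain correlation:

```python
import numpy as np

# Invented test and retest scores for N subjects: constant true scores,
# independent error components on the two occasions.
rng = np.random.default_rng(5)
n = 150
true_scores = rng.normal(loc=50.0, scale=10.0, size=n)
test = true_scores + rng.normal(scale=4.0, size=n)
retest = true_scores + rng.normal(scale=4.0, size=n)

# Sample estimate of the reliability coefficient r_XX of formula (5):
r_xx = np.corrcoef(test, retest)[0, 1]

# The quantity it estimates: true variance over total variance.
ratio = 10.0**2 / (10.0**2 + 4.0**2)
print(round(r_xx, 2), round(ratio, 2))
```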
For these, we again refer to handbooks of psychometrics (e.g., GULLIKSEN 1950; LORD and NOVICK 1968).

Apart from its interpretation as the 'relative amount of true variance in a variable' — and thus apart from the underlying assumptions, which have not yet been specified — the reliability coefficient as defined here has an obvious operational meaning. It presents a direct estimate of the degree of (linear) agreement to be expected in the universe when the same variable, X, is measured twice. This is often quite informative in itself. If, however, further derivations are made, in particular if for a variable the standard error of measurement is computed, the assumptions underlying formula (5) on p. 270 become critical and must be taken into account. They are, first, that the error scores (Ei), on the first and the second occasion, are not correlated (rE1E2 = 0); and, second, the more debatable assumption that error scores do not correlate with true scores (rTE = 0). Only if they are fulfilled, are true variance and error variance simply additive, that is: