Contemporary Issues in Educational Testing
ISBN 9783111557045, 9789027975218



English Pages 344 [352] Year 1974




ERRATA

Contemporary Issues in Educational Testing Edited by Hans F. Crombag and Dato N. de Gruijter

Name index:
Barten, K., 217
Blankenship, 311
Clearly, T., 131, 135
Hieronymus, A.N., 206
Kagan, I., 223, 225, 276, 279
Millman, I., 125, 133, 136

Should be:
Barton, K., 217
Blankenship, M. E., 311
Cleary, T., 131, 135
Hieronymus, A.N., 106
Kagan, J., 223, 225, 276, 279
Millman, J., 125, 133, 136

Subject index:
contingent coefficient, 111
Instructional Objectives, Exchange, 144, 147
®-coefficient, 114
variability of scores, 142, 243

Should be:
contingency coefficient, 111
Instructional Objectives Exchange, 144, 147
φ-coefficient, 114
variability of scores, 141, 143

Contemporary Issues in Educational Testing

Psychological Studies

Editorial Committee
JOHAN T. BARENDREGT / HANS C. BRENGELMANN / GÖSTA EKMAN† / SIPKE D. FOKKEMA / NICO H. FRIJDA / JOHN P. VAN DE GEER / ADRIAAN D. DE GROOT / MAUK MULDER / RONALD WILLIAM RAMSAY / SIES WIEGERSMA

Mouton • The Hague • Paris

Contemporary Issues in Educational Testing
edited by HANS F. CROMBAG and DATO N. DE GRUIJTER

Mouton • The Hague • Paris

ISBN 90 279 7521 3 © 1974, Mouton & Co, The Hague, Netherlands Printed in the Netherlands

Foreword

This book contains the proceedings of the International Symposium on Educational Testing, held at The Hague, the Netherlands, July 16 to 19, 1973. A little over 150 participants came from 16 different countries to discuss the state of the art in educational testing. The symposium comprised four whole-day sessions. During each morning three invited speakers gave their addresses in plenary sessions; each afternoon participants presented submitted papers in two or three parallel sessions. Twelve invited speakers and over 45 participants contributed actively to the symposium.

Four topics were chosen for the symposium, corresponding to what we considered to be the most promising new or relatively new approaches in the area of achievement testing. The four parts of this book correspond to the four symposium topics.

Not all papers delivered during the symposium are published in full in this book. Since we found it extremely difficult to evaluate, on the basis of abstracts, the relevance of submitted papers to the symposium topics, we practiced a liberal admissions policy. As a result, we received too large and too heterogeneous a collection of papers to publish in one book. Only the 12 invited contributions, and a very small selection of submitted contributions fitting strictly within the symposium topics, are printed in full. Of all other contributions, the abstracts, as published prior to the symposium in the book of abstracts, are reprinted in this book. These, together with the addresses of the authors to be found in the list of participants, should enable readers to request copies of these contributions directly from the authors.

We wish to acknowledge the contributions of Mr C. A. Chorus, executive director, Miss B.M. Broekmeijer, executive secretary, and Miss A.M. Giorgi, executive treasurer of the symposium, to its success. Finally, it should be mentioned that Mr Joseph Froomkin played an important role in planning the symposium. He also served as the chairman of the first day's session.

Leyden, October 1973

Hans F. Crombag, program chairman
Dato N. de Gruijter, associate program chairman

Contents

Foreword

Part 1: NATIONAL EVALUATION OF EDUCATIONAL SYSTEMS

Adriaan D. de Groot
1. The problem of evaluating national educational systems

T. Neville Postlethwaite
2. National evaluation of educational systems: A recent study on the evaluation of 20 educational systems by the International Association for the Evaluation of Educational Achievement

Richard M. Wolf
3. National evaluation of educational systems: New roles and new methods

J. Stanley Ahmann
4. National Assessment of Educational Progress: The measurement of change in achievement

Djien K. Thio
5. Foundation, functions and future of CITO, the National Institute for Test Development in the Netherlands

Part 2: NEW TESTS: CONSTRUCTION AND USE

Sten Henrysson
6. The criterion-referenced test: Its meaning and difficulties

Ingemar Wedman
7. On the evaluation of criterion-referenced tests   108

Stephen P. Klein
8. Procedures and issues in the development of criterion-referenced tests   119

Charles W. Smith
9. Criterion-referenced assessment   138

H.G. Macintosh
10. Moderation of a school-based assessment in integrated studies by means of a monitoring test   148

Part 3: NEW APPROACHES TO TESTING: FORMATIVE EVALUATION AND DIAGNOSTIC

Samuel Messick
11. The role of values in evaluation: Methodological strategies for educational improvement   165

Wynand H. Wijnen
12. Formative evaluation and educational practice   178

Roy Cox
13. Formative evaluation: Interpretation and participation   192

Marshall N. Arlin, Jr.
14. The effects of formative evaluation on student performance   203

Abby G. Rosenfield, William Pizzi, Anthony Kopera, Frank Loos
15. Testing as teaching   218

Part 4: TESTING TO FIT NEW TEACHING TECHNIQUES: GROUPING AND INDIVIDUALIZATION

Richard E. Snow
16. Aptitude-treatment interactions in educational research and evaluation   229

Gavriel Salomon
17. ATI research: For better psychological insights or for better educational practice?   248

Hans F. Crombag
18. Product and process in teaching and testing   256

Mary Lou Koran
19. Improving aptitude and achievement measures in the study of aptitude-treatment interactions   272

W. George Gaines & Eugene A. Jongsma
20. Carroll's model of school learning as a basis for enlarging the aptitude-treatment interaction concept   281

Michael W. Allen
21. Student initiated testing: The base for a new educational system   290

Appendix 1: Abstracts of submitted contributions not published in full   297

Appendix 2: List of participants   329

Name index   337
Subject index   341

PART 1:
National evaluation of educational systems

The use of standardized procedures for achievement measurement has become so widespread that it is now useful to collect data on educational outcomes at a nationwide level, in order to characterize the effectiveness of national educational systems and even to make international comparisons. Different national and international approaches will be discussed.

ADRIAAN D. DE GROOT
University of Amsterdam

1
The problem of evaluating national educational systems

NEED FOR A THEORY OF GOALS

Purpose questions

Lately, I find myself more and more often asking a particular kind of question, namely: To what purpose are we doing what we do? It is a type of question that keeps one busy, if nothing else. For it can of course be applied to any activity - for instance, to the activity of educating young people, or that of maintaining or improving an educational system. In fact, the latter application will be my main theme today, i.e., the theme of goals and objectives of, and in, a national educational system - or NES, as I shall now and then call it. But then, the question has a tendency to recur: To what purpose do we want that badly to set goals and to formulate objectives? If it is that we need them for evaluation, to what purpose, then, do we want to evaluate national educational systems? And so on.

Purpose questions such as these must be distinguished sharply from questions regarding motives. For instance, I shall not go into the problem why - to what 'purpose', if you want - I am in the habit, more and more so, of posing purpose questions. Psychologically this type of problem might be quite interesting. For instance, a tendency such as this one could be ascribed to the approach of old age - when so-called 'last questions', as they are generally known, become more and more intriguing and subjectively important. In addition, other psychological motives or tendencies of a personal nature might be hypothesized - but this is not what we are searching for.

Purpose questions, as I use the term here, are not directed towards the detection of psychological motives, neither personal nor group motives; generally speaking, they aim at workable, rational, explicit goal formulations, to be agreed upon by - or acceptable to - all concerned.

Why evaluate an NES?

Why, then, and to what purpose do we want to evaluate national educational systems? A few introductory remarks about this are in order. The answer is by no means self-evident. I shall not go into the recent history of the idea of evaluation - Dr Wolf will have a few words to say on that topic. But speaking for the Netherlands - and possibly for other European countries - I would like to say this: Only some ten or fifteen years ago the proposition of a full-fledged, comparative evaluation study with regard to NESs would have raised no interest whatever. The Dutch system was then generally thought of, in the Netherlands, as something existing and unalterable; if not exactly God-given, it was at least culture-given and history-based. Many people in education felt, in addition, that we had better be happy with the culture we had, that our NES was among the best in the world, and that our main task was, accordingly, to maintain its high quality - or even to defend it against the attacks of 'the politicians'. And even those who were more critically minded shared with the glorifiers of the system the conviction that it would be all but impossible to systematically change any of its fundamentals. Planned change being out of the question, the possible practical, national benefits of evaluation studies were difficult to see.

Generally, the way in which national educational systems develop and function was then supposed to depend roughly on two groups of factors: first, historical, cultural and political forces which could not be influenced by educationalists; and second, down-to-earth factors such as money, facilities, good teachers, modern technology and plain 'good organization', which need not be studied since it was clear that they just had to be maximized - the more, the better. As a matter of course, one could help developing countries in building up or reorganizing their NESs - but even for that purpose little could be learned, it was thought, from comparative evaluation studies as we have them in mind now.

The situation has changed considerably, in large part due to the IEA studies we shall hear more about at this conference. They were eye-openers, if nothing else.


Now this statement implies a double answer to our purpose question. Apparently we now take for granted that comparative evaluation of NESs will serve two purposes. First, it is supposed to lead to relevant new knowledge about how educational systems function. Secondly, this knowledge is supposed to be practically fruitful and useful: it can be applied in national programs of system innovation.

Two underlying assumptions

Consequently, evaluating NESs will be a rewarding activity to the extent that the two purposes are attainable at all, and can be practically realized. This depends largely on the validity of the two underlying assumptions which have been stated in the summary. Phrased somewhat differently, the first assumption amounts to the question of whether we will be able to get hold of the really fundamental factors and variables - other than just 'maximizers' - which 'make the difference' in the functioning of national educational systems. In other words, shall we be able to design a good, informative and effective theory of how NESs work?

The second assumption is - the summary has it - 'rather one in the nature of a conviction'. It supposes that planned, rational change is feasible, and will be feasible in the future, 'even with regard to a history-based, politically sensitive, and highly complex organization, or organism, such as a European NES'. As regards the Netherlands - with its confused and confusing political process, ruled by short-term compromises between interested parties down to the smallest details - this conviction may be overly optimistic. But this is a local problem only, I hope.

Full cycle includes theory formation

Knowing now what we want to do, or want to be done, and to what purposes, let us take a closer look at the problems involved in evaluating national educational systems.

As we have seen, NESs are rather new objects for evaluative efforts; but the problems to be solved in empirical studies - 'full-fledged' studies, as I have previously called them, in which the value of some complex operation must be estimated on the basis of its outcomes - those problems are not that new. I do not plan to go deeply into the methodology of such studies, but I would like to stress two general points.

First, evaluation studies - empirical, summative evaluations of complex operations (activities or systems) - generally require the full cycle of scientific research. In the summary, the five phases or elements of this cycle, the empirical cycle of hypothesis testing, have been formulated. I shall not now repeat those formulations. They can of course be phrased differently, and they have been phrased differently by others. The point is, however, that in evaluation studies all of the things mentioned must be done by the evaluation researcher himself. For hardly ever are objectives pre-formulated, satisfactory criteria known in advance, hypotheses available in workable form, or operationalizations, designs and data processing methods ready-made. Most important: theory formation is part of the evaluator's job.¹

Secondly, there is a good general reason why the aspect of theory formation deserves our special attention. It is mentioned in the summary. While we, behavioral scientists generally, have become masters in handling those phases of the process where hard criteria can be set - logical, empirical, and statistical criteria - we have remained rather weak theorizers and interpreters. In phases (1) and (5), theory formation and interpretation of outcomes, respectively, 'the criteria of quality, and of importance (relevance), remain necessarily less explicit and more flexible'. As a result, 'quality control' - by the scientific 'forum', or the community of researchers - is much less effective here than it is in phases (2), (3), and (4). 'Accordingly, the problem of adequate theory formation (and interpretation) is the crucial one.' In other words, in evaluation research, the lasting scientific as well as the practical values of research outcomes are likely to depend more on the quality of basic concepts and theoretical propositions than on anything else.

PREREQUISITES AND INGREDIENTS OF A GOOD THEORY

Avoiding thought-avoiding habits

As regards national educational systems, we all are likely to agree with the statement that we do not have much 'good theory' available. The problem is, then, how it can be constructed. Of what ingredients does good theory consist, and how does it come about?

1. This latter point, by the way, was emphasized twenty years ago, most straightforwardly and brilliantly, in a very different field, namely that of the evaluation of counseling and psychotherapy (Rogers & Dymond, 1954).

My general answer to the latter question would be something in this vein: In a case like ours, NES-theory required, good theory can only come about by broad orientation, hard thinking, patient search for the best possible conceptualizations, and in particular great care not to fall into the many traps set by our own thought-avoiding habits.

We all know that we have many such habits. Most of them derive from, or are at least closely associated with, our success stories; that is why the habits stick and the traps appeal to us. I am of course referring to our simplifying schools of thought, methodologies (or rather technicologies) or '-isms'. Many habits, and pitfalls, still derive for instance from the behaviorist disregard of mental processes and contents; others from the measurement ideology, with its tendency to abstract from unmeasurable things as if they did not exist. In this context operationalism, too, must be mentioned, as well as some of the presently fashionable model or systems approaches, with the tendency to apply them as 'straightjackets to educational reality', as the summary has it.

I do not want to be misunderstood. In general, I am by no means 'against' those techniques or even against the pertinent ideologies. The only point I want to make is that by their pre-set abstraction from whole categories of data they are likely to mislead us if our purpose is to design a good encompassing theory of national educational systems.

One other current simplifying tendency of the same type, but from a rather different source, must be registered, namely the tendency to disregard whole bodies of data which for some reason are politically unwelcome. At present, this tendency can be observed in particular in approaches to problems which deal with individual differences. Again, I have nothing against politics.
But here, as well as in the case of the previously mentioned ideologies, it must be stated that really good theories of complex subjects such as NESs cannot be designed on a basis of discrimination, i.e., discrimination against certain kinds of data and ideas.

Theory and methodology for agreement

I shall not go on in this general, critical vein. Rather, I would like to discuss very briefly some of the ideas and conceptualizations we are now trying to work out in our institute, the RITP, as possible contributions to a theory of educational systems.

The project is called Onderwijssysteem (Educational System) and is supported by the National Foundation for Educational Research, S.V.O. One part of the project can be considered an attempt to give an answer to the crucial 'purpose question' - an NES, to what purpose? - by developing a sort of general theory, and methodology, of educational goal analysis.

At this point, the purpose question is likely to re-enter our minds in another form: Why such a theory? In particular: Are there no simpler ways than that of theory development to answer the question of what an NES is good for? And: Is this not a way to get hopelessly involved in politics? These questions lead up to the first point I would like to make.

The reasoning we follow is hardly new and really quite simple. In a democracy, educational goals are to be decided on, in principle, by some sort of agreement. Neither educational experts, nor teachers, students, school boards, parent associations, government officials, taxpayers' representatives, or political parties are supposed to one-sidedly set the goals (and priorities) of an educational system - nor those of a sub-system, a school, or a particular program. However, the way in which agreements are reached in actual practice - if this is an adequate description of the process at all - is far from satisfactory. This is largely due to the apparent complexity of the problem, a complexity which in turn is largely due to the absence of a generally accepted conceptual framework in terms of which goals can be discussed, priorities negotiated, and agreements reached.

Consequently, the situation can be greatly improved by a good theory of educational goals. Such a theory can and should provide a transparent model of the hierarchical structure of goals and subgoals with which we have been acquainted since Benjamin Bloom's taxonomy (Bloom, 1966), and a corresponding, logically consistent terminology. If agreed upon, the theory can provide a rational basis and a common language for goal discussions, and thereby make for better decisions.

Understandability and acceptability requirements

The crucial words in the previous sentence, and in the whole argument, are of course: 'if agreed upon'. The point is that, according to this line of thought, a good theory, by definition, can be agreed upon; i.e., it must be conceived and phrased in such a way that its concepts and terminology are (potentially) acceptable to all parties concerned.


This is a somewhat uncommon requirement. It is based on the consideration that a good theory of goals should be useful not only for scientific purposes - generating testable hypotheses, etc. - but also for practical purposes of communication and democratic decision making. Its basic concepts should be both understandable and acceptable to teachers and politicians, to mention two of the most important parties. The theory should be politically useful without being - or even, precisely by not being - 'political' in nature. This extra requirement is essential for the theoretical conceptions we are trying to develop.

Before getting back to the main track of our argument, one side remark: it might be a good idea to generalize the acceptability and understandability requirements to other scientific theories. Apart from the function of providing new knowledge, to be understood primarily by our peers (other sophisticated scientists), and that of inventing new methods, to be understood primarily by technicians who can apply them, social scientific theorizing generally has a third mission we might better and more explicitly attend to: that of promoting rationality in human communication by providing the larger community with theory they can understand, i.e., with simple, adequate and consistent ideas, concepts, and terminology.

Implementing the tree of goals and subgoals

I am afraid there is no time to say much more about the construction of, or rather the search for, that skeleton of a theory of goals - the goal 'tree', the hierarchical structure of goals and subgoals - than the few words devoted to the subject in the summary. Results of the search may have the form of propositions about parts of the tree, or of construction principles. In any case, in order to contribute to a 'good theory', such results must be, as we have seen, 'understandable and acceptable'.

These requirements can be specified, and if needed strictly operationalized, by working, as we are trying to do, with samples of so-called 'relevant respondents'. As could be expected, the requirements prove hard to meet fully. But there are three good reasons not to give up. First, this is one of those goals which are too important to give up easily; in particular, we do not want to fall into one of those 'thought-avoiding traps'. Secondly, every step towards the goal, even if it is never fully attained, means gain. Thirdly, and this is the theorist's satisfaction in the enterprise, the two requirements

'understandability and acceptability', meant to serve rational agreement as they are, include the traditional requirements for a good theory.

The latter point could be illustrated by going somewhat deeper into the subjects mentioned in the summary: the distinction and the relationship between policy goals for systems and subsystems, as set by boards and governments, on the one hand, and educational objectives, as set by educators, on the other; the problems of how to implement and to operationalize goals, and the need for three types of operationalization: budgetary, organizational (or strategic), and educational (or evaluative); the pertinent question of the validity of those operationalizations; and, last but not least, the problem of how to devise an understandable and politically acceptable set of policy goal formulations to serve rational agreement.

The latter case may provide the best illustration. For first, of course, we do not want more different policy goals than needed; secondly, together they should satisfactorily cover whatever politicians may want from an educational system, or want to weight heavier than others do, in an educational system. In other words, the acceptability requirement implies certain requirements of validity and sufficiency, while the requirement of understandability leads naturally to requiring simplicity or economy of the set of goals. Indeed, all this begins to look like theory construction.

THE COVERAGE PROBLEM

Other things learned?

From here it is only one step to the next problem. Whenever empirical evaluation of NESs is considered, a crucial problem is that of the sufficiency of the set of educational objectives as measured in such a project.

In the late 1950s, while spending a lost half hour in a library, I happened to hit upon an article in which a simple but, I think, well-designed early comparative study was reported. I did not make notes, and I cannot now recall the author or the periodical, but I do remember the subject, the main results, and a particular comment in the discussion. Scottish and Californian children were compared in their achievements in arithmetic; and the Scottish sample beat the Californian sample by about a full standard deviation. The author's comment, then, ran somewhat like this: 'So, Scottish children are better in arithmetic than Californian children


are. In this respect they have learned more in school. It remains to be investigated, however, what other things Californian children may have learned, instead.'

My first reaction at the time was: this reads like a poor excuse. On second thought, however, the author's implicit assumption that those Californian children are likely to have learned something that will compensate for their lack of skill in arithmetic did not bother me any more. Let us just delete the word 'instead' and replace 'Californian' by 'Californian and Scottish'; then we have a succinct formulation of a formidable problem: 'It remains to be investigated what other things Californian and Scottish children may have learned.'

The sum total of subject matters

This is the coverage problem, or the problem of the sufficiency of the set of empirical criteria we have at our disposal. If we do not have ways of measuring or estimating the sum total of all relevant school-learning effects, there may always be unknown 'other things students have learned, instead'. 'Relevant' learning effects are, of course, effects which correspond to accepted educational goals. The crucial question is, therefore, whether we are able to cover the goals with the total battery of our measuring instruments. Posing the question is answering it: we all agree, I suppose, that we are not, as yet, in a position to realize or even to approach such coverage. This is an uneasy problem which is likely to recur in all empirical studies which aim at comparative NES evaluation.

Even if the problem is simplified by considering only those cognitive objectives which correspond to the subject matter of the school program, it is a hard one. In different countries programs differ. In the Netherlands and Belgium, for instance, a substantial proportion of the secondary school population learn three foreign languages, up to a comparatively high degree of proficiency. What do children in other countries learn instead, and how is this to be compared and weighted?

I feel personally that this restricted problem will prove solvable if only we devote enough energy to it. And I feel that it is essential to solve it. Partial outputs of a system must be viewed in the light of its total output. We shall have to construct some reasonable measures or estimates of the sum total of subject matter learning effects for comparable school programs in different countries.

Including non-cognitive objectives

Much more perplexing is the same 'sum total problem' if we go beyond the so-called cognitive domain. It is generally claimed that school education should contribute to the formation of positively valued attitudes and to the development of personality; and it is also claimed that it does so. If these claims are taken seriously, as they should be, the present state of the art in the so-called affective domain gives little hope that we shall soon have anything like an acceptable total effect measure in this broader sense. Consequently, given any finding of difference in any comparative NES study, the way out - 'Yes, but students may have learned something else instead' - remains wide open.

In my opinion, the crucial problem here is not one of measurement technique, as some would maintain, but rather, again, one of theory. As a fundamental element in a 'good theory' of goals, an acceptable definition of the total set of educational objectives is needed. The general question here is the one to which the division into 'domains' - cognitive, affective-attitudinal, and sensori-motor - tries to give an answer: Which categories of effects of school education must be distinguished, so that the resulting set of types of objectives covers economically what we all want good schooling to do?

Whether applied to an NES or to a subsystem - a school, a program - such an acceptable definition, in the form of a set of types of objectives, would provide the indispensable basis for agreement. Its availability would make it possible for teachers, officials, and experts (educational researchers included) to be satisfied that nothing is missing, neither in the cognitive nor in the so-called affective domain.

In view of the differences of opinion between measurement specialists and many educators regarding the measurability of some of the supposedly most important educational objectives, the idea that such a basic agreement could be realized may sound like a Utopian dream. I would like to submit, however, that the ideal is not that far out of reach if only we are prepared to revise some of our theoretical conceptions regarding behavioral goals. In order to give you some idea of what, in our project, we are trying to develop, I would like to present a few propositions.

A NEW APPROACH: FOUR PROPOSITIONS

Not behaviors conditioned, but programs acquired

First, an exclusion for the sake of caution: The propositions do not refer
to the teaching and learning of specific sensori-motor skills, which in some respects may require a different approach.

Proposition 1, then, is that the task of school education is never that of conditioning students. Instead, the task of school education is always to help students acquire behavior dispositions, or programs, which they may or may not make use of, according to their own judgment.

Comment: It will be clear that I am not a Skinnerian - but that is not the point. Apart from the merits of conditioning methods for other purposes, and apart from the social conditioning processes implicit in what happens in schools, I feel we can all agree that among the explicitly stated objectives of general school education there is no room for objectives of the conditioning type. We do not want students to react with 'learned behavior' without insight, without knowing what they are doing and why, and, in particular, without the explicit freedom to react otherwise. I feel this is an acceptable proposition. It should even be quite welcome to most educators, who in their work tend to emphasize the great importance of using one's own judgment. If this point is accepted, it has a number of far-reaching consequences.

Stored and available for conscious use

Proposition 2: Much of our current terminology, and in particular expressions like 'entry behavior' and 'learned behavior', is misleading, since it misrepresents what is learned and is to be learned in school. Instead, we had better talk, for example, about repertoires or arsenals of behavior dispositions, of knowledge possessed, of skills mastered (mental skills included), or possibly of habits acquired - or rather, briefly and generally: of stored programs the student can use and steer himself.

Comment: Actually, teachers do not teach, nor do students learn, 'behavior'. Terms like 'learned behavior' may be useful in fields like ethology - for instance, when the 'innate' and 'learned' behavior of birds are contrasted - but they are not adequate in a discussion of educational objectives. It will be clear that the avoidance of 'dispositional' concepts, such as skills and habits 'possessed', still derives from the old behaviorist taboo. For what is 'possessed' by the student is in some way 'in his mind'. It is true that the influence of the taboo has diminished, in particular since the rise of the new alternative, the information processing approach, but it is still there. We want students to use what they have learned in a rational
manner and with good judgment - that is, consciously - but most educational experts would never say that.

Attitudinal objectives reduce to cognitive ones

Proposition 3: The element of freedom emphasized in the preceding propositions - the freedom to use or not to use one's own programs - excludes so-called attitudinal and emotional 'behaviors' from the list of acceptable educational objectives. Instead, only the cognitive infrastructures of positively valued attitudinal, emotional, and, generally, personality developments can be listed: only the pertinent 'know-hows' can be taught, learned, and required.

Comment: Apart from the fact that we cannot rightfully require a student to develop a certain attitude or a certain emotional sensitivity without hurting his freedom, we have no valid means of checking whether the attitude is real or faked, the emotion authentic or simulated. Education can provide a student with instructive experiences, with the means to develop sensitivity, with social insights and knowledge, with the know-how needed to implement a certain attitude - but then it has to hope for the best for his personality development, which remains his or her own business.

It will be seen that this proposition is not an endeavor to get away from the broader objectives of education. On the contrary, it attempts expressly to include them. But then, in trying to find out to what extent such objectives can acceptably be required and controlled - or evaluated - the analysis leads to a somewhat surprising conclusion. If we take the words 'cognitive', 'knowledge' and 'skills' in a sufficiently broad sense, the result is that all so-called affective-attitudinal educational objectives are cognitive in nature: again, programs learned, to be used and steered by the student. Everything rightfully to be learned in school consists of knowledge (know-how included) and/or of skills (mental skills included). I feel this is a rather reassuring outcome.
On this basis, the task of implementing the list of objectives in an acceptable way no longer looks hopeless, at least.

Learner reports: The basic format

Proposition 4: If propositions 1 and 2 are accepted, i.e., if what is to be learned in school consists of programs stored 'in the mind' of the student, programs he can use consciously, then it follows that learning effects
corresponding to educational objectives can be listed exhaustively in the form of a set of sentences in which the learner reports: 'I have learned that...' (or: 'I have learned how to...').

Comment: In my opinion, this proposition provides a powerful instrument, especially for getting hold of not-easily-measurable objectives. In the early sixties, when a few of us were trying to 'sell' the general idea of educational measurement, and in particular achievement testing, to audiences of Dutch educators and educationalists, a standard objection in the discussion after the talks was this: 'Now, for example, I have learned from my teacher, Mr X, to appreciate German literature. He has really opened my eyes to a whole world, new to me. I feel such an experience is much more important than anything you can possibly measure with your achievement tests.' What answer would you have given in such a case?

I think, at present, the answer should be, first, that this remark is not really an argument against achievement testing: there are, of course, other objectives next to those measurable by ordinary tests. Secondly, this learning effect, this type of 'fundamental experience' as we like to call it now, may not be measurable but, obviously, it is reportable. In terms of our set of sentences, a summary report would be: 'I have learned that there is this fascinating world within our world.' Thirdly, what 'I have learned' regarding this fascinating world can be adequately reported by means of sentences with specific content - including, for instance, a sentence like this one: 'I have learned that I may learn to strongly appreciate something (like German literature) even if in the beginning it did not appeal to me at all.' Fourthly, such an analysis in terms of sentences will provide us, not immediately with a measuring instrument, it is true, but certainly with a valid representation - or rather with a way of implementing and getting hold of those 'other objectives'.
This approach to the problem of the not-easily-measurable objectives appears to promise success. Besides, it may open up another fascinating world within this world.

Two dichotomies: World versus self, rules versus exceptions

Just one more little excursion to give an impression of the latest developments. We plan to work with a standard partition of the set of specific learning-effect sentences. It is based on two dichotomies. First, a logical distinction: one may learn either rules, or exceptions to rules or to expectations - particulars, surprises, unexpected things or possibilities existing. Secondly, one may learn something about the world, or about oneself, including one's own relationships to the world.

Keeping in mind the German literature example, the learning effects regarding this particular world, or any other for that matter, can be broken down into four types. I may have learned:
(a) some of the rules of, or in, that world;
(b) some of the particulars, exceptions, surprises, unexpected things existing - some of the riches of that world;
(c) some of the rules regarding myself - e.g., that I have an affinity to that world, that this is something I am good at, or that I shall always like it, always have it available as something dear to me; and
(d) some important particulars or surprises regarding myself - exceptions to, refutations of, my own prejudices about myself, unexpected possibilities open to me - e.g., that it is not true that I must find all school subjects boring, or that it may happen to me that studying a subject, even an initially unappealing one, leads to being fascinated by it.

It will be clear that this two-by-two classification of learning effects is meant to compensate for a two-fold negligence in our ways of thinking about, and operationalizing, educational objectives. First, we tend to overemphasize the mastery of general laws and rules - things being always so - whether in the form of knowledge or of skills, and to underrate the importance of existential statements and exceptions - 'refutations': things being not as expected - particulars, possibilities. Secondly, we tend to forget that education fundamentally aims at two things: learning about the world and learning about oneself, in relation to the world.
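The two-by-two partition of learner-report sentences can be made concrete with a small sketch. This is purely illustrative - the chapter proposes the taxonomy, not any encoding of it - and every class name, field name, and sample sentence below is invented:

```python
# Hypothetical encoding of the two-by-two partition of learner reports:
# subject (world vs. self) crossed with logical type (rule vs. exception).
from dataclasses import dataclass

@dataclass
class LearningEffect:
    sentence: str          # the learner's "I have learned that ..." report
    about_self: bool       # False = about the world, True = about oneself
    is_exception: bool     # False = rule/regularity, True = particular/surprise

    @property
    def cell(self) -> str:
        """Return which of the four cells the report falls into."""
        subject = "self" if self.about_self else "world"
        kind = "exception" if self.is_exception else "rule"
        return f"{subject}/{kind}"

# One toy report per cell, echoing the German-literature example.
reports = [
    LearningEffect("I have learned some rules of German grammar.", False, False),
    LearningEffect("I have learned that this fascinating world exists.", False, True),
    LearningEffect("I have learned that I have an affinity to literature.", True, False),
    LearningEffect("I have learned that not all school subjects bore me.", True, True),
]

for r in reports:
    print(r.cell, "-", r.sentence)
```

The point of such an encoding is only that each specific learning-effect sentence receives exactly one of the four labels, mirroring the partition in the text.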

RESEARCH AND INNOVATION CONSEQUENCES

I realize that I have given a highly fragmentary exposition of an unfinished job. I owe you some sort of explanation, therefore: What significance do these newly developed theoretical conceptions have in relation to our topic, the problem of evaluating NESs?

First, then, I feel that the transformation of the problem of objectives implied in the four propositions - up to its end result so far, the set of learning-effect statements with its four subsets - will enable educators and measurement specialists to come to agreement. An acceptable formulation
of that 'set of types of objectives' which is to cover economically what we all want good school education to lead to should be within reach. In particular, I am convinced that this approach serves its purpose better than trying to implement 'non-cognitive objectives' by means of attitude tests and personality measures. In the latter case - and apart from other objections - there is just no solid route between the two well-known swamps: that of low validity and fakeability on the one hand, and that of vague idealist declarations on the other.

Secondly, the hoped-for formulations of objectives must be linked to educational practice. This link was patently missing in my exposition, I confess, but we are working on it in another segment of the theory in the making. A fundamental concept in this respect is that of the total examination program. For every given educational program, it should be possible to define the set of actual final requirements students are to meet in such a way that it presents an acceptable operational definition of the objectives of that program.

Thirdly, to the degree that both ideas materialize - that of formulating types of objectives, and that of pinning down total examination programs - we shall have a better basis for formulating NES objectives and for designing corresponding criteria.

Fourthly, those objectives and criteria will have to be checked for adequacy, and possibly supplemented by some new ones, by relating them to the set of acceptable policy goals discussed earlier in this paper.

Finally, I would not like to create the impression that I advocate refraining from attempts at evaluating NESs empirically until this formidable program, or another analogous one, has been completed. As always, the most promising strategy is that of parallel development and interaction of theoretical and empirical studies. As to the latter, it is not necessary to set the research goals as pretentiously as I have assumed them to be.
I am not only referring to studies on the scale of the IEA research, which has proved to be highly instructive and useful - in spite of the fact that many basic questions, such as those treated so far, had to remain wide open. I am also thinking of comparative fact-finding and hypothesis testing on a much smaller scale: bilateral studies on specific types of programs, for instance; national studies in which different programs are internally compared; or studies in which specific output measures of the system or of subsystems are measured against pre-set expectations. As we shall see at this Conference, such studies may all contribute to NES evaluation, if that is the final goal.

And then, of course, the development and implementation of theory itself cannot do without interaction with empirical work.

SOME SUGGESTIONS FOR INDEPENDENT VARIABLES

I have used up nearly all of the available time talking about 'purpose questions': goals and objectives. That is, I have tackled the problem of developing theories and hypotheses about NESs primarily from the angle of the output: the dependent variables, the criteria. In the summary, however, I am afraid I promised to introduce 'a few other notions and variables': 'contracts', the dimension 'uniformity-diversity', 'retentivity', and a concept preliminarily called 'goal purity'. Given our time schedule, I had better offer a few comments on the summary text only.

All four concepts are meant to serve the construction of independent (or possibly moderator) variables deemed of importance. They are considered to represent strong systematic and/or causal factors in the functioning of an NES. As to 'retentivity', I had better postpone comment until after Dr Postlethwaite's exposition - if there is time left for this relatively minor point.

The other three concepts have some properties in common. They are all primarily descriptive and structural constructs, not measurable themselves. They are all easily overlooked, because of their structural nature, which requires a thorough descriptive analysis in the first place - and this is something psychologists and other empirical social scientists are not good at. (I am, of course, referring to our previously mentioned 'thought-avoiding' habits.) Furthermore, in my opinion - based on some experience in analyzing the Dutch system in comparison to others, Anglo-Saxon ones in particular - all three constructs are of rather fundamental importance. In brief, I feel that many of our national frustrations - namely, those over both legally and informally introduced system changes which do not work out as they were meant to - derive from the fact that those much more important system properties have remained intact.
Our educational contracts - within a program: the sets of effective, formal and informal obligations of staff and students towards one another - have always been weak, and they have remained so, especially in post-secondary education. On the other hand, the national uniformity - over different institutions - of admission (and qualification) requirements remains anxiously guarded. For instance, even the idea of universities deciding on their own admission policies has remained inconceivable: such diversity would lead to chaos, in Dutch eyes.

Finally, practically all of our educational institutions are supposed to have multiple missions: general education and permanent selection, including grade repeating and in-course selection; vocational education and preparation for societal change; and, as to our universities, general post-secondary education for growing numbers of students, and high-level professional education, and scientific research. Undergraduate and graduate education are not distinct subsystems. There are no established priorities, nor are clear-cut budgetary, organizational, or evaluative distinctions possible. As a result, different objectives often tend to work out as alibis for one another.

These few comments should be sufficient, I hope, to explain why I feel that variables derived from a thorough analysis of contracts and uniformity-diversity, and especially variables implementing the idea of 'goal purity' for institutions and programs, should be of crucial importance for NES functioning. In order to derive valid operationalizations, educational researchers might need the help of jurists (for 'contracts') and of organizational experts (for 'goal purity'), some of whom might like the idea. We all know that such cooperation is not easily acquired and successfully realized, but that is another solvable problem.

SUMMARY

Evaluating national educational systems (NESs) is supposed to be a rewarding activity on the basis of two underlying assumptions: first, that different systems vary in some fundamental ways; secondly, that a particular system can be varied - changed, manipulated to some extent - by research-based innovative efforts. Generally, the first assumption is taken for granted, but the task remains to find out which differences are 'fundamental' for purposes of comparative evaluation (see below). The second assumption is rather in the nature of a conviction: the optimistic conviction that planned, rational change is feasible, even with regard to a history-based, politically sensitive, and highly complex organization (or organism) such as a (European) NES.

Methodologically, educational research activities called (summative) 'evaluation' - whether of a course, a program, an institution, or a NES -
belong to the general category of hypothesis testing. The social and scientific importance of the results of NES evaluation efforts, therefore, depends on the quality of each of the well-known elements or phases of the empirical cycle of hypothesis testing: (1) theory formation, basic conceptualizations; (2) formulation of hypothesized testable relationships; (3) operationalizations, procedures for the collection of data, resulting variables and predictions; (4) data processing and testing procedures; and (5) interpretation and evaluation of outcomes. Past experience has shown that (methodological) quality control can be applied as a matter of routine, and is most effective, in phases (2), (3) and (4), where hard criteria - logical, empirical, statistical - can be set; and that it is least effective in phases (1) and (5), where the criteria of quality, and of importance ('relevance'), necessarily remain less explicit and more flexible. Accordingly, the problem of adequate theory formation (and interpretation) is the crucial one.

In the case of NES evaluation in particular, an argument against both easy abstractions and premature model or systems approaches is in order. Straitjackets applied too early to educational reality must be avoided. An input-output system conception of a NES is considered adequate; but then, a thorough descriptive analysis of goals and means is called for, in terms of basic concepts for a theory of NESs. A few such basic concepts will be proposed and related issues briefly discussed.

A crucial problem is that of the definition of goals: educational and policy objectives and their operationalizations (budgetary, organizational, and evaluative). Questions regarding the economy and sufficiency of a set of (policy) goal formulations will be raised, along with a discussion of the construct validity of their operationalizations. For instance, can we make do with measuring skills and knowledge (achievement criteria)?
In particular: How can important but not easily measurable objectives be prevented from being left out, in examinations on the one hand and in evaluation research on the other?

For the comparative description of educational systems, subsystems, and programs, a few other notions and variables will be introduced. Systems and/or programs in different countries are considered to differ in terms of the educational 'contracts' students and staff are offered, and are committed to: strong versus weak contracts, selection-free (after admission) versus continually selective contracts, program or course contracts with strict age or time limitations versus contracts open to prolongation and delay decisions. Furthermore, systems differ strongly on
the dimension 'uniformity-diversity' as regards the admission rules and final examination requirements of nominally equivalent programs. An important variable - used in IEA research - is the degree of 'retentivity' (versus selectivity) of NESs and of their subsystems, etc.

An important independent NES variable - not easily operationalized, let alone quantified (like many of the others mentioned in the previous paragraph, for that matter) - would seem to be the degree to which educational institutions and programs within a NES are geared to one main educational goal or are conceived as multi-purpose organizations. A worthwhile hypothesis might be the following: a high degree of 'goal purity' for institutions and programs within a NES is desirable. In other words, a NES is better - more efficient, less conflict-bound, more satisfactory to all concerned, more acceptable - to the degree that its institutions and programs are each more specifically geared to one main, first-priority policy objective.

REFERENCES

Bloom, B.S. (ed.) (1956) Taxonomy of educational objectives: The classification of educational goals. Handbook I: Cognitive domain. New York: McKay.
De Groot, A.D. (1969) Methodology: Foundations of inference and research in the behavioral sciences. The Hague: Mouton.
Husén, T. (1967) International study of achievement in mathematics, I-II: International project for the evaluation of educational achievement (IEA). New York/London: Wiley.
International Association for the Evaluation of Educational Achievement (IEA) (1973) International studies in evaluation, I, II and III. Stockholm: Almqvist & Wiksell.
Rogers, C.R., & Dymond, R. (1954) Psychotherapy and personality change. Chicago: University of Chicago Press.

2

T. NEVILLE POSTLETHWAITE
International Institute for Educational Planning, Paris

National evaluation of educational systems: A recent study on the evaluation of 20 educational systems by the International Association for the Evaluation of Educational Achievement

Evaluation of educational systems, or of the relative merits of educational systems, has for the most part been judgmental, based on qualitative impressions and analyses. Much of the evaluation undertaken within the discipline of comparative education has been of this intuitive nature. As technical assistance in the educational field has increased, there has been an increase in demand for more accurate techniques whereby such assistance in emerging countries can be assessed. At the same time there has been an increased demand in highly developed countries for the evaluation of school reforms, innovations, etc. The National Assessment of Educational Progress (NAEP) in the United States, the plans for continuous 'qualitative' evaluation of the Swedish school reforms, and the evaluation of the earlier introduction of French as a foreign language in the English school system are cases in point.

As interest grows in education as an investment in 'human capital' and as an instrument for bringing about not only economic growth but also social change, particularly in developing countries, so will the need to develop appropriate evaluation techniques. Most of the studies to date on the relationship between education and economic growth have been limited to extremely crude 'output variables', such as enrollment and graduation figures - statistics which, with some justification, could be regarded more as independent than as dependent variables. It can be argued that a more important 'output' is how many children are brought how far by the system in terms of their cognitive achievement, skills, attitudes, leisure pursuits, etc.

It has sometimes been said that the world can be regarded as one big educational laboratory in which a wide range of practices in terms of
structural features and pedagogical methods are employed. Countries can learn much from each other by relating the outcomes of various systems to different patterns of input factors. Educational research conducted multi-nationally, whether by surveys or controlled experiments, has a greater chance of arriving at generalizations transferable to other socio-cultural contexts than studies limited to only one or a few educational systems with rather similar socio-cultural features.

There is a growing tendency among educational policy-makers to require more factual evidence as part of the basis for decision-making. At all levels in an educational system, from the teacher in the classroom through the administrator to the policy-maker, decisions have continually to be made, most of the time on the basis of very little factual information. The teacher, for example, has to make decisions about the amount of homework and the use of methods of instruction and teaching materials. The administrator decides about the proper use of resources, such as teaching staff, the available space, and teaching aids. The policy-maker is faced with problems pertaining to the effect of the age of entry to school, the extension of compulsory schooling, or the changing of the school structure at a certain level. On the basis of scarce evidence it is difficult to assess the implications and likely effects of such changes. However, multi-national evidence produced by survey research can help by providing information that may cast some doubt upon established convictions and by producing evidence which can be generalized from one context to another. Several national surveys have been carried out, for instance in the United Kingdom and Sweden, in connection with the preparatory committee work which has preceded important educational reforms. The evidence from such surveys has undoubtedly influenced educational policy-making in these countries.
Among the independent variables within a given national system there are some which display wide variability; many others, however, show little or no variation at all - for instance, age of school entry and school structure. It is only when the school systems of the world are examined that wide variation in school organization, teacher training, curriculum content, classroom practices, the social and economic environment of the schools, and so on, can be seen. It is in this context that multi-national educational surveys can be carried out - surveys which take advantage of the greater variation observed in both independent and dependent variables across the national educational systems of the world. By using the same methods and instruments throughout the survey, a series of replications can be provided which in
its turn can increase the degree of generalization of the findings. The International Association for the Evaluation of Educational Achievement (IEA) has undertaken research in 22 different school systems whereby the variance both between schools and between students has been measured in various outcomes at various levels in each of the school systems. Since there were a certain number of common items between the output measures used at each of the levels, it was also possible to measure 'growth'.

This paper aims to present: (a) a short history of the IEA work; (b) certain aspects of the methodology; and (c) certain results from the most recent six-subject survey.
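The between-school versus between-student variance just mentioned can be illustrated with a back-of-envelope decomposition: total score variance splits exactly into a between-school component (variance of school means) and a within-school component. The school labels and scores below are invented toy data, a sketch of the arithmetic rather than of IEA's actual analysis:

```python
# Toy variance decomposition: total variance = between-school + within-school.
schools = {
    "A": [52, 55, 49, 60],
    "B": [70, 68, 74, 66],
    "C": [58, 61, 57, 64],
}

all_scores = [s for scores in schools.values() for s in scores]
grand_mean = sum(all_scores) / len(all_scores)

# Total variance of all students around the grand mean.
total_var = sum((s - grand_mean) ** 2 for s in all_scores) / len(all_scores)

# Between-school component: squared deviations of school means from the
# grand mean, weighted by school size.
between = sum(
    len(scores) * ((sum(scores) / len(scores)) - grand_mean) ** 2
    for scores in schools.values()
) / len(all_scores)

# The remainder is the within-school (between-student) component.
within = total_var - between

print(f"between-school share of variance: {between / total_var:.2f}")
```

A large between-school share suggests that which school a student attends carries much of the variation in outcomes; a small share suggests most variation lies between students within the same school.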

HISTORY OF IEA

Exploratory work

During the middle 1950s, educational researchers, in attempting to deal more adequately with such problems as failure in school, examinations, and evaluation, became increasingly aware of the need to establish evaluation techniques which would be valid internationally. It became apparent that there were many questions that could not be answered by a national survey, because it was considered undesirable or impractical to experiment, and insufficient variation appeared within national systems to allow answers to be inferred. In order to assess the feasibility of large multi-national surveys, a small pilot study was undertaken by a group of researchers in 1959. The countries taking part included Belgium, England, Finland, France, Germany, Israel, Poland, Scotland, Sweden, Switzerland, the USA and Yugoslavia. The target population was all children aged 13:0 to 13:11, since this is the last point at which all of an age group are still in school in all those countries. A judgment sample of almost 10,000 children, covering eight languages, was administered tests of reading comprehension, mathematics, science, geography, and non-verbal ability. This pilot study (Foshay, 1962) not only demonstrated the feasibility of a multi-national educational survey, but also provided information which was useful in the generation of hypotheses for future IEA surveys.

Phase I - Mathematics

In late 1960 the IEA Council was formed, and work began immediately on Phase I, IEA's first major project. In each of the countries the study was carried out by a so-called National Center, usually a university institute with research capabilities. The countries in this phase of the IEA work included Australia, Belgium, England, Finland, France, Germany, Israel, Japan, the Netherlands, Scotland, Sweden, and the USA.

The main objective of the study was to investigate the 'outcomes' of school systems by relating the output measured by the test instruments to a large number of input variables considered to be relevant. In the conceptual stage of the project, education was considered to be part of a broader social, political, and economic system. Thus, to compare cognitive achievement and attitudes in isolation from this broader context would be meaningless. Not only do countries differ with respect to output (cognitive and non-cognitive), but they also differ with respect to a wide variety of inputs, such as economic resources, urbanization, social backgrounds of children, education of parents, training of teachers, structure of the school systems, etc.

Viewing education in this broad social, political, and economic context, a number of hypotheses were formulated which were considered to be of importance to all participating countries. These fell into three categories: (1) hypotheses concerning school organization, selection, and differentiation; (2) hypotheses concerning curriculum and methods of instruction; and (3) hypotheses concerning sociological, technological, and economic characteristics of families, schools or societies. Of course, measurement or indication of these variables was often rather crude, but it was recognized that, on the basis of the first study, more refined methods of quantitative assessment could be developed.
Ideally, a study undertaken to test the hypotheses that were generated would involve longitudinal administration of a comprehensive battery of evaluation instruments to samples of students over time. However, at that time no international achievement tests existed, and the administrative machinery required to conduct a longitudinal study was considered prohibitive. Therefore, it was deemed necessary to restrict both the subject matter tested and the duration of the study.

For a variety of reasons it was decided that mathematics would be the first subject area to be investigated. In the early 1960s most of the countries participating in the IEA study were particularly interested in improving their scientific and technical education, the basis of which is
mathematics. There appeared to be international agreement on the aims, contents, and methods of mathematics education, and, finally, some of the IEA countries were already participating in international programs or conducting their own research programs in mathematics education.

The definition of the target populations was complicated. It was decided to test at two major terminal points in each country, namely the last point at which approximately 100 per cent of the age group was still in full-time schooling, and the pre-university year. However, because it was difficult to select populations which were comparable in terms of their place in the educational structure, it was finally decided that the target populations would be: all 13-year-olds (Population 1a); all pupils in the grade containing the majority of 13-year-olds three months before the end of the school year (Population 1b); students in the pre-university year who were currently studying mathematics (Population 3a); and students in the pre-university year who were not currently studying mathematics (Population 3b). One intermediate population, students in the terminal year of compulsory schooling, was optional.

Probability sampling was used. Thus, the central sampling problem was to secure representative samples from the target populations. A stratified two-stage probability sampling design was employed in most countries, the first stage being the selection of schools and the second stage being the selection of students within schools.

The preparation of a comprehensive international test battery required the joint efforts of experts in the teaching of mathematics and experts in mathematics testing. The overall objective was the construction of an internationally valid set of instruments with a wide range of content and objectives, not merely a common core of subject matter.
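A stratified two-stage design of the kind described can be sketched schematically: schools are drawn at random within each stratum, then students are drawn at random within each selected school. The strata, school counts, roster sizes, and sample sizes below are all invented for illustration and do not reproduce any country's actual IEA sampling plan:

```python
# Illustrative stratified two-stage probability sample (toy parameters).
import random

random.seed(42)  # fixed seed so the sketch is reproducible

# Schools grouped into strata (e.g., by region or school type).
strata = {
    "urban": [f"urban_school_{i}" for i in range(40)],
    "rural": [f"rural_school_{i}" for i in range(60)],
}

def two_stage_sample(strata, schools_per_stratum, students_per_school, roster_size=300):
    sample = []
    for stratum, schools in strata.items():
        # Stage 1: simple random sample of schools within the stratum.
        for school in random.sample(schools, schools_per_stratum):
            # Stage 2: simple random sample of students within the school
            # (here each school's roster is simulated with generic IDs).
            roster = [f"{school}_student_{j}" for j in range(roster_size)]
            sample.extend(random.sample(roster, students_per_school))
    return sample

students = two_stage_sample(strata, schools_per_stratum=5, students_per_school=20)
print(len(students))  # 2 strata x 5 schools x 20 students = 200
```

In a real survey the stage-1 draw would typically weight schools by enrollment and the resulting data would carry sampling weights; this sketch shows only the two-stage structure itself.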
An international committee brought together reports prepared in each of the participating countries on the content and objectives of mathematics education for pupils from the 13-year-old level to the pre-university level. Most reports also provided examples of appropriate test items at various levels. From this material the committee prepared outlines and ultimately assembled a preliminary pool of test items. Some 640 test items were checked for mathematical correctness, precision of statement, and appropriateness for testing significant outcomes. From this pool, preliminary tests were prepared and circulated to each of the national centers for comments and criticisms. On the basis of these reactions, a number of changes were made in the formulation of specific items and some items were added, with the result that in the end there were 14 different preliminary versions of the

National evaluation of educational systems

33

tests. In order to have some empirical basis for selecting items for the final versions of the tests, each of the preliminary forms was pre-tested on judgment samples in at least four of the countries. The pre-tested items were analyzed to find the proportion of students choosing each alternative answer, the difficulty index, and the discrimination index. Using this information, an editorial committee met in 1963 to draft the final form of the test. In all, nine one-hour test units were prepared. About 85 per cent of the items dealt with standard topics in arithmetic, algebra, geometry, analysis, and calculus, while approximately 15 per cent dealt with sets, probability, logic, and other less standard topics.

In addition to the achievement tests, some non-cognitive outcomes of education were also measured. These included attitudes toward mathematics as a process, the place of mathematics in society, school and school learning, man and his environment, and, finally, the difficulty of learning mathematics. Also elicited from the students were descriptions of mathematics teaching and learning, and of school and school learning.

Questionnaires were produced for students, teachers, and school headmasters, and an education expert in each country completed a national questionnaire. Information collected on the student questionnaire concerned grade, sex, age, size of mathematics class, amount of mathematics instruction and homework, father's and mother's education and occupation, aspirations and expectations for further mathematics, further schooling and future occupation, best and least liked subjects, examinations taken, and extra-curricular mathematics activities. Information collected from the teacher questionnaire included teacher certification in subject matter and professional training, teaching experience, recent in-service training, experience in 'new mathematics', and teacher autonomy.
Information collected from the headmaster concerned school enrollment, the number of male and female full-time teachers, the number of trained mathematics teachers, type of school, the amount of educational expenditure, the age range of pupils in the school, and school finance. The education expert in each country gave information about the number of pupils in full-time schooling according to school type, the selection process, compulsory schooling, economic data to determine the degree of economic, industrial, and technological development, and sociological data to determine the role of women in society. In all, 132,773 students, 13,364 teachers and 5,348 headmasters participated in the study, yielding some 50 million bits of information.
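The item statistics computed at the pre-testing stage (the proportion of students choosing each alternative answer, the difficulty index, and the discrimination index) can be sketched as follows. This is a minimal illustration of classical item analysis, not the study's actual programs; the upper-lower 27 per cent discrimination method and all variable names are assumptions.

```python
# Classical item analysis for one multiple-choice item: the proportion
# choosing each alternative, the difficulty index (proportion correct),
# and an upper-lower discrimination index. Illustrative sketch only.
from collections import Counter

def item_analysis(responses, key, total_scores, tail=0.27):
    """responses: chosen alternative per student (e.g. 'A'..'E');
    key: correct alternative; total_scores: each student's total score."""
    n = len(responses)
    alternative_props = {alt: c / n for alt, c in Counter(responses).items()}
    difficulty = sum(r == key for r in responses) / n   # proportion correct

    # Discrimination: p(correct | top 27%) - p(correct | bottom 27%)
    order = sorted(range(n), key=lambda i: total_scores[i])
    k = max(1, int(round(n * tail)))
    lower, upper = order[:k], order[-k:]
    discrimination = (sum(responses[i] == key for i in upper) / k
                      - sum(responses[i] == key for i in lower) / k)
    return alternative_props, difficulty, discrimination
```

An item that strong students answer correctly and weak students miss scores near 1 on discrimination; a difficulty index near 0 or 1 marks an item as too hard or too easy.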


The data from each of the test instruments and questionnaires were transferred to punched cards and ultimately to magnetic tape. Statistical analyses were carried out at the University of Chicago Computer Center. The writing up of the findings was a collective effort which involved reviewing the literature and compiling selected references, examining the computer output, and helping in the further processing and interpretation of the data. Most of the interpretation of the analysis was done during the first half of February 1965 at a meeting at the University of Chicago. The work was organized so that those writing on hypotheses could hold a 'dialogue' with the computer: a writer was able to ask for a particular analysis one evening and receive it the next morning. In addition to the international analysis, national analyses were also carried out. The study and its results are reported in two volumes (see Husen, 1967). Today these test items and data are available for research purposes.

Phase II - Six subject areas

Building on the experience of the mathematics study, it was decided in 1966 to evaluate educational achievement in science, reading comprehension (including reading speed and a word-knowledge test consisting of antonyms and synonyms), literature, French as a foreign language, English as a foreign language, and civic education. As in the mathematics project, not only were cognitive instruments produced, but also attitude scales and questionnaires. The countries that participated in Phase II were Australia, Belgium, Chile, England, Germany, Finland, France, Hungary, India, Iran, Ireland, Israel, Italy, Japan, the Netherlands, New Zealand, Poland, Rumania, Scotland, Sweden, Thailand and the USA. A number of hypotheses were generated by subject matter specialists and specialists in education as a discipline.
In addition, two conferences were held, bringing together specialists from all the social sciences to examine the results of the mathematics study and to develop theories and hypotheses within their own disciplines which could be tested in an IEA survey. These hypotheses tended to fall into three categories: (1) hypotheses concerning subject matter; (2) hypotheses concerning schooling in general; and (3) hypotheses concerning the students' background. From these hypotheses variables were derived and given to the IEA Questionnaire Committee. The committee screened these variables in order to determine which were capable of being translated into pencil-and-paper items and were suitable and acceptable for a large-scale survey.

In Phase II it was decided to test three major populations. Population I was defined as all students aged 10:00 to 10:11 at the time of testing. This population was chosen because at that age nearly all children can read, but they have not yet left a general classroom for subject-matter specialist teachers. Population II was defined as all students aged 14:00 to 14:11 at the time of testing; this was then the last point in most school systems at which the whole age group was still in school. Population IV was defined as all students in the terminal year of those full-time secondary education programs which were either pre-university programs or programs of the same length. National centers also had the option of defining a Population III, a major terminal point in the school system between Populations II and IV.

For the preparation of the instruments, international subject matter committees were set up. To facilitate their efforts, subject matter committees were also set up in each country for each subject. Their task was to carry out a content analysis of the weightings of emphasis given in major textbooks to topics and objectives, a content analysis of national examinations, where these existed at the target population levels, and a content analysis of what a panel of teachers said they taught. Furthermore, national committees were requested to submit test items to the international committees. Once the preliminary versions were produced, national committees were asked to comment on them, and the preliminary versions were modified on the basis of these comments before being submitted for pre-testing. When the instruments were pre-tested, item analyses were performed, and national committees were asked to comment on these item analyses.
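The years:months age bands used to define the populations (for example, 10:00 to 10:11 at the time of testing) can be checked with a small helper. This is an illustrative sketch; the dates and function names are invented, not the study's actual testing calendar.

```python
# Compute a student's age in completed years and months on the testing
# date, and check membership of a years:months band such as 10:00-10:11.
# Illustrative only; all dates are invented.
from datetime import date

def age_years_months(birth: date, on: date) -> tuple:
    months = (on.year - birth.year) * 12 + (on.month - birth.month)
    if on.day < birth.day:      # the current month is not yet completed
        months -= 1
    return divmod(months, 12)   # (completed years, months 0-11)

def in_population(birth: date, testing_day: date, years: int) -> bool:
    """True if the student is aged years:00 to years:11 on the testing day."""
    return age_years_months(birth, testing_day)[0] == years
```

Defining the band in completed years and months, rather than by calendar year of birth, is what makes the population comparable across countries with different school-entry rules.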
On the basis of the item analyses and the comments of the national committees, the international committees proposed a final pre-test version, and national committees were again asked to comment on the items. On the basis of these comments, and in some cases several further pre-testings, the final versions of the tests were constructed.

In addition to the cognitive instruments, a number of attitude and descriptive scales were developed. These included: like/dislike of school, need achievement, interest in science, science and self, science and the world, literary transfer (perceived participation in the situations of fictional literature), literary interest, science teaching (a traditional-textbook versus experimental continuum), science laboratories (a given-instructions versus own-experiments continuum), and school environment (an authoritarian versus permissive continuum). Questionnaires were also produced for students, teachers, school headmasters, and national experts on education. These questionnaires were similar to those in the mathematics project but were, in general, more comprehensive. Thus, student questionnaires elicited more detailed information about the home environment; teacher questionnaires elicited more detailed information about training; questionnaires for school headmasters elicited more detailed information about decision-making and the allocation of resources; and questionnaires to national experts elicited more information about the society, economy, policy, and culture.

After three years spent in producing the measuring instruments, it became quite clear that it would be impossible to test all appropriate students in all subjects at the same time, since the testing load for some students would have been about 24 hours. The testing was therefore split into two parts: science, reading comprehension and literature were tested in 1970, and French, English and civics in 1971. The testing was conducted, in general, approximately three months before the end of the school year. With different school years and different hemispheres to take into account, the testing took place anywhere between January and November 1971.

There were, of course, many snags. One problem was to collect accurate data from the students themselves about the homes they came from, since it was known that the home would be a very powerful predictor in accounting for variance between students and also, when aggregated to form a neighborhood effect, between schools. A small pilot study was undertaken in 1967 to determine whether 10-year-olds could provide accurate responses to questionnaire items.
Several classes of 10-year-olds were administered a short questionnaire consisting of questions about parental occupation and education, parent-child interaction in the home, and parental aspirations for the child's future education. The results of this study were fairly clear. Those questions which asked the student to describe some aspect of his present life situation showed a high degree of agreement with the responses given by his mother; there was considerably less agreement between mother and child on items which were retrospective or prospective in nature. Although mother-child agreement is not proof of the truth of a response, it was sufficiently reassuring that, on the basis of the pilot study, a questionnaire study of 10-year-olds was judged feasible.
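The mother-child comparison in the pilot study amounts to a per-question agreement rate, which can be sketched as below. The question names and answers are invented for illustration.

```python
# Proportion of shared questions on which the child's answer matches the
# mother's. A minimal sketch of the pilot-study check, not the actual coding.
def agreement_rate(child_answers: dict, mother_answers: dict) -> float:
    shared = [q for q in child_answers if q in mother_answers]
    agree = sum(child_answers[q] == mother_answers[q] for q in shared)
    return agree / len(shared)
```

On the pilot study's pattern, present-situation questions would show rates near 1, while retrospective and prospective items (such as aspirations) would score lower.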


CERTAIN ASPECTS OF THE METHODOLOGY EMPLOYED

Test construction: the example of the cognitive science tests

For these instruments, it was first necessary to obtain as complete a view as possible of the science curricula of the participating countries at the three levels of testing: 10-year-olds, 14-year-olds and the pre-university grade. The International Committee prepared a tentative grid of content areas and objectives and circulated it to national committees, asking them to extend it on either axis according to what was taught by the time of the modal grade in the case of the two chronologically defined populations, and for the actual pre-university grade. Three methods of content analysis were suggested for arriving at the national grids: (a) analysis of the major textbooks and/or syllabi; (b) analysis of national examinations, where they existed for the target populations; and (c) analysis of what groups of teachers (for example, in different school types) said they taught.

The various national grids were merged into a total international grid. A set of ratings was then obtained from each national center concerning the amount of emphasis given to each cell in the teaching of science for the target population in question. At the same time, hypotheses were framed. On the basis of the ratings and the hypotheses so far advanced, the International Committee, in collaboration with the national committees, decided which cells to test. Items were then supplied from existing tests or were written by members of both the national committees and the International Committee. All items from existing tests proved to be in need of editing by the International Committee. Emphasis was put particularly on producing items measuring the higher scientific abilities and those testing special abilities such as the design of experiments or the handling of scientific apparatus.
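The content-by-objective grid and the cell ratings described above can be represented as a simple data structure; the sketch below retains the cells whose average emphasis across national centers reaches a threshold. All names, ratings, and the threshold are invented for illustration.

```python
# Each national center rates the emphasis (here 0-4) given to each
# (content area, objective) cell; cells with sufficient mean emphasis
# across centers are retained for testing. Illustrative data only.
from collections import defaultdict

def cells_to_test(national_ratings: dict, min_mean: float = 2.0) -> list:
    """national_ratings: {country: {(content, objective): rating}}."""
    by_cell = defaultdict(list)
    for ratings in national_ratings.values():
        for cell, r in ratings.items():
            by_cell[cell].append(r)
    return sorted(cell for cell, rs in by_cell.items()
                  if sum(rs) / len(rs) >= min_mean)
```

Merging the national grids into one international grid corresponds here to pooling each cell's ratings before averaging.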
Items were first selected from the point of view of appropriate subject area coverage and, as far as possible, equal representation of the contributing countries. The final decision in each case depended on whether the item, in the Committee's opinion, was potentially a good one. The items were then put into a common multiple-choice form with five alternative responses, and new items were devised to fill in the most obvious gaps in subject area coverage. Rough drafts of the pre-test forms were then sent to national centers for comment, and after replies were received, pre-test versions were prepared. In all, just over 1,600 items were pre-tested in order to arrive at final tests containing about 400 items.

Pre-testing of items was carried out by 16 countries early in 1968; the testing load was kept manageable by rotating the different test versions among countries. Before undertaking the pre-testing, national centers were given advice on how to deal with difficulties of translation, the use of popular and scientific terms and units, and the substitution of local plants, animals and materials for any unfamiliar ones used in the drafts. Guidance was also given on the administration of the tests so that standard procedures could be followed and the same kind of item analyses undertaken. The results of the pre-testing, carried out on judgment samples of 100 to 200 pupils for each population and sub-test, were item analyzed by the national centers and submitted to IEA Headquarters, where they were collated.

Final selection of items and their arrangement in tests was carried out by the Science Committee at a meeting in July 1968. The items for each cell to be tested were selected in terms of their a priori validity, their difficulty and their discriminatory power. In the final tests, there were 14 items common to Populations I and II, and 20 common to Populations II and IV.

It was felt that the Science Committee should also attempt to assess the students' ability to understand the nature and methods of science, as distinct from the purely cognitive aspects. To this end, a test which drew heavily on the TOUS (Test on the Understanding of Science and Scientific Principles) tests devised at the University of Chicago was compiled and pre-tested in September/October 1968. Comments were received from 11 countries, and full pre-test results, including item analysis, from eight.
On the basis of these data, it was decided to include a separate test on 'Understanding the nature of science' in the test battery. In science, one of the major differences between countries in the teaching objectives and behavioral categories was in regard to the place accorded to practical work in the laboratory or field. Many of the new developments in science education are concerned with the question of the nature and extent of the first-hand experience that is desirable during the study of science at school. In fact, one of the most important hypotheses to be tested by the science study is that students learning science through actual enquiry by sound scientific methods will achieve higher total test science scores than students being taught by traditional methods.


Since administering laboratory practical tests would have created difficulties in many countries because of the demands on time, equipment and space in the schools, it was decided to incorporate six paper-and-pencil items aimed at measuring the results of practical experience in each of the cognitive tests for Populations II and IV. Laboratory tests requiring a minimum of apparatus were also prepared, but it was optional for countries to administer them. As well as a total score for science, for Population II for example, there were also sub-scores available as follows: earth science, paper-and-pencil science, biology, chemistry and physics. Furthermore, there were also sub-scores on objectives: (a) functional information; (b) comprehension; (c) application; and (d) higher processes.

Number of items per test

Although the foregoing description of test construction has referred solely to science, the results reported later in this paper refer to reading comprehension and literature as well. The final number of items in any one test varied; the numbers, which represent the maximum scores possible in the test results, are presented in Table 3.

Table 3. Number of items per test

              Population I   Population II   Population IV
Science            40             60              66
Reading            39             42              38
Literature         —              36              36
Sampling

The main object of sampling was to estimate national mean values of key variables in each school system as economically as possible, with the lowest possible sampling errors. After the target populations were defined (see above), each target population was divided into an excluded population and a sampled population. Students were excluded from the study if they were in special schools or classes for physically or mentally handicapped children, or if they were in a category of schools that would have been either extremely expensive to sample or so small that the data obtained would make little difference to the estimated mean values.

Typically, two-stage stratified probability samples were employed to represent the sampled population. In the first stage of sampling, schools were selected with a probability proportional to the size of the school. In the second stage, students were selected from within the school with a probability inversely proportional to the size of the school, so that approximately equal numbers of students were drawn from each school while each student had the same non-zero chance of entering the sample. Sampling errors were reduced by stratifying the schools, the common stratifiers being the size of the school, the type of school, the region served by the school, and whether the school was single-sex or coeducational. In planning the sampling procedures, a guiding principle was to keep the average number of students selected from each school at about 30, with as many schools as the manpower and money available to carry out the survey allowed.

There were two deviations from this general strategy. The first was the use of three-stage sampling, with the first stage being the sampling of communities or administrative areas, in countries that were very large, for example India, Iran, or the United States. There was an additional difficulty in the Indian case: not only is the country of immense size, but the principal language varies from one part to another. Consequently, it was decided to limit the survey to the six States where Hindi was the principal spoken language. Within these States there was already in existence a master sample of administrative areas, and an accurate three-stage sample could be drawn, with the second and third stages being schools and students within schools, as in other countries. The second deviation was where whole classes, rather than individual students, were sub-sampled within schools. This procedure was used at Population IV in both France and Sweden.
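The two-stage design described above has a convenient property worth making explicit: drawing schools with probability proportional to size and then a fixed number of students per school gives every student the same overall inclusion probability. A sketch with invented school sizes, ignoring stratification and without-replacement corrections:

```python
# First stage: a school of size s is drawn with probability proportional
# to s (approximately n_schools * s / total for n_schools draws).
# Second stage: m students are drawn within the school, i.e. with
# probability m / s. The product is constant across schools.
def inclusion_probability(school_size: int, total: int,
                          n_schools: int, m: int) -> float:
    p_school = n_schools * school_size / total   # PPS, first stage
    p_student = m / school_size                  # equal take of m students
    return p_school * p_student                  # = n_schools * m / total

sizes = [120, 300, 800]                          # invented school sizes
total = sum(sizes)
probs = [inclusion_probability(s, total, n_schools=2, m=30) for s in sizes]
```

Every entry of `probs` equals `2 * 30 / 1220`: a student in a small school has the same chance of selection as one in a large school, which is what makes the design self-weighting.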
Where a class consisted only of students included in the defined population, and where the classes within the defined population could be ranked in an order of merit, this was perfectly satisfactory, provided the rankings could be obtained and allowance was made for the ranking of a class within a school in the regression analyses. In all cases the sampling plans prepared by each national center were checked by the international sampling referee before the plans were finalized and the schools approached. To cover cases in which official records were out of date, or in which circumstances prevented a school from taking part in the survey, a second sample of schools was prepared from which a replacement could be found to match any school that declined to participate. From each school approached, a list of all students belonging to the defined population was obtained, and the national centers prepared for each school a set of materials for each student, specified by name and number, who had been drawn in the sample to be tested.

It was perhaps inevitable that some schools were unable to take part in the survey at the last moment and that some students drawn in the samples were absent from school on the days testing took place. Furthermore, some students failed to record a response to important items in the battery of tests and questionnaires, some school principals failed to reply to the school questionnaire, and some schools had no teacher who answered a teacher questionnaire. In such cases students and schools had to be omitted from one or more stages in the analyses of the data. The actual numbers of students and schools who supplied data and were included in the main science analysis are recorded in Table 1.

National centers were responsible for the sampling designs adopted in their own countries, in accordance with instructions and advice set out in IEA bulletins and discussed at the meetings of national technical officers. The sampling designs and, more especially, the principles underlying them are reported in detail elsewhere (IEA, 1973). Their purpose was to arrive at an effective sample without imposing too great a strain upon national resources or on the schools picked in the sample. Although each country submitted its sampling plan to the IEA sampling referee for approval, responsibility for the quality of the sample drawn and the data included in the main analyses rested with each national center. To a limited degree it was possible to correct for the shortfall of students in any one stratum of a sample by computing sample weights and employing these student stratum weights in the calculation of population statistics.
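The student stratum weights just mentioned can be illustrated as follows: each sampled student is weighted by the number of population students he represents (population count divided by achieved sample count in his stratum), and national statistics become weighted means. All figures are invented; the study's actual weighting was more elaborate.

```python
# Correct for shortfall in a stratum: weight = population size / achieved
# sample size, applied per student in national statistics. Invented data.
def stratum_weights(pop_sizes: dict, achieved_sizes: dict) -> dict:
    return {s: pop_sizes[s] / achieved_sizes[s] for s in pop_sizes}

def weighted_mean(students: list, weights: dict) -> float:
    """students: list of (stratum, score) pairs."""
    total_w = sum(weights[s] for s, _ in students)
    return sum(weights[s] * score for s, score in students) / total_w

w = stratum_weights({'urban': 1000, 'rural': 1000}, {'urban': 100, 'rural': 50})
# The under-sampled rural stratum gets twice the weight of the urban one.
```

Without such weights, a stratum with a poor response rate would be under-represented in the national mean; this is exactly the correction that could not be made for India and Iran.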
However, two countries, India and Iran, provided insufficient information to allow student stratum weights to be computed, and for these countries each student in the actual sample was given equal weight in the calculation of national statistics. Since information was collected from students, teachers and schools on about 500 different variables, with known errors of sampling (and measurement) for at least students and schools, it will be recognized that even the univariate distributions of these variables, at these different levels, in each of the 22 school systems provided an enormous amount of information hitherto not available.

Statistical analyses

For each sample of each target population in each subject area, two main types of analyses were undertaken, namely between-student and between-school analyses. Since the main analyses were regression analyses, there was a problem of degrees of freedom in fitting constants to observations in the between-school analyses. As can be seen from Table 1 (which indicates the overall size of the operation), the number of schools for any one analysis ranged from 15 (Thailand, Population IV) to 327 (Italy, Population II), whereas there were some 500 predictor variables.

The task of reducing the number of predictor variables was carried out in a sequence of steps. First of all, certain variables, such as the father's occupation categories, type of school, and type of program, were criterion scaled, using reading comprehension or word knowledge scores as the criterion. A home background composite was then formed, whereby father's occupation, father's education, mother's education, number of books in the home, use of a dictionary in the home, and size of family were weighted into a single home background variable. The weights used were the regression weights obtained when reading comprehension was regressed on the individual variables. This composite is also important when assessing the relative importance of school and teacher variables.

A yachting analogy may help at this point. In yachting, the performance of a skipper and his crew is judged not in terms of who is first past the finishing post, but rather by the time taken once the dimensions of the yacht and its sails have been taken into account. Each yacht is commonly given a handicap depending on its waterline length and its sail area, and the actual time taken is adjusted for this handicap before performance is assessed. Similarly, in the analysis of the data collected in this investigation for any one country, when comparisons are being made between schools, what is important is not the actual level of performance of the students in a school, but rather what the school does with the material it receives.
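The home background composite described above (home variables weighted by the regression of reading comprehension on each variable) can be sketched as below. The data, variable names, and the use of simple per-variable slopes as weights are assumptions for illustration; the study's exact procedure may have differed in detail.

```python
# Weight each home variable by the slope from regressing the reading
# criterion on that variable alone, then sum the weighted variables into
# a single composite score per student. Invented data.
def slope(x: list, y: list) -> float:
    mx, my = sum(x) / len(x), sum(y) / len(y)
    return (sum((a - mx) * (b - my) for a, b in zip(x, y))
            / sum((a - mx) ** 2 for a in x))

def home_composite(rows: list, reading: list, variables: list):
    """rows: one dict of coded home variables per student."""
    weights = {v: slope([r[v] for r in rows], reading) for v in variables}
    scores = [sum(weights[v] * r[v] for v in variables) for r in rows]
    return scores, weights

rows = [{'father_education': 1, 'books_in_home': 10},
        {'father_education': 2, 'books_in_home': 20},
        {'father_education': 3, 'books_in_home': 30}]
scores, weights = home_composite(rows, reading=[40, 50, 60],
                                 variables=['father_education', 'books_in_home'])
```

The composite then serves as the "handicap" of the yachting analogy: it is the score that is partialled out before school and teacher variables are judged.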
In part the raw material that a school receives is determined by the community with which it is linked, and previous research suggests that the socio-economic level and the inter-related cultural level of the community make important contributions. These measures together describe the school's contextual setting, and the effectiveness of the education provided by the school must be assessed by what is achieved after allowance has been made for the handicap that some schools carry because of the communities in which they are set.

After these initial composite-forming steps, all variables were correlated with the main criterion measures, as well as with the scaled variables for father's occupation and type of program, where such a variable could be formed. In the second step, only the data for Population II were examined, and the partial regression coefficients for each variable were calculated after regressing on the home background composite. These partial regression coefficients were displayed visually, with the values for all countries and for each criterion measure plotted on a number line. From this visual record it was possible to select those variables that had potentially strong relationships with the criteria. This was done by identifying variables having a median regression coefficient across all countries numerically greater than 0.1, or, for a specific country, a partial regression coefficient numerically greater than 0.2. In cases where the number of schools included in the sample was small, some allowance was made in selecting the variables that were of importance for a particular country.

This procedure was repeated in the third step by regressing the criterion on both the home background score and type of program where it existed, or type of school where a suitable type-of-program variable could not be formed. In this way a short list of variables was selected for the between-school regression analyses in, for example, science at Population II. In the selection of variables for Populations I and IV, the simple correlations were examined first and a reduced list of variables was prepared. The variables in this shortened list were controlled for both the home background score and type of program or type of school. From the visual record of the partial regression coefficients a final list of variables for inclusion in the science regression analyses was formed. During this process of selecting variables for later use, attention was paid to several points.
First, if for any variable the data were missing for 20 per cent or more of either the schools or the students in the samples, the variable was deleted. Secondly, if a variable was known to be associated with a question containing a serious ambiguity, it was also discarded. Thirdly, in the initial coding some variables, for example size of class and proportion of time spent on practical work, not only showed non-linear relationships with the criterion but also, because of the wording of the questions asked, proved difficult in science to combine across the four branches of science. These variables were also rejected and set aside for further investigation.

From these operations three lists of variables were formed: (a) variables that were clearly related to science achievement for all countries; (b) variables that were clearly related to science achievement for one or more countries; and (c) variables that for one reason or another had been discarded. Variables in all three lists are important, because even those included in list (c) frequently involved measures for which a strong link with achievement in science had originally been hypothesized. The evidence from the sorting and sifting procedure indicated either that these variables were unrelated to the criterion or that they were so strongly related to type of school, type of program or the home background score that they had no distinct effect of their own, even though some may have had significant zero-order correlations with achievement in science.

Even after the rather harsh selection procedures described above, there were more variables listed for inclusion in the regression analyses than could be effectively handled. It was therefore decided to form some of the variables into compounds where possible. At one time it had been considered desirable to form clusters and composites for all countries using the same variables and a common set of weights. However, when the partial regression coefficients were examined, after regressing out the effects of the home background score and type of school or type of program, it was clearly not possible to employ an international set of weights. It was therefore decided, for variables in list (a), to use the same composites for all countries but to assign weights that differed from country to country; for the variables in list (b) it was necessary to form composites unique to each country. The weights used for each variable in a composite were integer values based on the regression coefficients obtained after the effects of the home background score and type of program or type of school had been partialled out. Where a variable appeared to be operating as a suppressor in a manner that could not be readily understood, it was assigned zero weight in a cluster.
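The numerical screening rule described above (retain a variable if the median of its partial regression coefficients across countries exceeds 0.1 in absolute value, or if any single country's coefficient exceeds 0.2) is easy to state in code. The coefficients below are invented.

```python
# Keep a variable when its partial regression coefficients (after the home
# background composite has been regressed out) look potentially strong,
# using the thresholds reported in the text. Illustrative values only.
from statistics import median

def keep_variable(coeffs_by_country: dict,
                  median_threshold: float = 0.1,
                  country_threshold: float = 0.2) -> bool:
    values = list(coeffs_by_country.values())
    return (abs(median(values)) > median_threshold
            or any(abs(v) > country_threshold for v in values))
```

The two-part rule mirrors the two uses of the analysis: the median test finds variables that matter everywhere (list a), while the single-country test catches variables that matter in only one or a few systems (list b).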
Variables were only linked together in a composite if they were notionally associated together, and if the data for them were obtained from the same source. In this way measures that could be readily understood were derived without the distortion that might arise if the data were obtained from different sources. In order to understand some of the tables to be presented in the results section later in this paper the rationale of the regression analyses should be explained. The example of science is used but basically the argument applies to all subject areas. In planning the regression analyses a model of causation had to be

National evaluation of educational systems


advanced which would relate the various input measures to the outcome of achievement in science. The basic proposition underlying the development of a causal model was that earlier events influence later events. The first variables to enter the regression equation must be those associated with the home into which the child was born, including both socioeconomic status and educational level. In a between-schools analysis these measures of the home, obtained from each student, are averaged to indicate the education or status level of the neighborhood or community linked with the school. Certain descriptive characteristics of the students in the school are entered into the regression equation at the same time as the home variables, for example the date of birth and the sex of the student. Both of these variables may be regarded as fixed and not subject to change over time. The second set of variables to be entered into the analysis are measures of the type of school the student attends and the nature of the program offered by the school. In part these variables reflect the home circumstances of the students and in part the students' native wit and previous educational experiences. This is particularly true in those school systems in which allocation takes place to schools of different types, or to different programs or courses within the schools. Not only do the variables entered at the first and second stages into the regression equation precede in time the variables entered at the third stage, but they may also be regarded as determining the sets of variables that follow. The programs or courses offered by the schools depend on the neighborhood in which each school is located and on the ability of the students it draws. These in their turn would seem to influence the set of variables that characterize the situation in which learning occurs.
Such variables may be concerned with the school, its size and its facilities, and its practices; the teachers, their age, experience and training; and the students, their exposure to learning both at school and at home. The many variables derived from the school, the teacher and the student questionnaires describe the conditions under which learning takes place, and they are made available for entry into the regression equation at this third stage. The fourth set of variables are of doubtful status in any model of causation, since they are largely contemporaneous with the achievement outcomes. The attitudes of the students towards school life and learning, their expectations for further education and a further occupation, their current reading habits and leisure time activities, together with certain current practices of their homes, are all factors that both influence


T. Neville Postlethwaite

achievement and are currently influenced by achievement. Such variables may have considerable predictive power, but their status in determining achievement is uncertain. Nevertheless, it is of value to include these variables in the regression equation in order to estimate the size of their possible effects. While there was a very large number of variables available for examination in this inquiry, it was recognized from the outset that in each analysis carried out there would be a substantial proportion of the variance unexplained. The failure to account fully for the variation in the achievement test scores would be partly due to errors of measurement; partly to other powerful factors, such as the skill of the individual teachers; and partly to the innate ability of the students in the schools. Therefore, word knowledge was entered as Block 5 and reading comprehension as Block 6 as surrogates or part surrogates for such variables. Although the analyses have been described in between-school terms the approach was basically the same for the between-student analyses.

SOME SELECTED RESULTS

The selected results will be: (a) between-nations; (b) between-schools; and (c) between-students.

BETWEEN-NATIONS

With an N of 20 school systems, multivariate analyses are not possible. Some snippets of the type of 'between-nations' analyses are presented.

1. Social bias

It is of interest to note the difference in social class composition of 14-year-olds in school and those students remaining in school in the final grade of secondary schooling. Because of the difficulty of finding an internationally valid social class scale, each country used a national set of occupation categories. Nevertheless it was possible to collapse categories in all countries into four major groups: unskilled and semi-skilled;


skilled; clerical; and professional/managerial. The formula used is given at the foot of Table 2. Table 2 presents the proportions of an age group in school at each population level and the social bias index between Population II and Population IV. For all data concerned with Population II it should be noted that the Federal Republic of Germany refers to Gymnasium students only and Sweden to Gymnasium and Fackskola students only. A high index indicates bias in favor of higher social class students.

2. Intended curriculum, actual curriculum and student performance

As already explained, ratings were awarded to each cell in the science grid (content x objectives). Since the science tests were so devised as to sample the whole grid, an index of the intended curriculum (through official syllabi, etc.) could be regarded as the sum of the ratings divided by the number of content rows or cells. The index of the actual curriculum was taken to be the mean of the 'opportunity to learn' ratings1 of the test items constituting a science sub-score or total score. The index of performance was taken to be the mean of the students' performance on the appropriate test items. An average grand standard deviation for each of these indices for all countries was computed and standard scores for each produced. Figure 1 gives the relative strengths of intended to actual curricula to performance for total science, biology, chemistry, physics and practical science scores for Australia, the Federal Republic of Germany and the United States for Population IV. Thus, for total science, Germany has recorded the highest index for intended curriculum; Australia has a lower index, and the United States the lowest. Opportunity to learn again is greatest in Germany, less in Australia and least in the United States, although in every case the teachers say that in the classroom they give more opportunity to learn than one would suppose from the textbook analysis.
In Germany, the students perform best (although less well than would be supposed from the teachers' ratings of opportunity to learn); in Australia, the students perform better than would be supposed from the opportunity to learn; and in the United States, slightly better than would be supposed.

1. All science teachers in each school in the sample were asked to rate each item in the test as to the percentage of students in the target population having had the opportunity to learn the principle embodied in the item.
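The conversion of raw country indices (mean ratings or mean item scores) into standard scores is simple to state computationally. The sketch below uses invented raw values, not the published IEA figures, and assumes a plain z-score around the grand mean as one reading of the 'average grand standard deviation' procedure described above.

```python
import numpy as np

def standard_scores(index_by_country):
    """Convert a raw index (e.g., mean 'opportunity to learn' rating
    per country) to standard scores around the grand mean."""
    countries = list(index_by_country)
    values = np.array([index_by_country[c] for c in countries], dtype=float)
    z = (values - values.mean()) / values.std(ddof=1)
    return dict(zip(countries, z))

# Invented raw intended-curriculum indices, NOT the published values:
intended = {'FRG': 2.4, 'Australia': 2.1, 'USA': 1.8}
scores = standard_scores(intended)  # approx. FRG 1.0, Australia 0.0, USA -1.0
```

Placing the intended-curriculum, opportunity-to-learn and performance indices on the same standard scale is what allows the three profiles in Figure 1 to be compared directly.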


Figure 2 plots the national relationship between science performance and opportunity to learn, and is particularly impressive since it shows the importance of opportunity to learn, which, as measured in this research, is the curriculum offered by the teachers in the classrooms.

3. Performance of the élite

It is often argued that 'more means worse', in the sense that the higher the proportion of an age group allowed into the final year of schooling, not only will the average achievement be lower but the achievement of the élite students will also be lower. Table 3 presents the percentage of an age group in school and the average score of those students, together with the mean of the top 1 per cent, 5 per cent and 9 per cent of an age group. This, of course, assumes that the students in school would perform higher than those not in school. It will be seen that more does not necessarily mean worse. Figure 3 presents the 'yield' in science at Population IV level of the various systems. The cumulative percentile frequencies for each country are plotted against the proportion of an age group still retained in school. Although yield is, to a certain extent, a function of retentivity, it is so only to a certain extent.
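The 'mean of the top x per cent of an age group' figures can be computed from in-school scores alone under the stated assumption that students no longer in school would score below those retained. The following is an illustrative sketch of that calculation, not the IEA code:

```python
import numpy as np

def elite_mean(scores, retention, top_pct):
    """Mean score of the top `top_pct` (as a fraction) of the whole age
    group, given scores only for the in-school fraction `retention`,
    assuming out-of-school students would all score lower."""
    share_of_sample = top_pct / retention
    if share_of_sample > 1:
        raise ValueError('elite group is larger than the in-school sample')
    k = max(1, round(share_of_sample * len(scores)))
    return np.sort(np.asarray(scores, dtype=float))[-k:].mean()
```

For example, if half the age group is retained, the top 5 per cent of the age group is the top 10 per cent of the tested sample; this is why comparing élite means across systems with very different retention rates is meaningful at all.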

BETWEEN-SCHOOLS

4. Between-school variance as a percentage of between-student variance

Table 4 gives the between-school variance as a percentage of the between-student variance for science for Populations I, II and IV. In all three populations the range is high, with differences between schools accounting for high proportions of the total variance in selective school systems and in some developing countries. It will be noted that for general school systems in Sweden, school (and ecological) effects account for the least variance. Where the between-school variance is a high percentage of the between-student variance, this is a function of both differentiated neighborhood effects and differential allocation of resources. The 58.4 per cent for Sweden for pre-university students is spuriously high because this is a mixture of 11th Grade (Fackskola) and 12th Grade (Gymnasium).
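The statistic in Table 4, between-school variance as a percentage of between-student variance, can be approximated from raw scores as below. This is an unweighted sketch; the published IEA analyses involve sampling weights.

```python
import numpy as np

def between_school_pct(scores_by_school):
    """Variance of school mean scores as a percentage of the variance
    over all individual students (unweighted illustration)."""
    students = np.concatenate([np.asarray(s, dtype=float) for s in scores_by_school])
    school_means = np.array([np.mean(s) for s in scores_by_school])
    return 100 * school_means.var() / students.var()
```

When schools are internally homogeneous but differ sharply in level, as in selective systems, the ratio approaches 100; when every school contains the full spread of students, as in comprehensive systems, it approaches zero.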

5. Between-school and between-student variation

Tables 5 to 10 present the variance accounted for by blocks of variables entered into the regression equation in the order given:

Block 1 consists of the home background composite plus age and sex.
Block 2 consists of the type of school and type of program variables.
Block 3 consists of the teacher and school variables which survived the purging by the partialling exercise described above.
Block 4 consists of the contemporaneous variables described above.
Block 5 is the word knowledge test, if given in a country.
Block 6 is the reading comprehension test, if given in a country.

Where a country had too small a number of observations, the analysis was not undertaken. In some cases (the two Belgiums, Iran and the Netherlands at Population IV) the analyses were undertaken, but even though the matrix did not become singular, care should be taken in interpreting the results because of the basic instability of the matrices. The variance (the standard deviation squared) between students describes the spread of scores on the dependent variable (for example, science in Table 5) from the best to the worst student. The variance between schools is the variance between school means (i.e., from the best to the worst school). The reader can peruse these tables himself. In the last resort the interpretation of evidence must depend upon memory, introspection and testimony, and these may differ from one interpreter to another. Some comments with which readers may or may not agree are, however, made at this point. The home (plus sex and age) accounts for the most variance. This is particularly true at the 10-year-old level. As one moves up the school system the teacher and school variables (Block 3) gain in their relative importance, particularly in school-orientated subjects (e.g., science).
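The block-entry procedure itself is straightforward to sketch. The code below is an illustrative reconstruction, not the IEA program: each block's contribution is the increase in the squared multiple correlation when that block joins the equation in the stated order.

```python
import numpy as np

def r_squared(X, y):
    """Proportion of variance in y accounted for by the columns of X."""
    X1 = np.column_stack([np.ones(len(y)), X])
    beta = np.linalg.lstsq(X1, y, rcond=None)[0]
    return 1 - (y - X1 @ beta).var() / y.var()

def variance_by_block(blocks, y):
    """Percentage of variance added as each block of predictors enters
    the regression in order (Block 1 first, then Block 2, and so on)."""
    increments, entered, prev = [], [], 0.0
    for block in blocks:
        entered.append(np.asarray(block, dtype=float).reshape(len(y), -1))
        r2 = r_squared(np.column_stack(entered), y)
        increments.append(100 * (r2 - prev))
        prev = r2
    return increments
```

Because each block is credited only with the variance left unexplained by earlier blocks, the ordering embodies the causal model described above: home background gets first claim, school and teacher variables only what remains.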
This result differs slightly in emphasis from the results of the Coleman study, in part no doubt due to the fact that the criterion used by Coleman was mostly reading; and, if causality is to be inferred, it is clear that reading is very much home determined. At the same time, it should be borne in mind that as a school system becomes more selective as one ascends it, so the variance of the home backgrounds will be restricted and will hence account for less of the variance. Table 11 presents those variables in Block 3 emerging in the regression analysis after Blocks 1 and 2 have been removed. Whereas there was a pre-determined order for entry of blocks, there was no stated order for


variables within blocks, i.e., the order was by size of the next largest partial. However, the figures entered in the tables are b(2/c) x 100, where b(2/c) is the equivalent of the amount of variance accounted for when the variable is entered as the last variable in Block 3. An asterisk (*) indicates that the beta weight has a negative sign (irrespective of the sign of the zero-order correlation). In other words, the 18.3 per cent of variance accounted for by amount of science study per year and amount of homework per year is a minimum estimate. It is interesting that the amount of study and homework does make an impressive difference, as does the number of years studied, and this would seem to be food for thought for those educators who often strike the chord that 'schooling does not make a difference'. Clearly, the time spent studying a subject, the amount of homework and, to some extent, pre-service teacher training are important variables. A word of warning should be sounded at this point. The between-schools analyses presented here are unweighted. The weighted analyses will appear in the forthcoming IEA publication series. Where the achieved samples were close to the design samples, the weighted and unweighted analyses will be much the same, but differences will occur where the discrepancies between design and achieved samples were substantial. The between-student analyses are over-weighted. The tables presented here are selected tables only; such information exists for every population for every subject.
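The b(2/c) x 100 figures can be read as last-entry variance contributions. Under that reading (an assumption here, since the full formula is not reproduced in this paper), the quantity is the drop in R-squared when a variable is removed from the full set, i.e., a squared semipartial correlation:

```python
import numpy as np

def last_entry_pct(X, y, j):
    """Variance (per cent) uniquely attributable to column j of X when
    it is entered last: R-squared for all variables minus R-squared
    with column j removed, times 100."""
    def r2(M):
        M1 = np.column_stack([np.ones(len(y)), M])
        beta = np.linalg.lstsq(M1, y, rcond=None)[0]
        return 1 - (y - M1 @ beta).var() / y.var()
    return 100 * (r2(X) - r2(np.delete(X, j, axis=1)))
```

Because correlated predictors share variance, crediting each variable only with its unique last-entry contribution understates its total association with achievement, which is why the 18.3 per cent quoted above is a minimum estimate.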

CONCLUSION

The comments above are brief but important. Readers will, however, glean much more from the tables themselves. The detailed results of the six-subject study will be published in nine volumes by Almqvist and Wiksell, Stockholm, throughout 1973. For further details of these publications, write to: IEA, Box 6701, S-11385 Stockholm, Sweden. Although there were some 500 independent variables for each subject analysis, it is clear that in most cases only some 50 to 80 variables (depending on the subject) turn out to be important when other factors are controlled for. This implies that it should be possible to streamline surveys of this nature in future by reducing the total amount of


data to be collected and processed. On the other hand, in many countries the variables used failed to account for more than 50 per cent of the variance. There are perhaps some lessons to be learned from the testing carried out in the developing countries. Professor R. L. Thorndike of Teachers College, Columbia, points out in his publication on reading comprehension how, by an arbitrary measure of illiteracy, a high proportion of children (approximately 50 per cent) in developing countries are illiterate, whereas the corresponding percentage in developed countries is 8 to 10 per cent. This casts doubt on the children's ability actually to read the tests and questionnaires administered to them, which in turn casts doubt on the results. In some cases the home variables do not behave in the same way as in western countries; e.g., size of family is not negatively correlated with achievement in India. Furthermore, the total amount of variance accounted for is generally less in developing countries. The tests, despite all the precautions taken in the pre-testing, proved to be very hard for some developing countries; there was not sufficient bottom to the tests, and hence it is likely that the standard deviation is not a realistic one but a restricted one. However, work is required in developing countries in terms of collecting the data reliably and conceptualizing further on the predictor variables. Some readers may not agree with the plan of analysis adopted, either in the reduction of variables or in the order of entry into the regression equation. IEA is establishing a data bank where all the raw data will be stored in a systematic way; this data bank can be used by scholars across the world, again by applying to IEA at the address given above.

REFERENCES

Foshay, A.W. (ed.) (1962) Educational achievement of 13-year-olds in twelve countries. Hamburg: Unesco Institute for Education.
Husén, T. (ed.) (1967) International study of achievement in mathematics, vols I and II.
Stockholm: Almqvist & Wiksell; New York: Wiley.
IEA (1973) An empirical study of education in twenty-one countries: A technical report.
Thorndike, R.L. (1973) Reading comprehension education in fifteen countries. International Studies in Evaluation III. Stockholm: Almqvist & Wiksell; New York: Wiley.

[The tables referred to above appeared here. The recoverable labels include the countries (the United States, Thailand, Sweden, Scotland, New Zealand, the Netherlands, Japan, Italy and Iran) and the variable blocks (Home/Age/Sex; Type of school/Type of program; School/Teacher; Other; Word knowledge; Reading comprehension; Total out of 100); the numerical entries are not legible in this copy.]
(p < .01). In spite of this difference, the two groups' scores overlapped almost completely. Scores ranged from 9 to 28 for those who completed the course and from 8 to 26 for those who did not. Further inspection of ACT scores shows that while 12 per cent of the students completing the course had scores close to two standard deviations below the mean, 25 per cent of those who withdrew did so. Clearly, the course put off the less able. This might have been predicted by the many studies which correlate success with expectation and thereby with motivation (Feather, 1966; Kagan, 1968; Brim et al., 1969). Students who have previously done relatively poorly tend to expect to do poorly again, and might well choose withdrawal from a course that appears difficult.

The power of motivation

Group differences in class rank in high school were also statistically significant (t = 2.89, p < .01). In terms of real difference, however, little is evident, for those who completed the course had a mean percentile rank among their classmates of 62, while those who did not complete the course had a mean percentile rank of 57. It is interesting that these students had all experienced success in their earlier academic careers, though of course the absolute implications of success differed between high schools. Possibly expectations had changed in the course of one or more terms in

Abby G. Rosenfield, William Pizzi, Anthony Kopera and Frank Loos

college, but it is more likely that motivational differences are the primary basis upon which the decision to remain or not was made.
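The group comparisons reported above (ACT composite and class rank) are standard two-sample comparisons, and a pooled-variance t statistic of the kind quoted can be computed as below. The illustrative data are invented; only the reported t of 2.89 comes from the study.

```python
import numpy as np

def two_sample_t(a, b):
    """Pooled-variance two-sample t statistic, as used to compare
    course completers with withdrawers."""
    na, nb = len(a), len(b)
    pooled = ((na - 1) * a.var(ddof=1) + (nb - 1) * b.var(ddof=1)) / (na + nb - 2)
    return (a.mean() - b.mean()) / np.sqrt(pooled * (1 / na + 1 / nb))

# Hypothetical percentile ranks, NOT the study's data:
completers = np.array([62.0, 70.0, 55.0, 68.0, 64.0, 59.0])
withdrawers = np.array([57.0, 50.0, 60.0, 52.0, 58.0, 55.0])
t = two_sample_t(completers, withdrawers)
```

As the authors note, a significant t can coexist with almost completely overlapping distributions; the statistic speaks to mean differences, not to separation of the groups.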

RESEARCH IN PROGRESS

The power of reading

Further investigations will be made of differences between students who successfully complete the course and students who do not. One study currently under way will compare differential reading ability and level of performance, inasmuch as the course as currently constituted is almost wholly dependent upon reading. Should this be found to be a critical variable, alternative methods of data presentation, such as videotape, could be explored. Lessening the number of units to be mastered or lengthening the time permitted for completion is also under consideration. Certain test questions have been found to need revision, and this will be done as well.

Motivation of student and teacher

Theoretical bases for the motivational power of testing must be investigated in ever greater depth, as must their other effects upon learning. But it can be said that the traditional nemesis of the student, the test, can be used as a rewarding aspect of study. Somehow, tests properly used increase the effects of study. Teachers must be encouraged to use such effective means to maximize their own potential through the maximization of the potential of their students.

REFERENCES

Anderson, R.C., & Myrow, D.L. (1971) Retroactive inhibition of meaningful discourse. J. educ. Psychol. Monograph, 62, 81-94.
Bloom, B.S., Hastings, J.T., & Madaus, G.F. (1965) Handbook on formative and summative evaluation of student learning. New York: Holt, Rinehart, and Winston.
Born, D.G., & Herbert, W.A. (1971) A further study of personalized instruction in large university classes. J. exp. Educ., 40, 6-11.
Brim, O.G., Glass, D.C., Neulinger, J., & Firestone, I.J. (1969) American beliefs and attitudes about intelligence. New York: Russell Sage Foundation.
Feather, N.T. (1966) Effects of prior success and failure on expectations of success and subsequent performance. J. Pers. soc. Psychol., 3, 287-98.

Testing as teaching


Ferster, C.B. (1968) Individualized instruction in a large introductory psychology course. Psychol. Rec., 18, 521-32.
Frase, L.T. (1970) Boundary conditions for mathemagenic behaviors. Rev. educ. Res., 40, 337-47.
Gagné, R.M. (1965) The conditions of learning. New York: Holt, Rinehart, and Winston.
Green, B.A. A self-paced course in freshman physics. Occasional Paper No. 2, Education Research Center, Massachusetts Institute of Technology, Cambridge, Mass., 02139.
Guthrie, J.T. (1971) Relationships of teaching method, socio-economic status in concept formation. J. educ. Psychol., 42, 345-51.
Johnston, J.M., & Pennypacker, H.S. (1971) A behavioral approach to college teaching. Amer. Psychol., 26, 219-44.
Kagan, J. (1968) On cultural deprivation. In Glass, D.C. (ed.), Biology and behavior: Environmental influences. New York: Russell Sage Foundation.
Keller, F.S. (1966) Engineering individualized instruction in classroom teaching. Paper presented at the meetings of the Rocky Mountain Psychological Association, Albuquerque, New Mexico.
Keller, F.S. (1968) Goodbye, teacher... J. appl. beh. Anal., 1, 79-89.
Kirkland, M.C. (1971) The effects of tests on students and schools. Rev. educ. Res., 41, 303-50.
Koen, B.O. (1970) Self-paced instruction for engineering students. Engin. Educ., March, 735-36.
Krech, D., Crutchfield, R.S., & Livson, N. (1969) Elements of psychology, 2nd ed. New York: Alfred Knopf.
Lloyd, K.E., & Knutzen, J.J. (1969) A self-paced programmed undergraduate course in the experimental analysis of behavior. J. appl. beh. Anal., 2, 125-33.
McMichael, J.S., & Corey, J.R. (1969) Contingency management in an introductory psychology course produces better learning. J. appl. beh. Anal., 2, 79-83.
Moore, J.W., Hauck, W.E., & Gagné, E.D. (1973) Acquisition, retention, and transfer in an individualized college physics course. J. educ. Psychol., 64, 335-40.
Morris, C.J., & Kimbrell, G.McA. (1972) Performance and attitude effects of the Keller method in an introductory psychology course. Psychol. Rec., 22, 523-30.
Roderick, M.C., & Anderson, R.C. (1968) A programmed introduction to psychology versus a textbook-style summary of the same lesson. J. educ. Psychol., 59, 381-87.
Rothkopf, E.Z. (1970) The concept of mathemagenic activities. Rev. educ. Res., 40, 325-36.
Scriven, M. (1967) The methodology of evaluation. AERA monograph series on curriculum evaluation, No. 1, 39-83.
Skinner, B.F. (1965) Reflections on a decade of teaching machines. In Glaser, R. (ed.), Teaching machines and programmed learning, II. Washington, D.C.: National Educational Association.
Tiedeman, H.R. (1948) A study in retention of classroom learning. J. educ. Res., 41, 516-41.
Tobias, S. (1968) The effect of creativity, response mode, and subject matter familiarity on achievement from programmed instruction. New York: MSS Educational Publishing Company.
Tobias, S. (1973) Sequence, familiarity, and attribute by treatment interactions in programmed instruction. J. educ. Psychol., 64, 133-41.
Witters, D.R., & Kent, G.W. (1972) Teaching without lecturing: Evidence in the case for individualized instruction. Psychol. Rec., 22, 169-75.

PART 4:

Testing to fit new teaching techniques: Grouping and individualization

Many new educational methods aim at adapting teaching to fit individual students. Streaming and other ability grouping plans, programmed instruction and other kinds of individualized methods require new approaches to evaluation. New definitions of aptitude for learning and new ways of thinking about the interaction of aptitude with instructional methods are needed.

16

RICHARD E. SNOW
Stanford University1

Aptitude-treatment interactions in educational research and evaluation

The topic of this plenary session is 'Testing to fit new teaching techniques: Grouping and individualization'. I draw three implications from this title. First, that testing must be designed not just to predict who will do well with existing teaching techniques, or to determine who has done well with them, but to evaluate the appropriateness of new techniques for different students, and maybe even to suggest what new kinds of teaching techniques need to be developed to suit different students. Secondly, that individual students do differ from one another substantially enough that some kind of subgrouping of students or individualization may sometimes (perhaps often) be necessary. These two implications in turn suggest a third: that individual differences among students must somehow interact with alternative instructional methods. My role in leading off this session is first to introduce you to the general problem and possibilities of interaction between student characteristics and instructional methods, then to review briefly some of the research on this problem, and to draw along the way some implications from this work for the future of educational research and development, particularly its measurement and evaluation aspect. It will simplify my task to refer to any interactions between student pretest characteristics and instructional method variables as instances of aptitude-treatment interactions, or ATI. The term 'aptitude' refers to any individual difference variable that predicts response to instruction in some treatment. General and specific

1. This paper was prepared while the author served as Boerhaave Professor, Medical Faculty, University of Leiden, Netherlands. The support of the Leiden Medical Faculty is gratefully acknowledged.

cognitive abilities, prior achievements, personality and stylistic characteristics, motivational and attitudinal tendencies are all potential aptitudes in this sense. Treatment variables can be any instructional comparisons, including alternative curriculum or instructional methods, minor variations in instructional sequencing, or different teachers, clinical or counseling procedures, or even different classroom or institutional environments. The general designation ATI allows all interactions between variables from these two classes to be categorized as similar phenomena, or at least potentially similar phenomena.

THE BASIC CONCEPT OF ATI

The history and development of ATI research has been described in several previous publications (Cronbach & Snow, 1969; Snow, 1972). Only a brief summary is needed here. Throughout the history of educational research, and of all psychological research in fact, there have been two basic approaches. First, the experimental approach has tried to find improved instructional methods or media or materials by comparing alternative instructional treatments in experiments. Students are randomly assigned to treatments - call them A or B - and these are compared on some measure(s) of learning outcome. The treatment showing the highest average outcome is judged superior. Individual differences among students within a treatment are considered to be due to chance and ignored. The approach is represented in Figure 1a, where A and B designate the two treatment averages. The other, correlational, approach ignores variations in instructional treatment, assuming treatment to be institutionally given. It seeks improved measures of student aptitude for instruction, to predict which students would most likely profit in the given treatment. Those aptitude variables showing strong relation (steep regression slope) to achievement are used to select students for instruction, thus raising average outcome. Figure 1b shows the kind of result sought by this approach. Cronbach (1957) argued that psychology and all its applications would not progress substantially until the two approaches were combined. In educational settings, for example, it was quite possible that aptitudes and treatments would interact, i.e., the relation of aptitude and outcome might be quite different in different treatments (as suggested in Figure 1c). If this were so, it would be possible to build alternative treatments suited to the needs and characteristics of different kinds of students, improving overall outcome by assigning students to treatments on the basis of ATI. Considering Figures 1b and 1c together you can see that a student high in this particular aptitude is better off in treatment A, while a student low in this aptitude is better off in treatment B.

[Figure 1. Data of interest in experimental, correlational, and interactional research. Each panel plots learning outcome against aptitude: (a) treatment A and B averages; (b) a single regression of outcome on aptitude; (c) differing regression slopes for treatments A and B.]


Another way of saying all this is to say that tests have traditionally been used for two purposes in education. One is to assess individual differences in students relevant to learning; the other is to evaluate instructional treatments to determine which is best for learning. These two purposes must be combined. We cannot determine which student differences are relevant for learning without examining alternative instructional treatments, and we cannot evaluate alternative instructional treatments without examining individual differences among students.

Consider three possible outcomes of an educational experiment that pays attention to both aptitude and treatment variation. In Figure 2a the two regression slopes are nonparallel and cross, so there is ATI. The point at which the two slopes cross can be used to assign students differing in aptitude to treatment A or B. In Figure 2b, the two slopes are nonparallel, so again there is ATI. Here no clear classification rule is available, but assignment of different students to different treatments is still worthwhile if treatment A is more costly than B or requires some quota limit. Note that the traditional experimental approach, looking only at averages for the data in Figure 2a, would conclude that there was no difference between treatments - an incorrect generalization. This approach would also conclude, in Figure 2b, that treatment A was better than B for everyone - also an incorrect generalization. The traditional correlational approach would look at average or pooled regression slopes in Figures 2a and 2b, concluding that the predictor was not valid and should be discarded - also incorrect. Only data like those depicted in Figure 2c fit the traditional experimental and correlational approaches. Here, the slopes are parallel, so average outcomes and pooled regression lines do offer valid generalizations. Parallel regression slopes are probably quite rare, so we should expect some degree of ATI as a frequent research result.
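The Figure 2a assignment rule can be made concrete. The sketch below assumes simple linear within-treatment regressions of outcome on a single aptitude (an idealization of the analyses Snow describes): it finds the crossover point and assigns each student to the treatment with the higher predicted outcome.

```python
import numpy as np

def ati_assignment(intercept_a, slope_a, intercept_b, slope_b, aptitude):
    """Crossover aptitude value for two within-treatment regression
    lines, and the treatment ('A' or 'B') with the higher predicted
    outcome for each student."""
    if slope_a == slope_b:
        raise ValueError('parallel slopes: no crossover, no ATI')
    crossover = (intercept_b - intercept_a) / (slope_a - slope_b)
    predicted_a = intercept_a + slope_a * np.asarray(aptitude, dtype=float)
    predicted_b = intercept_b + slope_b * np.asarray(aptitude, dtype=float)
    return crossover, np.where(predicted_a >= predicted_b, 'A', 'B')
```

With a steep slope in treatment A and a flat one in B, students above the crossover go to A and those below it to B, which is precisely the disordinal pattern that the averages-only and pooled-regression analyses both miss.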
[Figure 2. Varieties of regression analysis results. Each panel plots learning outcome against aptitude (low to high) with one regression line for treatment A and one for treatment B: (a) crossing, nonparallel slopes, with 'assign to B' below the intersection and 'assign to A' above it; (b) nonparallel slopes that do not cross within the aptitude range; (c) parallel slopes.]

Actually, there are important theoretical and practical reasons for looking for ATI. It has already been demonstrated, in Figure 2a, how ATI can be used to assign students to different treatments, thereby improving learning outcome for everyone. Further, it can be argued that all attempts to individualize instruction rest implicitly or explicitly on ATI hypotheses, yet rarely are these hypotheses actually tested and acted upon in designing or revising such instruction. ATI de-emphasizes the old view of selecting students to fit existing instructional methods and institutions in favor of attempts to design different methods and institutions to help the learning of different kinds of students. ATI reopens for new consideration all the old comparisons of instruction as meaningful v. rote, didactic v. inquiry-orientated, teacher-centered v. student-centered, etc. ATI is the best demonstration of construct validity; it shows that a construct's relation to other variables can be manipulated experimentally and is thus understood to some degree. And, finally, ATI may prove to be a key to testing hypotheses about intervening variables in S-R or cognitive theories of learning. To examine the role of some mediating activity between stimulus and overt response, for example, it should be possible to build an independent measure of individual differences in the presumed mediating process, then to show that the measure relates to performance in experimental conditions which require that mediating process and is unrelated to performance in conditions where that mediator is not needed.
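The logic of that last test can be made concrete in a small simulation. This is a hypothetical sketch (the variables and effect sizes are invented for illustration): a measure of the presumed mediating process should correlate with performance in the condition that requires the mediator, and show little correlation where the mediator is not needed.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 300

# Hypothetical independent measure of the presumed mediating process.
mediator = rng.normal(0.0, 1.0, n)

# Performance in a condition that requires the mediator depends on it;
# performance in a condition that does not require it is unrelated.
perf_requires = 0.7 * mediator + rng.normal(0.0, 0.7, n)
perf_no_need = rng.normal(0.0, 1.0, n)

r_requires = np.corrcoef(mediator, perf_requires)[0, 1]
r_no_need = np.corrcoef(mediator, perf_no_need)[0, 1]
# Construct-validation pattern: r_requires substantial, r_no_need near zero.
```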

SOME GENERAL RESEARCH FINDINGS

While these arguments may be convincing in the abstract, it would all be for naught if real-world empirical results rarely showed ATI. Some early reviews of the ATI literature were in fact quite pessimistic, based on the studies then in hand (for example, Bracht, 1970). But closer examination of the methodological problems in ATI research and continued assimilation of recent studies have suggested quite strongly, I think, that ATI phenomena are widespread (Berliner & Cahen, 1973; Cronbach & Snow, 1969, in press). However, it is true that ATI findings are not readily replicable; many appear quite unstable, and most are poorly understood. It will be a long time before many practical applications are firm, but enough instances are now in hand to argue that ATI considerations should accompany virtually all educational psychological research. Even the investigator who has no substantive interest in the psychological nature of aptitude must consider the inclusion of aptitude variables and the analysis of ATI to justify the generalizations he desires to make about instructional treatments. It is argued elsewhere (Snow, 1973) that ATI is one of three fundamental methodological considerations basic to generalization in educational science. (Space will not be used here to discuss the other two.) It is clear also that the educational evaluator who is interested not so much in generalizing as in improving a particular treatment will often learn a great deal about the functioning of his product by examining ATI as part of the evaluation.

It may be helpful to discuss briefly some of the areas where ATI have
been studied and to try to reach some summary statements. We can now list a number of tentative and still vague hypotheses, based on a review of several hundred studies. These statements are not to be taken as firm conclusions, but rather as rough guidelines for further work. They are tentative and vague because most of the research on which they are based contains methodological weaknesses and inconsistencies, and does not provide much understanding of the underlying psychological mechanisms that may give rise to ATI. But they are beginning to add up at several points and they do give a base from which more penetrating studies can be launched. I will not burden you with citation of all individual studies here. The suggestions are documented and described in more detail in a forthcoming book (Cronbach & Snow, in press).

(1) A first conclusion must be that general mental ability is well characterized as 'ability to learn' in many situations. Its relation to learning outcomes is sometimes strong, sometimes weak, sometimes even negative, but never to be discounted. This does not imply that ability to learn is unitary, but it is true that special abilities such as those proposed by Guilford (1967) only rarely yield ATI that demand special ability interpretation. In subsequent statements I will refer to students differing on general ability only as 'highs' and 'lows', though obviously these labels are relative, not absolute.

(2) The sheer weight of the evidence comparing programmed instruction with various kinds of conventional teaching suggests that lows benefit more from programming than they do from conventional treatment. However, since a few studies show the opposite, i.e., that programming is best for highs, the programming of any given lesson may or may not benefit lows, depending on additional variables yet to be identified.
At least it is clear, though, that the old claim that programmed instruction would do away with individual differences can be refuted, and I would include the same claim made for mastery learning in this refutation. Within programmed instruction, the many possible minor variations, like size of step or branching conditions or overtness of response, lead to some ATI findings but no clear conclusions. There is evidence that the pace of instruction should be matched to the preferred tempo of the student.

(3) If instruction provides or encourages meaningful interpretation of content, highs are helped and an extra burden is often placed on lows. But 'meaningful' instruction, as opposed to rote learning, may not be meaningful for lows because they lack interpretative and mediational skills. This does not necessarily suggest that rote learning is best for them, but rather that other means yet to be devised are needed to make instruction meaningful for lows without being burdensome. Some evidence exists that highly verbal and abstract conceptual treatments are particularly bad for lows, while simple diagrams, figures, and symbolic constructions can be used to replace or supplement abstract interpretations to benefit lows. These additions are typically unnecessary and sometimes even harmful for highs. There is also the interesting suggestion, coming from a small laboratory study (Dunham & Bunderson, 1969), that memory or reasoning abilities are brought into play by the extent to which a treatment does or does not provide rules for organizing information. When rules were given to aid concept learning, success depended on general reasoning ability. When rules were not given, success depended on a separate rote, associative memory ability.

(4) There is weak evidence that advance organizers in the form of preliminary abstracts or summaries help lows and may hurt highs. Perhaps giving students new schema provides a needed framework for lows, but conflicts with idiosyncratic schema already used effectively by highs.

(5) In a similar vein, some research shows that demonstrations and training on a cognitive skill the learner can use in a task help lows and hinder highs.

(6) Results on the value of inserting periodic questions into text or televised lessons are quite inconsistent. A number of ATI results have been obtained in this area, but the studies (including attempts at replication) have frequently contradicted one another. No summary hypothesis is possible at this time.

(7) Simple comparisons of visual-spatial treatments and verbal treatments, using the verbal-spatial ability distinction also as aptitude, have led to nothing so far. This was an early favorite ATI hypothesis that has failed regularly.
Here is an area where deeper, more penetrating task analysis is clearly needed, since a treatment is not visual simply because it uses pictures. There is some reason to believe that tasks requiring spatial analysis and covert manipulation of images are good for students with high spatial ability, while low students benefit from simple figures that clarify meaning.

(8) Previous experience with a method or medium of instruction helps the student learn new content in later situations where the same method or medium is used. This appears to be a special case of aptitude development, noted primarily in research on filmed or televised instruction, where it is referred to as 'media literacy'. Apparently, aptitude functioning here is a transfer or learning-to-learn phenomenon that may require a new approach to aptitude measurement, perhaps using simulations or work sample tests. The suggestion also arises here that aptitude should be assessed periodically during the course of instruction so that aptitude development for particular treatments can be recognized.

(9) Among the large studies designed to evaluate new curriculum developments in science, mathematics, etc., there are often ATI findings. These are not entirely consistent, but they suggest at least that highs do not differ much from treatment to treatment but that choice of curriculum is critical for lows.

(10) In the reading curriculum area, there have been several comparisons of phonics methods with 'look-say' or whole-word methods showing ATI. The most likely hypothesis is that stronger relations between ability and outcome occur with phonics. Phonics thus seems best for highs, while neither phonics nor conventional whole-word treatments are particularly good for lows. But there is one large study that shows the opposite ATI (i.e., phonics best for lows) and a few studies that show no ATI, so the issue is far from decided. It is worth noting that the phonics v. whole-word decision has been a raging controversy for years, with each method having its strong advocates. The ATI results suggest that neither group is all right or all wrong. This is probably the case in most other such controversies.

(11) Many studies can be classed as comparisons of didactic v. discovery or deductive v. inductive treatments. The results are fairly consistent: inductive treatments give a steeper positive slope for general ability. Inductive methods are better for highs, while for lows there is usually little difference between methods; neither is very good.

(12) It appears that the characteristics of college environments interact with student ability and attitudinal characteristics.
Classroom climate variables in secondary school have also been shown to interact with student ability. Not enough work has been done here, however, to formulate likely ATI hypotheses.

(13) Among studies that have investigated cognitive style variables as aptitudes, there are two intriguing hypotheses. With level of conceptual complexity (measured by paragraph completion tests) as aptitude, lows do better in structured treatments, like lectures, rule-example sequences, directive teaching situations, etc. Highs are better off in unstructured treatments, those that are more student-centered and inductively oriented. Also, a review by Witkin (1972) has suggested that matching teachers and students on similarity of style, using the concept of field independence-dependence, serves each kind of student better. So far, however, there is not strong evidence for either of these style hypotheses. There is doubt about the interpretation of conceptual level and field independence as stylistic aptitudes distinct from abilities. Cattell's (1963) definitions of general verbal, or crystallized, ability and general analytic, or fluid, ability seem close to what is meant by conceptual level and field independence, respectively. And there are parallel ATI findings using ability variables. One study showed more structured teachers to be better for low ability students, with permissive teachers more effective with high ability students. This is similar to the conceptual level-structure hypothesis. Other research has suggested that benefits derive from matching teachers and students on ability patterns and on behavioristic v. humanistic value orientations about the subject-matter (which incidentally was introductory psychology). It is not hard to imagine field independence-dependence as a distinction embedded in these contrasts. Thus, here is an important area for the application of ATI research to determine construct validity. Are style constructs really necessary? Cognitive styles may or may not represent a new class of individual difference variables and a link between the ability and personality domains. If they do, however, then they bring forward a new conception of aptitude and a need for new kinds of aptitude measurement in education. ATI research with these variables is only now getting started, but it seems to be a highly promising attack.

This brings us to the question of personality variables as aptitudes. There are many interactions here that could be discussed. There is reason to believe that more ATI results will be forthcoming in the personality domain than with abilities as aptitudes. Here, however, summary statements are impossible at present.
The number of apparently different personality dimensions that have been investigated in ATI studies is enormous, but their meaning is uncertain and the methodological problems are severe. I have chosen to list a few illustrative findings, but I can give no assurance that these results will be substantiated by further research.

(14) There are ATI findings that indicate that high anxious students need structure (with teacher directiveness, support, and feedback) but are hurt by the addition of any kind of stress. Low anxious students, on the other hand, are helped by the addition of mild stress; they function well where the situation is less structured, where teachers are more permissive and participative. There is an added indication that student compulsivity interacts in a similar way, and that student perception of the teacher as directive or participative provides ATI even where teachers use a mixture of these styles. It is also likely that student anxiety enters higher-order interactions with ability and these treatment variables, particularly as content varies in difficulty.

(15) Students high in extraversion appear to perform well in inductive and discovery situations, while didactic instruction is better for more introverted students. Also, discouragement seems to stimulate extraverts while hurting the performance of introverts. Encouragement appears to have the opposite effects.

(16) High sociable students have done better in lecture-discussion treatments as opposed to small groups, and better in live situations as opposed to canned or taped presentations. These differences were not noted for low sociable students. There is some evidence that students high in sociability and low in test anxiety work better in pairs than alone, but that students low in sociability and high in test anxiety are better off alone. This hypothesis is probably overly simple, however, because there is other evidence that ability and sex moderate this kind of ATI.

(17) A well-founded hypothesis derives from studies contrasting student achievement via independence and achievement via conformity, as an aptitude, and the extent to which college teachers encourage independence v. conformity. Students do best with teachers who encourage their preferred style. A similar result comes from work relating teacher arousal of motivation for achievement to need for achievement in students.

(18) Comparisons of filmed v. live lecture demonstrations have suggested that film is best for students high in responsibility and/or low in ascendancy, while live presentation is best for students with the opposite pattern.

(19) Finally, one complicated study has suggested some other types of classification variables of use in ATI work.
Students were classified as strivers, docile conformers, and opposers. Teachers were classified as spontaneous, orderly, or fearful in their classroom behavior, and further as generally superior or inferior in performance. Striving students did well with spontaneous-superior or orderly teachers but poorly with fearful teachers. Docile conformer students did well with spontaneous or orderly-superior teachers and poorly with fearful or spontaneous-inferior teachers. Opposers did poorly in general but best with orderly teachers.

SOME SPECIFIC EXAMPLES

Having given this summary of ATI findings, it is appropriate to touch briefly on three particular studies that can serve as examples of ATI thinking in educational research and evaluation studies.

The first is perhaps the largest study of programmed instruction ever undertaken: the US Office of Economic Opportunity's try-out of performance contracting. The study included some 25,000 students in Grades 1, 2, 3, 7, 8, and 9 (corresponding roughly to ages 6, 7, 8, 12, 13, and 14) from 18 geographically scattered school districts in the US. In each district, one of six private companies specializing in educational technology contracted to provide remedial reading and mathematics instruction to the lowest achievers. The companies used instructional procedures and materials of their own design, but all chose some form of programmed instruction with incentives for achievement. An independent research agency evaluated the program (Ray, 1972).

In each school district, the school with the largest deficiency in reading and mathematics was designated as 'experimental'; the next most deficient school was called 'control'. In each school, attention was given to the 100 students in each grade whose initial achievement was lowest. Assignment to treatments was thus not random, and pretest differences between experimental and control groups appeared at many sites. Instruction continued for a full school year, with pretests at the start and post-tests at the end of the year. Commercially available standardized achievement tests were used. At almost every site there were strong aptitude main effects, with pre-post correlations usually above 0.50. There were scattered treatment effects, sometimes favoring experimental groups and sometimes favoring control groups. The US Office of Economic Opportunity used these overall results to issue a report concluding that performance contracting instruction was not better than conventional instruction, on the average.
Fortunately the independent evaluation did not stop there. ATI regressions of post-test on pretest for each grade-site-content combination were computed. There were 232 statistical tests of ATI; of these, many showed an ATI pattern and 40 were statistically significant. Only about 12 would be expected by chance. Of the 40, 17 were clearly disordinal, with the experimental treatment superior for lows and the control treatment superior for highs. But 14 other ATIs were disordinal in the opposite direction, with experimental treatment better for highs and worse for lows. There was no attempt to explain why ATI took different forms in different classes, but at least this is a clear demonstration that overall averages do not yield as informative an evaluation as does the addition of ATI considerations.

Another large-scale evaluation of an Office of Economic Opportunity program shows a striking ATI. The program was called 'Upward Bound'; it consisted of intensive summer courses to prepare underprivileged students for college entrance. The evaluation was conducted by Hunt and Hardt (1967). Hunt is the developer of the cognitive style variable mentioned earlier as conceptual level. Conceptual level represents a person's integrative complexity, interpersonal maturity, and degree of abstractness. Highs show lower stereotypy and greater flexibility in complex situations, more exploratory, creative activity, and more tolerance for stress. Hunt hypothesizes that conceptual level will interact with degree of structure or structural complexity in the environment, with lows needing more structure and highs needing less.

In his evaluation of Upward Bound, there were 1622 students in 21 different summer programs (a 10 per cent national sample of all such programs). Using a program climate questionnaire, with questions like 'when the students make a suggestion, the program is changed', the authors classified programs as predominantly structured or predominantly flexible. This particular split was justified on theoretical grounds, but also because it appeared to be a major dimension of program variation. Independently, each program was classified as having predominantly high CL or low CL students (CL designating conceptual level), using a paragraph completion test about students' thoughts on rules, criticism, etc. This is Hunt's standard method for scoring conceptual level. Thus, the program served as sampling unit.
The evaluation design used change scores on nine measures of program effectiveness. The authors report results only for the six dependent variables on which significant overall pre- to post-increases were found. For four of these, ATI was significant, with high CL programs better if flexible and low CL programs better if structured. The same interactions occurred with two other criteria, though these were not quite statistically significant. One measure showed no ATI but did show a treatment effect favoring structure. The analysis was also conducted using the student as the sampling unit. Here, the same ATI patterns were present, but all statistical tests failed to reach significance.

Again, it should be clear that ATI provides a unique kind of evaluation. It suggests how a program like Upward Bound might be improved through individualization, with different programs designed to suit different students.

A final example comes from the research comparing phonics with whole-word approaches in primary reading. It provides some feeling for the difficulties involved in hypothesizing about ATI. It serves as well to suggest how the problems of general v. special ability and of aptitude development might best be viewed.

Stallings (1970) hypothesized that learning by phonics (PM) requires 'sequencing' ability or sequence memory - the ability to construct and reconstruct strings of letters and sounds from short-term memory. Children deficient in this skill might not learn well from PM, and might show anxious and avoidant reactions as well. Those with sufficient sequencing skill could be expected to do quite well. Stallings reasoned that the whole-word method (WWM) avoids the sequencing requirement and would be preferable for children low in this ability.

But the opposite hypothesis could be argued. If sequencing ability is essential for effective learning, PM might develop this ability by providing practice in sequencing. Word recognition may require the ability even if the WWM learner is never shown how sequences are produced. Students would do well in WWM only to the extent that they already possessed this skill. Hence a positive relation between ability and WWM might be predicted; lows would be expected to do better in PM and perhaps worse in WWM than their more able peers.

To examine these competing hypotheses, three studies were conducted. Two were pilot studies, each with only 20 students randomly divided among treatments. The third was a large-scale study in the public schools; here random assignment was not possible. The pilot studies were conducted in successive years in a small, private elementary school. Each year, the same procedure was followed.
The aptitude measures were new auditory and visual sequencing tests, and the related scales from the Illinois Tests of Psycholinguistic Abilities (ITPA). After aptitude scores were collected, 20 first-graders were divided to form two comparable groups. For the first two months of school, one group was taught using PM materials, while the other group was taught by WWM. At the end of the two months, the California Achievement Test in reading (CAT) was administered to provide criterion information. As another criterion measure, all students were observed periodically using a checklist of behavior suggestive of 'learning avoidance'. Indicators included excessive fidgeting, distracting neighbors, fighting, fooling, chair-rocking, etc.

There were no average differences for treatment. While sample size is very small in the pilot studies, some highly suggestive ATI results emerged. Frequency of learning avoidance was differently related to ITPA visual sequencing skill in the two treatments. The result was consistent across the two pilot years, providing a clear replication. Students with high visual sequencing scores showed more learning avoidance in PM; those with low scores showed more learning avoidance in WWM. Stallings' own visual sequencing test did not provide a similar replication, due to inconsistency of the relation in PM, but the interaction was significant in the second year and corresponds to the pattern found for ITPA visual sequencing that year. For the auditory sequencing measures, scattered interactions appear using CAT as criterion, but these are not consistent.

In the public school study, six PM classrooms used a new curriculum developed by the school system. Six other classrooms were assigned to the older WWM curriculum. Classroom visits confirmed that teachers adhered more or less to the treatment assignment. Again, aptitude measures were taken at the start of the school year. Observations of learning avoidance behavior during reading were collected in September, November, and January. January testing also included the CAT, a reading achievement test based on phonics, and repetition of the sequencing aptitude tests.

Both visual sequencing tests yielded significant ATI roughly consistent with that found in the pilot studies, when learning avoidance served as dependent variable. Highs showed more learning avoidance with PM, while lows showed more avoidance behavior with WWM. Also, both auditory sequencing tests, and Stallings' own visual sequencing test as well, showed significant ATI on CAT and the phonics achievement test. All slopes were positive, with those for PM distinctly more so.
Highs do better in PM than in WWM and, despite the main effect favoring PM, there are several crossovers. This result is not consistent with the pilot data, where several negative slopes were obtained for PM. However, we should consider the larger, public school study as deserving more weight.

Simple regression analyses cannot make parsimonious sense of such a complex of findings, so the data were reanalyzed using multiple regression with fewer, summated variables. A general aptitude was defined as the sum of the four principal pretests, and a special aptitude was defined as the sum of the visual pretests minus the sum of the auditory pretests. The results were as follows. For CAT, general sequencing ability served as a potent predictor regardless of treatment, but the interaction with treatment was significant. The difference between auditory and visual sense modalities added little. Thus, the simple ATI effects reported earlier are to be interpreted as results of a more general aptitude, not skills specific to a particular sense modality. For the linguistically based achievement test, the same ATI for general sequencing ability held. Here, though, differential ability provided additional significant prediction and appeared to offer some interaction with treatment beyond this. Learners with strong auditory skill, relative to visual skill, do especially well in PM. Learning avoidance was not well predicted in general. The data suggest that the ATI reported earlier arose from the functioning of differential ability. Learners with strong visual skill, relative to auditory skill, show more avoidance activity in PM. The reverse is true for WWM.

There is another ATI pattern in the public school data which, though not initially hypothesized, is notable for its implications concerning aptitude development. The four principal aptitude tests were given both pre- and post-treatment. There was no average treatment effect, but there did appear to be ATI when an aptitude pretest using one sense modality was used to predict outcome on an aptitude post-test of the other sense modality. Of the eight possible ATI using pre- and post-tests of different sense modalities, five were statistically significant. None of the intramodality regression tests achieved significance. Some of these ATI patterns appeared to have existed already at the time of the pretest, so it is not clear that the treatments produced these results. But it is striking that the effect of PM was in each case to magnify the existing ATI pattern by producing a steeper slope, pre-post, than had existed pre-pre, while WWM had little effect on the existing, relatively horizontal pre-pre slopes. The implication of this is that PM develops aptitude as well as achievement, while WWM does not.
Thus, the conclusion seems to be that PM makes the strong stronger and the weak weaker, relative to WWM, not only in reading achievement but also in aptitude needed for further reading achievement. PM may also produce a more consolidated auditory-visual skill, perhaps by promoting transfer. In the process, however, it also produces a high frequency of learning avoidance activity in able learners, while WWM has this effect for the less able.

The Stallings study suggests not only that ATI can be used to advantage in individualizing reading instruction but also that it can help in the analysis of the nature and development of aptitude. Its multiple regression methodology also illustrates a means of deciding whether special ability constructs are in fact needed in interpreting ATI or whether a more parsimonious conclusion based on general ability can suffice in any given case.
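The reanalysis strategy described for the Stallings data (summated aptitude composites entered into a multiple regression, with aptitude-by-treatment product terms carrying the ATI) can be sketched on simulated data. Everything below is hypothetical; only the composite definitions follow the text: a general composite as the sum of the four pretests, and a special composite as the visual sum minus the auditory sum.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200

# Hypothetical pretests: two visual and two auditory sequencing scores.
vis1, vis2 = rng.normal(50, 10, n), rng.normal(50, 10, n)
aud1, aud2 = rng.normal(50, 10, n), rng.normal(50, 10, n)
general = vis1 + vis2 + aud1 + aud2        # summated general composite
special = (vis1 + vis2) - (aud1 + aud2)    # modality-contrast composite
treat = rng.integers(0, 2, n)              # 0 = WWM, 1 = PM

# Simulate an outcome whose aptitude slope is steeper under PM (an ATI).
outcome = 0.10 * general + 0.15 * treat * general + rng.normal(0, 1, n)

# Ordinary least squares with product terms; a nonzero product
# coefficient is the regression expression of ATI.
X = np.column_stack([np.ones(n), treat, general, special,
                     treat * general, treat * special])
beta, *_ = np.linalg.lstsq(X, outcome, rcond=None)
# beta[4] estimates the general-aptitude x treatment interaction;
# beta[5] asks whether the modality contrast interacts beyond that.
```

A significance test for beta[4] (for instance, comparing models fitted with and without the product terms) would complete the analysis; the point here is only the structure of the design matrix.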

SUMMARY

If these kinds of findings are representative of what we may expect to find pursuing ATI further, then there must be a radical change in the methodology of educational research and evaluation. No adequate theories of learning and instruction can be developed without taking individual differences into account. No one best method of instruction for all students will ever be found. Evaluation studies will need to revise treatments by differentiating them. These differentiations, as well as existing attempts at individualization, like homogeneous grouping or programmed instruction, will have to be justified by ATI data, and new conceptions of aptitude should also emerge.

Glaser (1972) and others have recently emphasized the importance of detailed task analyses of learning, and of aptitude for learning, to define new kinds of aptitude specific to learning tasks. These analyses should seek a common language for describing aptitude and learning phenomena, perhaps using information processing concepts. Perhaps these new conceptions of aptitude will rely on stylistic and strategy distinctions in addition to traditional tests of aptitude. Perhaps they will be defined from student introspections about learning habits in specific subject-matters. Even among traditional ability tests, new distinctions seem likely to clarify the nature of aptitude; Cattell's (1963) distinction between fluid and crystallized intelligence is one promising suggestion for further ATI work, for example, since no one to my knowledge has ever considered what kind of teaching might be good for students high in fluid ability but low in the more typically measured crystallized ability.

Finally, it is likely that aptitudes will increasingly be regarded also as outcomes of instruction, so new measures will be needed to diagnose transfer and learning-to-learn as outcome evaluations as well as to tap new kinds of aptitude as input. Whether any given aptitude is modifiable or not is an empirical question.
If it is easily modifiable, instructional treatments can be adapted to improve it. If it is not, instructional treatments must be adapted to fit it. Either way, ATI considerations will play a fundamental role in evaluation.

246

Richard E. Snow

One last general point can serve as summary. ATI is actually only the application of Darwinian thinking to education. Different learners require different learning environments - some will thrive in one environment and fail in another. Some others will be better off in the second environment but will be unable to live and prosper in the first. It will be our task to design the aptitude measures and the instructional treatments that will help capitalize on this phenomenon for the benefit of all learners.

REFERENCES

Berliner, D.C., & Cahen, L.S. (1973) Trait-treatment interactions in learning. In Kerlinger, F.N. (ed.), Review of research in education. Itasca, Ill.: Peacock.
Bracht, G.H. (1970) Experimental factors related to aptitude-treatment interactions. Rev. educ. Res., 40, 627-45.
Cattell, R.B. (1963) Theory of fluid and crystallized intelligence: A critical experiment. J. educ. Psychol., 54, 1-22.
Cronbach, L.J. (1957) The two disciplines of scientific psychology. Amer. Psychol., 12, 671-84.
Cronbach, L.J., & Snow, R.E. (1969) Individual differences in learning ability as a function of instructional variables. Final Report, U.S. Office of Education, Contract DEC 4-6-06129-127. School of Education, Stanford University, Stanford, Calif.
Cronbach, L.J., & Snow, R.E. (in press) Aptitudes and instructional methods: A handbook for research on interactions. N.Y.: Appleton-Century-Crofts.
Dunham, J.L., & Bunderson, C.V. (1969) Effect of decision-rule instruction upon the relationship of cognitive abilities to performance in multiple-category concept problems. J. educ. Psychol., 60, 121-25.
Glaser, R. (1972) Intelligence, learning and the new aptitudes. Learning Research and Development Center, University of Pittsburgh.
Guilford, J.P. (1967) The nature of human intelligence. New York: McGraw-Hill.
Hunt, D.E., & Hardt, R.H. (1967) The role of conceptual level and program structure in summer Upward Bound programs. Paper presented at the Eastern Psychological Association, Boston.
Ray, H.W. (1972) Final report on the Office of Economic Opportunity experiment in educational performance contracting. Battelle Laboratories, Columbus, Ohio.
Snow, R.E. (1972) Personal-intellectual differences and individualized alternatives in higher education. Paper presented to the GRE Board Invitational Conference on Cognitive Styles and Creativity in Higher Education, Montreal, Canada.
Snow, R.E. (1973) Representative and quasi-representative designs for research on teaching. Paper presented to the American Educational Research Association, New Orleans, Louisiana.
Stallings, J.A. (1970) Reading methods and sequencing abilities. Unpublished doctoral dissertation, Stanford University.
Witkin, H.A. (1972) The role of cognitive style in academic performance and in teacher-student relations. Paper presented to the GRE Board Invitational Conference on Cognitive Styles and Creativity in Higher Education, Montreal, Canada.

GAVRIEL SALOMON

17

The Hebrew University of Jerusalem

ATI research: For better psychological insights or for better educational practice?

The combination of two research disciplines - the differential and experimental - has resulted in the Aptitude-Treatment-Interaction (ATI) model. This model appears to have done great service to educational testing and to educational research. It is showing, however, the first signs of a discrepancy between two tendencies growing within it which pull in two different directions. This is a discrepancy between ATI research which aims at better understanding of how individual differences affect learning in specific types of environments (see for example Berliner & Cahen, 1972), and ATI research which aims at better adaptation of instruction to individual differences (Cronbach, 1967). Since each kind of ATI research is differently construed, serves different goals and fulfils different expectations, confusing the two - as is commonly done - may be quite harmful. It is to this discrepancy that I want to address myself.

THE CONTRIBUTION OF THE ATI MODEL

What did the ATI paradigm contribute to the field? First, it taught us what questions not to ask, namely those dealing with the elusive 'best' instructional treatments. Indeed, with the increasing number of ATI findings one feels less and less inclined to overlook the existence of relevant individual differences. Complex, and far more representative, interactions are sought after. Secondly, ATI led us to clarify the ill-defined concept of 'individualization'. New modes of individualizing instruction, far removed from those
developed in the tradition of programmed instruction¹, could be suggested. Thirdly, in light of the ATI approach, the measurement of individual differences assumed a new role. Until ATI came around there was very little indeed that researchers in the field of instruction could do with the large quantities of differential data. Testing served mainly for the evaluation of achievements, or for diagnosis. Once diagnosis was accomplished, however, there were no clear ideas as to how a student should be taught, and what kind of treatment would serve him best. Finally, and most important, the search for ATIs forced us to study instructional treatments in terms of the psychological functions they accomplish for different types of learners. The sheer measurement of effects, as inferred from the comparison of mean achievement scores, was no longer enough. One felt the need to gain insight into the whys and the hows of instruction. Thus, the researcher had to justify why treatment 'A' was supposed to be more beneficial for one type of learner and why treatment 'B' was more beneficial for another. A number of possible psychological functions accomplished by treatments, such as compensation, preference and remediation, could therefore be suggested (Snow, 1970; Salomon, 1972a). This is perhaps the greatest achievement of ATI research: it turned our attention away from simplistic operational descriptions and definitions of treatments and refocused it on underlying psychological functions, as so strongly advocated by Melton (1964).

THE NATURE OF THE DISCREPANCY

In another paper (Salomon, 1972a) I have claimed that while the search for explanatory constructs, that is, the search for psychological functions, is essential for theory construction, it is of little help to the world of education. Hence the discrepancy between theory- and practice-oriented ATI research. If ATI work is to deepen our understanding of how and why certain human aptitudes interact with environments, if it is to deepen our understanding of the essential attributes and differential functions of treatments, this work has to be extremely 'clean', bordering on the sterile kind of research. Indeed, it is then the internal validity of the studies which is the crucial component. It would be quite impossible to reach any acceptable conclusions concerning the whys of obtained ATIs if both the measurement of individual differences and the explication of treatment variables are not extremely stringent. Moreover, to be able to identify a particular psychological function accomplished by a treatment with a particular type of learner, variables have to be purified and stripped to their essentials. How else can one claim that it was one or another specific ingredient which compensated for, say, a visual deficiency? When treatments are complex, multi-ingredient composites, the researcher's inferences become somewhat fuzzy, as Tallmadge and Shearer (1969, p. 229), for instance, concede:

    The difference between meaningful rules and arbitrary rules is only one of many differences which existed between the Transportation Technique and Aircraft Recognition subject-matter areas. Any of these differences could have been responsible for the reversal relationship between learner characteristics and instructional methods.

Thus, Shulman (1970, p. 374) was right when he said that 'ATIs are likely to remain an empty phrase as long as aptitudes are measured by micrometers and environments by divining rod'. If it is not to remain an empty phrase, ATI work has to learn how to deal with environments, or treatments, by means of micrometers as well. This, in turn, will enable us to understand better how and why environments interact with different aptitudes. This, however, is the cause of the discrepancy between theory- and practice-oriented ATI research. The finer our micrometers and the deeper we look into the differential psychological functions of treatments, the farther we move from the real world of education. We may gain in accuracy and psychological insight, but we lose ecological validity and representativeness, as recently discussed and analyzed by Snow (1973). When the latter are absent, no educational decisions can be based on the research.

1. See the critique of 'individualization', as developed in the tradition of programmed instruction, made by Oettinger and Marks (1968).
There are two reasons why the ATI research, described above, can serve mainly theory-oriented but not practice-oriented purposes. One reason is that it is much too specific. The other, and related reason, is that it does not take into consideration the complexity of educational practice.

AN EXAMPLE

Let us start with the question of specificity with a slightly exaggerated example. Imagine a teacher who wishes to devote his class period to a literary text. He starts out by asking the students to study the text. To be most effective he has to divide his class into a group of good and a group of poor memorizers. The former are asked, following Berliner (1971), to take notes; the latter, to answer test-like questions. However, since the teacher wants his students to detect problems in the text, as well as know it, he will have to divide them again, following Salomon and Sieber-Suppes (1972), into verbally able and less-able groups. The former will get an unstructured, randomly sequenced text, the latter a well-structured one. Following the exposure to the text, students have to solve the problems. To follow Dowalby (1971), the class must now be divided into a high-anxiety group, to be given individual, student-centered attention, and a low-anxiety group, to be exposed to a teacher-centered approach. After that, the teacher wishes to teach his students how to evaluate such a text critically. He wants them to learn how to raise critical questions. He now follows the work by Koran, Snow and McDonald (1971): he divides the class into a more field-dependent group, to learn question-asking from a written model, and a less field-dependent group to watch a live model. This description carries the argument to absurdity. Yet it illustrates the need to divide and continuously redivide students into different groups whenever instruction shifts gears, as long as instruction is based on highly specific findings.
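The regrouping burden compounds multiplicatively. A minimal sketch can make this concrete: crossing just the four dichotomies of the example above already yields 2^4 = 16 distinct aptitude profiles, each of which would in principle need its own sequence of treatments. (The aptitude labels below are simply those of the example; the code is illustrative only.)

```python
from itertools import product

# The four hypothetical dichotomies from the example above
# (memory, verbal ability, anxiety, field dependence).
aptitudes = {
    "memory": ("good", "poor"),
    "verbal": ("able", "less able"),
    "anxiety": ("high", "low"),
    "field_dependence": ("more", "less"),
}

# Every combination of levels is a distinct aptitude profile,
# i.e., a distinct instructional grouping the teacher must manage.
profiles = list(product(*aptitudes.values()))
print(len(profiles))  # 2**4 = 16 groupings from only four dichotomies
```

Each additional dichotomous aptitude doubles the count, which is the combinatorial core of the absurdity.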

THE LIMITATIONS OF GENERALIZATION

But ATI research, as many will claim, is cumulative, and highly specific findings could show more generalizable trends and commonalities. Indeed, to an extent, this is correct. Cronbach and Snow (1969), summarizing a number of ATI studies, pointed out that while for the low ability learner appropriate treatments provide some compensatory-conciliatory aid in stimulus differentiation, the high ability learner learns more when provided with a wide associational latitude. Berliner and Cahen (1973) have pointed to another possible generalization. On the basis of a number of studies which they have surveyed they found that anxiety appeared to interact consistently with variations on
instructional structuredness. Thus, more anxious learners benefited more from a structured environment, while those less anxious learned better in a more open environment. In my own research (see Salomon, 1972b) I find that students with poor mastery of a skill-to-be-learned are more successful when shown very explicit models which visually supplant the skill to be learned. Students with better initial mastery learn the skill better when having to activate it covertly in their minds. As one can note, these three generalizations are in overall agreement with each other. However, such inductions, heuristic as they are, have only a limited value. This takes us to the second reason for the relative detachment of the purified ATI research. As long as the experiment is done under favorable conditions, and treatments are simple and 'pure', the specific aptitudes which interact with them may account for reasonable portions of the learning variance. (In our series of studies, mentioned above, we have accounted for up to 28 per cent of learning variance.) But when the treatments are taken out of the well-controlled laboratory and transferred into the real world of education, the amount of variance accounted for decreases dramatically. Sullivan, Okada and Niedermeyer (1971) examined the effectiveness of two methods of word-attack instruction for beginning readers. One was the single-letter approach, the other the letter-combination approach. The treatments lasted for 27 days. They found a significant interaction between ability level and treatments: low-ability Ss learned better with the single-letter method while those of high ability learned better with the letter-combination method. This finding is in keeping with the findings of Salomon and Sieber-Suppes (1972), and is in agreement with the generalization offered by Cronbach and Snow, which was mentioned above.
However, our examination of the data reveals that the interaction between ability and methods accounted for no more than 6 per cent of the total post-test variance. In a study by Tallmadge and Shearer (1971) an ATI was found between anxiety and inductive and deductive teaching methods. The amount of post-test variance accounted for by the interaction was, however, only about 2 per cent. There may be other studies in which real instructional treatments were employed and where more of the learning variance was accounted for by the ATI. They would be the exceptions rather than the rule, however. The decrease in the amount of variance accounted for by any specific aptitude, once under real-life conditions, is apparently unavoidable since
educational treatments are by their very nature multivariate and composed of a great many components. It thus appears that ATI research, which is done under contrived and well-controlled conditions, cannot easily serve educational decisions. If it aims at better understanding of instructional functions, it cannot really be expected to serve also the purpose of adapting real environments to learners. Here then is the dilemma: if real-life treatments are to be manipulated and more general abilities to be dealt with, theoretical and explanatory power is lost; if more purified variables are to be dealt with, reference to the practice of education is lost. The conflict between the two approaches cannot be resolved by a linear progression from the more basic research to the applied, as suggested by Hilgard (1964) and later echoed by Berliner and Cahen (1972, 1973). The reason is, as I have tried to show, the decreasing contribution of any specific aptitude to the overall outcomes.
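The 'per cent of variance accounted for' figures quoted above are proportions of the total sum of squares attributable to the ability x method interaction in an analysis of variance. The sketch below computes that proportion (eta squared) for a balanced 2x2 design; all scores are invented for illustration, and the crossover is deliberately exaggerated far beyond the 2-6 per cent found in the classroom studies cited.

```python
# Balanced 2x2 ATI design: ability (low/high) x method (single/combo).
# All scores are hypothetical, chosen to show a disordinal interaction.
scores = {
    ("low", "single"): [14, 15, 13, 14],
    ("low", "combo"): [10, 11, 9, 10],
    ("high", "single"): [12, 13, 11, 12],
    ("high", "combo"): [16, 17, 15, 16],
}

def mean(xs):
    return sum(xs) / len(xs)

n_cell = 4
all_y = [y for ys in scores.values() for y in ys]
grand = mean(all_y)

cell = {k: mean(v) for k, v in scores.items()}
row = {a: mean([cell[(a, m)] for m in ("single", "combo")]) for a in ("low", "high")}
col = {m: mean([cell[(a, m)] for a in ("low", "high")]) for m in ("single", "combo")}

# Interaction sum of squares: cell-mean deviations remaining after
# both main effects (row and column) have been removed.
ss_inter = sum(
    n_cell * (cell[(a, m)] - row[a] - col[m] + grand) ** 2
    for a in ("low", "high")
    for m in ("single", "combo")
)
ss_total = sum((y - grand) ** 2 for y in all_y)

eta_sq = ss_inter / ss_total  # proportion of post-test variance due to the ATI
print(f"interaction accounts for {100 * eta_sq:.0f} per cent of total variance")
```

In real classroom data the within-cell noise and the many uncontrolled treatment components inflate `ss_total` relative to `ss_inter`, which is exactly the shrinkage from 28 per cent in the laboratory to 2-6 per cent in the field that the text describes.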

A POSSIBLE WAY TO BRIDGE THE GAP

There may be another way to bring the two approaches somewhat closer to each other, as exemplified by the work of Hunt (1972a, 1972b). One of the major features of his work is the kind of 'aptitude' he deals with. It is the construct of Conceptual Complexity (Harvey, Hunt & Schroder, 1961). This 'aptitude', unlike the mastery of specific mental skills, mastery of knowledge, anxiety, memory, or spatial ability, is very pervasive and can be expected to be related to nearly every phase and act of learning. Moreover, as Hunt shows, this measure suggests which contemporaneous as well as developmental environments should be matched with which level of Conceptual Complexity. Even more important is the fact that this differential measure has the potential of interacting with highly specific treatments, and with whole school environments as well. It should be noted, in addition, that Conceptual Complexity is not related to general intelligence (see Karlins, 1967), and hence does not reflect the accumulation of specific school-related knowledge (McClelland, 1973) that might otherwise mask ATIs. Indeed, the attributes that make the discovery method more beneficial to the conceptually complex learner and the expository method more beneficial to the less complex one are overriding attributes of those treatments. Since Conceptual Complexity is also highly pervasive, the power of the interaction - obtained in a laboratory study - is not lost when
the treatments are carried into the classroom. As the data show, it does not seem to be lost even when whole schools are designed around the treatments' essential features.

SUMMARY

It appears, then, that when a genuinely pervasive aptitude is found, it may interact with equal power with small-scale, highly purified treatments as well as with gross environmental ones. Furthermore, since it is grounded in theory, it provides explanatory principles concerning the types of treatments, or environments, which are most beneficial for one or another kind of learner. In this sense, the discrepancy between the two types of ATI research can be resolved. In many other cases, where more specific and somewhat less pervasive aptitudes are measured, and where highly purified treatment variables are studied, findings cannot be expected to lead to educational decisions. Their real importance lies in improving our understanding of psychological functions. Similarly, where whole curricula are studied, complex packages as they are, much can be gained in terms of educational decision making. However, one should not expect such studies to deepen our understanding of the hows and whys. It is apparently only in rare cases, as in Hunt's studies, that the discrepancy can be bridged.

REFERENCES

Berliner, D.C. (1971) Aptitude-treatment-interaction in two studies of learning from lecture instruction. Paper presented at the annual meeting of the American Educational Research Association, New York.
Berliner, D.C., & Cahen, L.S. (1972) Some continuing conceptual and methodological problems in trait x treatment interaction research. Paper presented at the annual meeting of the American Psychological Association, Honolulu.
Berliner, D.C., & Cahen, L.S. (1973) Trait-treatment interactions and learning. In Kerlinger, F.N. (ed.), Review of research in education. Peacock Publishers, Inc.
Cronbach, L.J. (1967) How can instruction be adapted to individual differences? In Gagné, R.M. (ed.), Learning and individual differences. Charles Merrill.
Cronbach, L.J., & Snow, R.E. (1969) Individual differences in learning ability as a function of instructional variables. Final Report, USOE, Stanford Univ. School of Education.
Dowalby, F.J. (1971) Teacher-centered v. student-centered mode of college classroom instruction as related to individual differences. Unpublished Master's Thesis, Univ. of Mass.
Harvey, O.J., Hunt, D.E., & Schroder, H.M. (1961) Conceptual systems and personality organization. New York: John Wiley and Sons.
Hilgard, E.R. (1964) A perspective on the relationship between learning theory and educational practice. In Hilgard, E.R. (ed.), Theories of learning and instruction. The Sixty-Third Yearbook of the NSSE.
Hunt, D.E. (1972a) Learning styles and teaching strategies. Paper presented at the National Council for the Social Studies, Boston.
Hunt, D.E. (1972b) Matching models for teacher training. In Hunt, D.E. (ed.), Perspectives for reform in teacher education. Prentice Hall Inc.
Karlins, M. (1967) Conceptual complexity and remote associative proficiency as creative variables in a complex problem solving task. J. Pers. soc. Psychol., 6, 264-78.
Koran, M.L., Snow, R.E., & McDonald, F.J. (1971) Teacher aptitude and observational learning of a teaching skill. J. educ. Psychol., 69, 219-29.
McClelland, D. (1973) Testing for competence rather than for 'intelligence'. Amer. Psychol., 28, 1-15.
Melton, A.W. (1964) The taxonomy of human learning: Overview. In Melton, A.W. (ed.), Categories of human learning. Academic Press.
Oettinger, A., & Marks, S. (1968) Educational technology: New myths and old realities. Harvard educ. Rev., 36, 697-18.
Salomon, G. (1972a) Heuristic models for the generation of aptitude-treatment-interaction hypotheses. Rev. educ. Res., 42, 327-43.
Salomon, G. (1972b) Can we affect cognitive skills through visual media? An hypothesis and initial finding. AV Commun. Rev., 20, 401-23.
Salomon, G., & Sieber-Suppes, J.E. (1972) Learning to generate subjective uncertainty: The effects of training, verbal ability and stimulus structure. J. Pers. soc. Psychol., 23, 163-74.
Shulman, L.S. (1970) Reconstruction of educational research. Rev. educ. Res., 40, 371-97.
Snow, R.E. (1970) Research on media and aptitudes. In Salomon, G., & Snow, R.E. (eds.), Commentaries on research in instructional media. Viewpoints, 46, 63-91.
Snow, R.E. (1973) Representative and quasi-representative designs for research on teaching. Paper read at the Annual Meeting of the American Educational Research Association, New Orleans.
Sullivan, H.J., Okada, M., & Niedermeyer, F.C. (1971) Learning and transfer under two methods of word-attack instruction. Amer. educ. Res. J., 8, 227-41.
Tallmadge, G.K., & Shearer, J.W. (1969) Relationships among learning styles, instructional methods and the nature of learning experiences. J. educ. Psychol., 60, 222-31.
Tallmadge, G.K., & Shearer, J.W. (1971) Interactive relationships among learner characteristics, types of learning, instructional methods and subject matter variables. J. educ. Psychol., 62, 31-39.

HANS F. CROMBAG

18

University of Leyden

Product and process in teaching and testing

ON EDUCATIONAL OBJECTIVES

All education is directed towards two general classes of objectives: the acquisition of knowledge, and the acquisition of skills in using knowledge. These two classes of objectives are general since knowledge acquisition, for instance, may refer to the simple memorization of a specific fact, e.g. the name of the founder of evolution theory, as well as to assimilating complex theoretical frameworks, e.g. evolution theory itself. The class of skill objectives is also broad; by 'skill' we may refer to the application of a simple general rule to a particular incident as well as to complex reasoning processes like those involved in medical inquiry or legal decision making. At the outset one more point about the concept of skill should be made clear. Usually a sharp distinction is made between psychomotor skills on the one hand and mental, or intellectual, skills on the other, and the two areas have generally been treated separately in psychological research and theory. I think such a distinction is unnecessary, since 'all skilled performance is mental in the sense that perception, decision, knowledge and judgment are required' (Welford, 1968, p. 21). In this paper I will make no distinction between mental and psychomotor skills; although I will be dealing primarily with mental skills, it should not be excluded that, as for example in medical inquiry, psychomotor actions occasionally constitute elements in the process of skilled performance. The distinction between knowledge acquisition and the acquisition of skill as objectives for education, however obvious, is difficult to maintain strictly. If knowledge is defined very narrowly as the ability to reproduce precisely what was presented at some earlier time, the objective becomes
uninteresting. In this restricted definition, knowledge is at best a means, not a goal in itself. Usually some understanding of what is learned is expected from students, but to understand something is to be able to do something with it, to manipulate the content in at least a virtual way. Bloom (1968) defines the most simple form of comprehension as translation: changing the form of a knowledge element without changing its content or meaning. Furthermore, the transition from virtual manipulation of knowledge to the application of knowledge to particular incidents is rather small (Bloom, Hastings, & Madaus, 1971). So the distinction between knowledge acquisition and doing something with acquired knowledge is difficult to maintain strictly in educational practice. The transition is gradual. Let us nevertheless, for the sake of argument, make a sharp distinction between the two. Then one could say that education directed toward knowledge acquisition is product-oriented. All we want from the student is that he learns the material presented to him, and that he is able to reproduce it as accurately as possible upon demand. As long as the product, i.e., precise reproduction, is correct we are satisfied. In school we are satisfied with a correct product no matter what odd association process or silly mnemonic is used to retrieve the answer from memory. In real life, toward which all learning supposedly is directed, we are even satisfied if the answer is not retrieved from memory, but looked up in a source book or obtained from a colleague. In knowledge acquisition, if the student gives an incorrect answer all we do is simply correct him; this correction in itself is a positive learning experience. From hearing the right answer the student learns; the already available correct answer is reinforced, or the available incorrect answer is replaced by the correct one in the student's memory.
How education directed toward the acquisition of skills differs from the situation depicted above may best be illustrated by an example. Suppose we teach students of medicine the anatomy and pathology of the groin area. We want them to learn to use this knowledge in diagnosing patients with swellings in that area of the body. Traditionally in medical education this is done by way of patient demonstrations; the teacher examines, in view of the students, a number of patients with swellings in the groin area, and while doing so he talks, explaining what he is doing and why. Now suppose that afterwards we want to test whether the student has indeed learned the desired skill. We present him with a patient who has a
swelling in the groin area, asking him to make a diagnosis of the patient's illness. Now our interest in the correct product, the diagnosis, is only secondary. Primarily we are interested in the reasoning process which precedes the production of the product: the process of inquiry. We want that process to be rational, because in the future we want the student to be able to repeat the process, to communicate to others his solution and the way in which he derived it, to defend his decisions in a rational discussion, to correct his decisions at the first sign that somewhere a mistake has been made, and to adapt them in the light of new evidence. Rational and explicit reasoning processes are easier to repeat, communicate, discuss and correct than intuitive and impressionistic processes. When testing a student on his skill in diagnosing swellings in the groin area we do not simply ask for his final diagnosis, e.g., 'this is a case of hernia femoralis'; we want a record of the whole inquiry process: which data he gathered, how and why he did so, which alternatives he took into consideration, the pros and cons of these alternatives, and how he finally decided on one of the alternatives. If the student finally comes up with an incorrect solution, a simple correction of the product is not, as in knowledge acquisition, a positive learning experience in itself. At best, the student learns from such correction only that somewhere along the line something apparently went wrong. But he does not know what and where. In skill acquisition, to make the student learn we must have process-oriented teaching and testing.

KNOWLEDGE ACQUISITION AND REPRODUCTION AS A PROCESS, AND THE ROLE OF INDIVIDUAL DIFFERENCES

In education directed toward knowledge acquisition we are product-oriented; there is an input, and after some lapse of time there is an output, and we want input and output to be similar. However, between input and output something happens; the student does something. He stores the presented information, and afterwards he retrieves it from memory. Consequently some processing of the information takes place. Taking the Atkinson and Shiffrin (1968) model of memory as a paradigm, we can say that the information presented to the learner has to be registered by one of the sensory registers, and transferred into the short-term store (STS). In STS it has to be kept for some time in order to be recognized, i.e., compared with elements retrieved from the
long-term store (LTS), and to be encoded; finally it has to be transferred into LTS and stored in some 'appropriate' place, where later it can easily be retrieved. When a student is tested for his knowledge by way of a stimulus question, this question has to be brought into STS and recognized, a search process in LTS has to be initiated, and elements from LTS have to be retrieved, brought into STS, and tested against the stimulus question, before the response can be made. I give this elaborate description - and even this description can be said to be superficial - to make very clear that knowledge acquisition and subsequent knowledge reproduction constitute a sequence of mental operations, that this sequence can rightfully be called a process, and that performing this process can rightfully be called a mental skill.

Is it wise to leave such a complicated process to the student? It has been argued many times that the way in which knowledge is acquired, and especially the way in which the information is organized when it is stored in LTS, is important for later retrieval (Mandler & Pearlstone, 1966; Mandler, 1967). 'Organization' in this context means the grouping of information elements in categories and networks, and the formation of hierarchies. 'More organization, more recall' is a frequently heard proposition that seems to be corroborated by a large number of empirical studies, e.g., the already mentioned study by Mandler and Pearlstone, but also, for example, by studies of Underwood (1964), Jenkins and Russell (1952), and Tulving (1962).

There are in this area, however, conflicting results, as Anderson (1972) and Wood (1972) have recently pointed out. Postman (1970), who failed to find a strong relationship between a measure for (objective) organization and a measure for free recall in a serial learning task, indicated that inter-item associations may play an important role in recall.
Postman's position is close to that taken by Anderson (1972), who states that networks of interword associations constitute powerful retrieval systems. Now I think that the discussion on whether organization does or does not enhance recall is due to the fact that frequently an insufficiently clear distinction is made between rational, experimenter-defined organization, on which measures for organization are almost always based, and subjective, usually idiosyncratic organizations used by subjects (Ss) in learning tasks. The associative networks which Ss build during serial learning tasks are in fact organizations of an idiosyncratic nature, imposed by the Ss on the material to be learned. For outsiders these subjective organizations

260

Hans F. Crombag

may look chaotic; for a particular S, however, they may constitute retrieval systems as powerful as the highly rational hierarchical structures which experimenters have in mind when they design learning tasks. I think this is also the position taken by Wood (1972). From the point of view of a teacher, one might argue that however effective idiosyncratic organizations may be for later recall, rational organizational principles are to be preferred for scientific information, and scientific information is, after all, what in education we are most of the time trying to transmit to students. Knowledge acquisition, however, does not take place in a vacuum; new information has to be inserted into already existing knowledge structures (Underwood, 1964). These pre-existing knowledge structures are products of the student's past experience and personal history. The extent to which idiosyncratic organizational principles, based on a student's personal history, take preference over rational organizational principles depends on whether the material to be learned is completely new to the student and remote from his past experience, or whether the material or components in it are already familiar to the student. If one asks Ss to memorize a list of normal, everyday-language words, presented to them repeatedly in a random order, Ss tend to form associative networks or groups of words of a highly idiosyncratic nature, like those given by Anderson in the appendix to his 1972 article. Trying to memorize a list of words among which were the words 'sideburn', 'beard', 'lieutenant', and 'dignitary', an S tried to do this by saying to himself: 'The lieutenant has sideburns, the dignitary has a beard'. Those associations are not too far-fetched, but other, more rational groupings could have been made from the total list.
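The 'measure for (objective) organization' used in studies like Postman's is typically a clustering index computed over the recall order. A minimal sketch of its raw ingredient, counting adjacent same-category recalls, follows; the word-category assignments and recall protocols are purely illustrative, not materials from any study cited here:

```python
# Raw ingredient of an 'objective organization' (clustering) score:
# the number of adjacent same-category pairs in a recall protocol.
# Word-category assignments below are illustrative only.

CATEGORY = {
    "yacht": "vehicle", "tilt cart": "vehicle",
    "thumb screw": "body", "umbilical cord": "body",
    "briefcase": "container", "biscuit tin": "container",
}

def category_repetitions(recall_order):
    """Count adjacent pairs recalled from the same category."""
    pairs = zip(recall_order, recall_order[1:])
    return sum(CATEGORY[a] == CATEGORY[b] for a, b in pairs)

clustered = ["yacht", "tilt cart", "thumb screw", "umbilical cord",
             "briefcase", "biscuit tin"]
scattered = ["yacht", "thumb screw", "briefcase", "tilt cart",
             "umbilical cord", "biscuit tin"]
print(category_repetitions(clustered))   # -> 3
print(category_repetitions(scattered))   # -> 0
```

Standard clustering indices then adjust this raw count for the number expected by chance; note that a subject using an idiosyncratic organization can score near zero on such an experimenter-defined measure and still recall well.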
In an experiment in which I myself have taken part (Langerak, 1973), which was really a pilot study for a larger one we are still working on, we found similar results. We presented our Ss five times with a list of 20 words, each time in a different order. The list contained words belonging to one of four categories. After the presentation was over, we asked our Ss to perform three tasks, one after another. In task 1 we gave the Ss a list of 60 words, similar to those in the learned list. Among those 60 words there were 16 belonging to the same categories as those in the learned list. The Ss were asked to check in the list the 16 words which 'had anything to do with the words in the learned list'. In task 2 we asked our Ss to describe in their own words how they had done task 1. The answers given were scaled in such a way that Ss who

Product and process in teaching and testing

mentioned all four categories obtained the highest score, and Ss who did not mention any of the categories obtained the lowest score. As task 3 the Ss were asked to reproduce the learned words as they remembered them. The four categories in the list were vehicles (yacht, tilt cart, etc.), words in which parts of the human body are mentioned (thumb screw, finger print, umbilical cord, etc.), words indicating things in which something can be kept (briefcase, gas tank, biscuit tin), and, finally, words of English origin used as such in the Dutch language (paperclip, tearoom, babysitter, etc.). Although the four categories seem not too difficult to identify, many Ss failed to do so. Instead they made their own associative networks. Many combined 'babysitter' and 'umbilical cord', both from the original list, with 'patient care', a word from the list of task 1. Intuitively these words indeed seem to have something in common, but it is almost impossible to give a name to the common element. Also very strongly connected were the words 'paperclip' and 'briefcase', both from the original list. Another result I would like to mention is the relation between the extent to which the Ss had found the experimenter-defined categories (task 2) and recall (task 3). The correlation was only 0.11, which again indicates that subjective, idiosyncratic ways of organizing material to be studied may constitute retrieval devices as powerful as rational, experimenter-defined categorizations. We may conclude that in learning tasks in which already familiar material is presented, Ss often prefer idiosyncratic principles of organization to more rational ones. Attempts at imposing rational principles of organization on students might, in such cases, only cause interference effects, and might consequently hinder learning instead of helping it. Things may be quite different, however, if the material presented to students is new or mostly new to them.
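The correlation of 0.11 mentioned above is an ordinary product-moment coefficient. A self-contained sketch follows, with invented task-2 and task-3 scores chosen only for illustration; they are not the experiment's data:

```python
# Product-moment correlation between 'categories found' (task 2) and
# 'words recalled' (task 3). The scores below are invented for
# illustration; they are not the experiment's data.

from math import sqrt

def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

categories_found = [0, 1, 1, 2, 3, 4]       # hypothetical task-2 scores
words_recalled   = [12, 9, 15, 11, 10, 14]  # hypothetical task-3 scores
print(round(pearson_r(categories_found, words_recalled), 2))  # -> 0.11
```

A coefficient this close to zero means that finding the experimenter's categories bought these (hypothetical) subjects essentially nothing in recall.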
For example, in teaching law students the theory of torts a teacher may try not only to teach the relevant concepts, but also to present these concepts in such a way that the students organize them in a highly rational way. In another experiment in our institute, Cohen and De Gruijter (1973) gave law students 34 concepts from the theory of torts, printed on separate cards. They asked them to sort the cards according to similarity of meaning, using as many or as few categories as they wished. It would lead us too far to discuss this experiment in detail here. Let me simply report the one result that is most relevant: the groupings made by the students were far from idiosyncratic. Although different students made
different groupings, the majority of the students made quite rational and logical groupings. Concepts were grouped together because in the theory of torts they are (1) synonyms or (2) antonyms, because in the theory of torts they are (3) causally or (4) hierarchically related, or because in the theory they constitute (5) cumulative or (6) subsidiary conditions for certain legal qualifications. These six organizational principles explained the vast majority of the groupings made by the Ss. The neat organization which was found was undoubtedly due to earlier education, in which the concepts apparently were presented in an organized way. The teacher-given organization probably survived in the students' minds because the material is rather remote from everyday life experience. Let me try to sum up. Looking at knowledge acquisition somewhat more closely, we have found that a complicated mental process is involved. The central parameter affecting this process appears to be the degree of organization which is given to the material to be learned. This central parameter is, however, highly sensitive to individual differences; the organization given to material to be learned is in many instances at least partly idiosyncratic, due to differences in personal history and background, and is therefore difficult for outsiders like teachers to predict and understand. In teaching, I think, we should allow for individual differences in the way students organize subject matter. Organization of subject matter enhances recall because search strategies in memory can only operate on the basis of some organization (Shiffrin, 1970). So we should stimulate students to organize material to be learned. At the same time we should, however, avoid imposing one particular type of organization on all students.
Rather, we should try to make them aware of the variety of organizational principles that can be applied to any part of the subject matter; we should stimulate them to experiment with organizational principles and to organize and reorganize learned material in as many different ways as they can possibly think of. This implies that for knowledge transmission highly teacher-organized methods like programmed instruction should be avoided. In testing knowledge acquisition we should make simple rote memorization of subject matter even more ineffective than it already is, by never asking for verbatim reproduction or recognition of subject matter. Instead we should favor questions which require some form of reorganization of the information as originally presented.


PROBLEM SOLVING AS A PROCESS AND THE ROLE OF INDIVIDUAL DIFFERENCES

Knowledge acquisition is never the sole objective of education. We always want the students to be able to do something with their knowledge. In this paper I want to concentrate on the use of knowledge in complex problem solving, particularly in real-life situations, like diagnosing a patient's illness, deciding a legal dispute, or designing a machine meeting certain specifications. Typical of these types of problem solving is that they must be performed in stages, with the whole reasoning process taking hours and sometimes even days to be executed completely. If we want students to learn to perform these skills, we need a description of the reasoning process which leads from problem to solution, because it is that very process which is our educational objective. We want students to be able to perform that process repeatedly for different, but similar, problems. Teaching for problem-solving skills should be process-oriented, because we do not want to teach simple problem-solution connections; we want the student to be able to construct his own solutions, and to know explicitly how this is done. I think Gagné's statement, that 'unless something has gone wrong, the achievement of a solution to one member of a class of tasks should mean immediate generalizability to any other member of the class', holds only if two conditions are met: (a) the problem solver has himself performed all mental operations leading from problem to solution, and (b) he has explicit knowledge about these operations. The first condition is probably self-evident. For the second, I would like to refer to Katona's (1967) observations of Ss who learned by mechanical repetition the solution of a card trick without understanding the relation between the task and the solution. Katona says: 'This learning had sometimes the advantage of speed . . . but it had the disadvantage that it did not spread.' So there was no transfer of learning.
In skill training our objective is the transmission of the process. For designing an educational procedure to train students in performing a particular skill, we need a clear description of the process involved. We cannot get a complete description by simply observing skilled problem solvers, because most if not all of the process occurs within the heads of the people we want to observe. And when we ask skilled problem solvers for formal descriptions of what they do, the descriptions they give are usually very fragmentary. The formal side of the work has become a highly automated routine over time, and the problem solver is almost completely preoccupied by the content of the problems.

One might try to get at a description of the reasoning processes involved in complex problem solving by making experts think aloud and by studying the resulting protocols. Well known in this respect is the work of a group working in medical education at Michigan State University (Elstein et al., 1972). An important conclusion of their work on the medical inquiry process is that physicians do not work systematically and progressively from problem to solution. Rather, they generate specific diagnostic hypotheses at a very early stage, well before they have gathered most of the data on a particular case. This result agrees with observations of my colleagues and myself in a study of protocols of skilled lawyers solving legal disputes (Crombag et al., 1972). We also found that skilled lawyers, trying to decide a hypothetical case, seemed to have a provisional solution available at a very early stage, which thereafter they tried to sustain with systematic legal arguments. The observation is not new. It was made earlier by de Groot in his study of the thought processes of chess players (de Groot, 1946), who in turn refers to the work of Selz (1922). Selz's 'schematische Anticipation' is a name for the same phenomenon. Trying to solve a problem, one needs some anticipatory image of its solution to guide the subsequent search process, or, to speak in Bartlett's terms, to transform an open system into a hypothetical closed system (Bartlett, 1958). Elstein and his colleagues say that these early, provisional hypotheses 'are generated out of the physician's background knowledge of medicine, including his range of specific experiences . . .'. By the time students of medicine start their training in medical inquiry they may be supposed to have, at least to a certain extent, a 'background knowledge of medicine', but what they quite definitely do not have is experience.
Yet such experience is essential if they are to perform inquiry processes in the same way as do expert physicians. The implication is that a theory like that developed by Elstein and his colleagues can only have limited relevance for designing an educational procedure for training students in the skill of medical inquiry. In our own study of the process of deciding legal disputes we therefore soon changed from describing precisely what trained lawyers do, to designing a 'rational reconstruction' (the term stems from Popper, 1968) of the process. By a rational reconstruction, we mean a stepwise description indicating how the reasoning would have progressed if it had been executed in a strictly rational manner. Space does not permit a detailed discussion of the method used in designing a rational reconstruction of
the process of solving legal disputes. This has been presented elsewhere (Crombag et al., 1973), along with a precise description of the rational reconstruction itself (Crombag et al., 1972). The final version of the reconstruction contains 42 operations and numerous feedback loops. On the basis of this reconstruction a training procedure was designed, in which students are instructed on the successive steps needed to progress in a rational way from problem to solution. By way of a series of exercises they are trained in using the instruction. Students who have taken this course solve hypothetical cases strikingly and significantly better, on the average, than do comparable students who have not taken the course, as is shown in Figure 1 (Crombag et al., 1973). It should be mentioned that there are methodological problems involved in comparing the two groups. But let me state that even a very conservative interpretation of the data leaves us with an important difference. In teaching students to solve complex problems we want them to learn to perform one of a rather narrowly defined class of sequences of mental operations, which we have defined as rational. Why rational? Because we want the performance to be repeatable for different but similar problems. We want transfer of learning. We are not satisfied with a lucky guess that produces the correct product. We are not satisfied with a reasoning process in which a mistake is made, and in which a second mistake brings the student accidentally back on the right track. We are satisfied with a correct solution if and only if it is reached by way of a correct reasoning process, because we want the linkage between the problem and its solution to be rational (see Cronbach, 1966, p. 78). Thus, we want all students to learn to perform the same reasoning process. What about individual differences then? Students may differ in many respects, and part of these differences should make a difference in problem solving.
A difference, however, for what? The product or the process? Almost invariably in studies on individual differences in learning, product variables are used as criteria. The outcome is that some students, the brighter ones or the more motivated ones, consistently produce more correct products than others. In ATI studies, too, product variables almost always serve as criteria. Let us, for instance, consider in somewhat greater detail an ATI study by Egan, reported by Greeno (1972). In this study the Ss had to learn the binomial formula. Three predictor variables were used: (a) a score on the Mathematical Scholastic Aptitude Test (MSAT); (b) a score on a test measuring the Ss' familiarity with a number of concepts involved in

[Figure 2. Proportion of errors on the criterion test for Ss scoring Low, Medium, and High on the MSAT, Concepts, and Arithmetic measures, under the rule-learning and discovery treatments. (From Greeno, 1972)]

binomial calculus ('Concepts'); and (c) a score on a test measuring the Ss' 'skill at some arithmetic operations involved in calculating binomial probabilities' ('Arithmetic'). Two treatments were used in the study: a rule-learning treatment and a learning-by-discovery treatment. In the discovery method, Ss were given a series of problems in which the ideas involved in the binomial distribution were implicitly given; Ss had to discover these ideas by themselves while solving the problems. In the rule-learning method these ideas were explicitly given to the Ss, who were thereafter asked to use them in solving a number of problems. Three criteria were used, of which I will discuss only one: the number of errors made in a 10-item test following the completion of instruction, in which each item required calculation of a binomial probability. The results, summarized in Figure 2, indicated that bright students had learned better, i.e., made fewer mistakes, than Ss low in ability, which was to be expected. Contrary to expectations, the performance of Ss in the two treatments did not, on the average, differ significantly. However, it appeared that low-ability students learned better in the rule-learning situation. Greeno concludes that there is an interaction effect between ability and treatment. In my opinion a better way to summarize his data is to say that only part of the interaction showed up: low-ability students learn better from rule-learning, while high-ability students learn equally well under both treatments. Why then devise new methods of learning by discovery if the old-fashioned rule-learning methods do equally well for bright students and even better for less bright students? I think that in this experiment the most important effect of the discovery method was missed, an effect stressed earlier by Wittrock (1966).
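Each criterion item in Egan's study required calculating a binomial probability, i.e., evaluating P(k) = C(n,k) p^k (1-p)^(n-k). A self-contained sketch of that calculation follows; the example values are illustrative, since the study's actual items are not reproduced here:

```python
# The skill tested by each criterion item: evaluating the binomial
# formula P(k) = C(n, k) * p**k * (1 - p)**(n - k). The example values
# are illustrative; the study's items are not reproduced here.

from math import comb

def binomial_probability(n, k, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

# probability of exactly 2 heads in 4 fair coin tosses
print(binomial_probability(4, 2, 0.5))  # -> 0.375
# the distribution sums to 1 over all k
print(sum(binomial_probability(4, k, 0.5) for k in range(5)))  # -> 1.0
```

The point of the treatment contrast is not this formula itself but how it is reached: handed over as a rule, or discovered from worked problems.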
The discovery method is a powerful method for training students in the process of discovery, i.e., in discovering general rules that govern series of concrete incidents. In other words, the learning-by-discovery method is a process-oriented learning method. Used as a product-oriented learning method, it is at best as effective as traditional, less cumbersome methods, and might easily be less effective. Most of what we know about the role of individual differences in learning stems from product-oriented studies: some students produce on the average better products than others, and while some students produce better products in some situations, others do better in other situations. What about individual differences in process-oriented education? Do
individual differences play a role in complex problem solving? In a study by Shulman et al. (1968), one of the few studies in which something like process variables were included, it appeared that Ss with different 'seeking styles' performed differently in a complex problem situation; Ss with a 'dialectical seeking style' detected a larger number of problems in a given situation, and used more information sources, than did Ss with a 'didactical seeking style'. However, as may already be guessed from these results, Ss with a didactical seeking style also produced solutions of poorer quality than their dialectical colleagues. In general it may be said that, confronted with a complex problem situation, Ss may vary widely in their attack on the problem. But almost always differences in problem-solving style go together with differences in the quality of the solutions produced as a result. And since we are only interested in solutions of good quality, which, moreover, should be repeatable in different but similar situations, in teaching complex problem solving we do not want to adapt to individual differences; we want to erase them. Although students before training may spontaneously show quite different problem-solving styles - some may, for example, tend to an intuitive approach, others to an analytical approach - after training we want all our students to be able to perform the same sequence of mental operations in spite of initial differences. Depending on their initial, spontaneous problem-solving style, some students will learn a particular problem-solving process faster than others.
What we therefore need in skill training are highly organized teaching procedures, in which students are repeatedly guided through the desired sequence of mental operations, in which each operation is tested each time it is performed and corrected whenever a mistake is made, and in which the students are trained to a criterion: a number of consecutive performances of the desired sequence of mental operations. One might fashionably call such a teaching procedure a 'learning for mastery' procedure, and the testing element embedded in it 'formative evaluation'. While this new terminology may be appropriate, I prefer to call the teaching and testing procedure best fitted to the training of problem-solving skills by its old and familiar name: programmed instruction. Programmed instruction is a teaching procedure highly adapted to individual differences among students, with the aim of erasing them in the end.
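The train-to-criterion procedure described above can be read as a loop, sketched below. The MockStudent, which masters an operation after a single correction, is an assumption made only to keep the example deterministic; it is not a claim about real learners, and the operation names are invented:

```python
# Train-to-criterion loop: guide the student through the operation
# sequence, test each operation, correct mistakes at once, and stop
# after `criterion` consecutive faultless runs.

def train_to_criterion(student, operations, criterion=3, max_runs=100):
    """Return the number of runs needed to reach the criterion."""
    consecutive = 0
    for run in range(1, max_runs + 1):
        faultless = True
        for op in operations:
            if not student.perform(op):   # test each operation...
                student.correct(op)       # ...and correct it immediately
                faultless = False
        consecutive = consecutive + 1 if faultless else 0
        if consecutive >= criterion:
            return run
    return None                           # criterion never reached

class MockStudent:
    """Illustrative stand-in: masters an operation after one correction."""
    def __init__(self):
        self.mastered = set()
    def perform(self, op):
        return op in self.mastered
    def correct(self, op):
        self.mastered.add(op)

print(train_to_criterion(MockStudent(), ["analyze", "qualify", "decide"]))  # -> 4
```

The structure makes the point of the section concrete: the procedure adapts to individual differences (slower students simply take more runs) precisely in order to erase them at the criterion.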

REFERENCES

Anderson, J. R. (1972) FRAN: A simulation model of free recall. In Bower, G. H. (ed.), The psychology of learning and motivation. Advances in research and theory, 5. New York: Academic Press. Pp. 315-78.
Atkinson, R. C., & Shiffrin, R. M. (1968) Human memory: A proposed system and its control processes. In Spence, K. W., & Spence, J. T. (eds.), The psychology of learning and motivation. Advances in research and theory, 2. New York: Academic Press. Pp. 89-195.
Bartlett, F. C. (1958) Thinking. New York: Basic Books.
Bloom, B. S., ed. (1968) Taxonomy of educational objectives. Handbook I: Cognitive domain. New York: David McKay Company.
Bloom, B. S., Hastings, J. T., & Madaus, G. F. (1971) Handbook of formative and summative evaluation of student learning. New York: McGraw-Hill Book Company.
Cohen, M. J., & de Gruijter, D. N. (1973) De organisatie van juridische begrippen bij studenten en docenten. Een eerste experiment. Leiden: Educational Research Center, University of Leyden, memorandum no. 227-73.
Crombag, H. F. (1973) Het oefenen van vaardigheden: het juridisch praktikum. In van Woerden, W. H., Chang, T. M., & van Geuns-Wiegman, L. J. M. (eds.), Onderwijs in de maak. Utrecht: Het Spectrum. (English version available as memorandum no. 204-73 of the Educational Research Center, University of Leyden.)
Crombag, H. F., de Gruijter, D. N., Cohen, M. J., & Langerak, W. F. (1973) Het praktikum methoden en technieken in de Fakulteit der Rechtsgeleerdheid. Leyden: Educational Research Center, University of Leyden, report no. 12.
Crombag, H. F., de Wijkerslooth, J. L., & van Tuyll van Serooskerken, E. H. (1972) Over het oplossen van casusposities. Groningen: H. D. Tjeenk Willink. (An abbreviated version, entitled 'On solving legal problems', is in press in The Journal of Legal Education, 1973-1974, 26, 4.)
Cronbach, L. J. (1966) The logic of experiments on discovery. In Shulman, L. S., & Keislar, E. R. (eds.), Learning by discovery: A critical appraisal. Chicago: Rand McNally & Company. Pp. 76-92.
Elstein, A. S., Kagan, N., Shulman, L. S., Jason, H., & Loupe, M. J. (1972) Methods and theory in the study of medical inquiry. J. med. Educ., 47, 85-92.
Gagné, R. M. (1964) Problem solving. In Melton, A. W. (ed.), Categories of human learning. New York: Academic Press. Pp. 293-317.
Greeno, J. G. (1972) On the acquisition of a simple cognitive structure. In Tulving, E., & Donaldson, W. (eds.), Organization of memory. New York: Academic Press. Pp. 353-77.
de Groot, A. D. (1946) Het denken van den schaker. Amsterdam: Noord-Hollandsche Uitgeversmaatschappij. (Translated into English as Thought and choice in chess. The Hague: Mouton & Co., 1965.)
Jenkins, J. J., & Russell, W. A. (1952) Associative clustering during recall. J. abn. soc. Psychol., 47, 818-21.
Katona, G. (1967) Organizing and memorizing. New York: Hafner (reprint of 1940 edition).
Langerak, W. F. (1973) An instrument for measuring memorization habits of students. Leyden: Educational Research Center, University of Leyden, report no. 14.
Mandler, G. (1967) Organization and memory. In Spence, K. W., & Spence, J. T. (eds.), The psychology of learning and motivation. Advances in research and theory, 1. New York: Academic Press. Pp. 327-72.
Mandler, G., & Pearlstone, Z. (1966) Free and constrained concept learning and subsequent recall. J. verb. Learn. verb. Behav., 5, 126-31.
Popper, K. R. (1968) The logic of scientific discovery. London: Hutchinson (revised edition).
Postman, L. (1970) Effects of word frequency on acquisition and retention under conditions of free-recall learning. Q. J. exper. Psychol., 22, 185-95.
Selz, O. (1922) Zur Psychologie des produktiven Denkens und des Irrtums. Bonn: Friedrich Cohen.
Shiffrin, R. M. (1970) Memory search. In Norman, D. A. (ed.), Models of human memory. New York: Academic Press. Pp. 375-447.
Shulman, L. S., Loupe, M. J., & Piper, R. M. (1968) Studies of the inquiry process. East Lansing, Mich.: Office of Education Bureau of Research.
Tulving, E. (1962) Subjective organization and free recall of 'unrelated' words. Psychol. Rev., 69, 344-54.
Underwood, B. J. (1964) The representativeness of rote verbal learning. In Melton, A. W. (ed.), Categories of human learning. New York: Academic Press. Pp. 47-78.
Welford, A. T. (1968) Fundamentals of skill. London: Methuen & Co.
Wittrock, M. C. (1966) The learning by discovery hypothesis. In Shulman, L. S., & Keislar, E. R. (eds.), Learning by discovery: A critical appraisal. Chicago: Rand McNally & Company. Pp. 33-75.
Wood, G. (1972) Organizational processes and free recall. In Tulving, E., & Donaldson, W. (eds.), Organization of memory. New York: Academic Press. Pp. 49-91.

MARY LOU KORAN

19

University of Florida

Improving aptitude and achievement measures in the study of aptitude-treatment interactions¹

The most fundamental question confronting researchers wishing to investigate ATI is the question of which variables to select in hopes of finding educationally fruitful interactions. The number of possible aptitude, treatment and achievement variables is quite large, hence the number of combinations which may be tested is virtually inexhaustible. The most basic conclusion to emerge from the ATI literature is that general ability measured by aptitude tests (whether educationally loaded or not) predicts success in most learning tasks where meaningful content is employed. While there are laboratory tasks to which general ability appears to have little relevance, correlations have been repeatedly found of broad ability measures or broad composites of verbal reasoning tests with meaningful learning outcomes both in the classroom and under controlled conditions of practice (Cronbach & Snow, 1969). As Cronbach and Snow have pointed out, this confirmation of the importance of general ability is not discouraging to the study of ATI, since if one is to have interaction it is necessary to have a dependable positive relationship of one aptitude measure with one treatment against which another treatment can be contrasted. The most basic question, of course, is how to devise or discover alternative treatments with a flatter regression function, or which actually capitalize on aptitudes other than general ability. Because of the repeated correlation of general ability measures with learning outcomes it has been recommended that contrasting treatments be developed, one of which relies on general ability, while one does not (Cronbach & Snow, 1969). There has been little progress to date, however, in identifying or developing treatments that actually capitalize on aptitudes other than general ability. It is important to note that our generally used aptitude tests are designed for, and validated primarily in terms of, predicting learning outcomes in our rather uniform educational programs as they are presently constituted. Test items are selected largely on the basis of their predictive power rather than on their relationship to observed or hypothesized intellectual processes. They are not generally designed to measure basic processes underlying various kinds of learning, nor to assess performance prerequisites for new learning tasks. Consequently, these aptitude constructs may not be the most fruitful dimensions for measuring those individual differences that do interact with different ways of learning (Glaser, 1972). There are likely to be many aspects of human ability that have been largely disregarded or untapped by conventional testing and instruction, but which may be predictive of scholastic performance under different instructional methods. It is possible that instructional treatments might be designed around distinctions such as Cattell's (1963) fluid ability, representing general 'brightness' and adaptability relatively independent of education and experience, as opposed to crystallized ability, consisting of more acquired knowledge and developed intellectual skills. There is now increasing evidence to suggest that the analytic abilities measured by tests of fluid ability are related to success in learning conceptual materials (Wilson & Robeck, 1964; Taylor & Fox, 1967). If two treatments are found, one of which has a flatter slope than the other, the interaction is practically useful.

1. This paper has been adapted in part from a presentation made at the CTB/McGraw-Hill Invitational Conference on The Aptitude-Achievement Distinction, Carmel, California, February 1973, the proceedings of which are published in book form (Green, 1973).
If the slopes are actually reversed from one aptitude measure to the other, this suggests ways of developing contrasting treatments capitalizing on alternative aspects of general ability. Similarly, Jensen's investigation of Level I as opposed to Level II abilities suggests that some learners should be taught by instructional techniques that can utilize abilities manifested in rote learning, while others should be taught in a conceptual or meaningful way (Jensen, 1969). Some attempts to construct new forms of aptitude measures tapping skills and coding systems other than those typically found in conventional instruction (Seibert & Snow, 1965; Seibert, Reid & Snow, 1967) have resulted in a number of experimental tests using film and audio communication media, some of which have been shown to function uniquely, in a manner opposite to that of general ability measures (Koran, Snow & McDonald, 1971). Hopefully, additional developmental work of this nature will emerge in conjunction with ATI research. The most fruitful approach to the problem appears to lie in the direction of process analysis (Melton, 1967; Koran, 1973). A process analysis of the effects of learner characteristics entails formulation of a model of the processes which are required for performance in a given set of tasks. The model might include such process variables as stimulus differentiation, encoding, association, response integration, short-term memory, retrieval, and so forth. The kind of cognitive processes required for an adequate performance depends, of course, on the nature of the task. Detailed task analysis is required in guiding both the development and selection of aptitude tests believed to measure these processes, and ways in which contrasting treatments could be profitably formed. If differential treatment effects are initially found, more refined psychological explanations of ability-performance relationships may be developed through attempts to alter the slopes of the regression lines by experimental variation in the treatments, providing an important step in the understanding of the psychological nature of both the aptitude and treatment variables under consideration. The major variables in this conception are the task and ability variables which are relevant to a specific performance, and ways of experimentally manipulating these variables. The objective of this kind of analysis is to explain mental ability as a set of skills in analyzing situations in which one kind of response works best under one kind of instruction while another kind of response is more effective under different instruction (Cronbach, 1970). It is not unlikely that such analysis may eventually lead to a new process-oriented conceptualization of both aptitudes and treatments.
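The regression-slope language used throughout this discussion can be made concrete as two within-treatment regressions of outcome on aptitude, with an interaction appearing as a difference in slope. All numbers below are invented solely to produce one flat and one steep line; they are not results from any study cited here:

```python
# Two within-treatment regressions of outcome on aptitude; an ATI shows
# up as a difference in slope. All data are invented for illustration.

def slope(x, y):
    """Least-squares slope of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = sum((a - mx) ** 2 for a in x)
    return num / den

aptitude         = [1, 2, 3, 4, 5]
outcome_rule     = [7, 7, 8, 8, 9]   # flat: outcome barely depends on aptitude
outcome_discover = [3, 5, 7, 9, 11]  # steep: outcome tracks aptitude closely

b_rule = slope(aptitude, outcome_rule)
b_disc = slope(aptitude, outcome_discover)
print(round(b_rule, 2), round(b_disc, 2))  # -> 0.5 2.0
```

With real data one would also have to test whether the slope difference is reliable; the flat-slope treatment is exactly the kind of 'alternative treatment' Cronbach and Snow call for.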
It is noteworthy that aptitude variables have generally been studied in the context of psychometrics while process variables have tended to be studied in the context of laboratory learning. For the study of either to flourish it would seem useful not only to have theories of abilities and processes, but also theories of tests in these areas (Evans, 1970). Aptitude variables are typically considered as variables affecting processes involved in performances such as concept learning, problem solving and so forth. However, there are describable behaviors involved in these performances which are potentially capable of psychometric scaling. Attention has in fact been devoted to this issue. The 1969 Invitational Conference on Ordinal Scales of Cognitive Development sponsored by the California Test Bureau was concerned with building psychometric scales based on

Improving aptitude and achievement measures


theories of Piaget and Inhelder. Similarly the 1965 ETS Invitational Conference on Testing called for measures of what might be termed higher mental processes in solving abstract problems (Stoddard, 1966). More recently, a series of studies (Bunderson, 1970) has shown that concept learning tasks may be analyzed into two phases: dimension selecting and associative learning. Each phase may be further analyzed into processes which imply a requirement for certain reasoning abilities for dimension selection, and certain memory abilities for associative learning. Attempts to define and develop from the concept learning tasks themselves new aptitude measures similar to inductive reasoning, associative memory and other processes have indicated that these process tests generally account for a larger proportion of performance variance than do pre-existing factor tests. Moreover, the process tests cannot be accounted for by the factor tests, but form a distinct cluster of factors of their own. It is conceivable that the correlations of such process tests with performance could be due to nothing more than identical task-specific elements in the two measures, and further research will be required to determine the generalizability of such process measures across tasks or classes of tasks. It seems likely, however, that the measurement of cognitive aptitudes in ATI research will be increasingly based on careful analysis of information processing requirements generated by specific learning tasks rather than on the selection of pre-existing tests (Bunderson, 1970). It should be recognized that the discussion of process analysis need not be limited to cognitive abilities. Relevant to this issue, Cronbach and Snow (1969) have concluded that the borderline between personality and ability is easily permeated. 
There is preliminary evidence to suggest that abilities lead to personality traits and that, conversely, personality traits lead to aptitude patterns determining response to instruction. It is widely recognized that personality variables can have both interfering and facilitating effects on cognitive processes. The research from which this conclusion is drawn, however, typically has not indicated which cognitive processes are affected or how they are affected. Thus, these studies generally fail to indicate any specific ways in which persons could be treated or situations modified in order to maximize learning. Research is now being systematically conducted to determine the influence of individual differences in non-cognitive variables on cognitive processes involved in problem solving. Seiber (1969), for example, hypothesized that the debilitating effect of anxiety on problem solving was due to its


effect on short-term memory. Therefore treatments providing memory support were expected to primarily benefit highly anxious subjects. This expectation was supported. Differentiations between state and trait anxiety (Spielberger, O'Neil & Hansen, 1970) have further extended this line of inquiry. Additional research of this nature is being conducted in such diverse areas as the effects of differences in reflectivity and impulsivity or leveling and sharpening on problem solving (Kagan & Kogan, 1970). Depending on whether treatment and interaction effects occur which satisfy a reasonable model of information processing, this approach may permit a highly satisfactory construct explanation to be made concerning the specific ways in which noncognitive variables influence cognitive processes which in turn affect response to varying instructional methods (Koran, 1973).

RELATIONSHIP TO ACHIEVEMENT MEASURES

The identification and development of relevant aptitude and achievement measures for ATI research must necessarily consider the multivariate nature of learning outcomes. Any instructional treatment has multiple effects, and the treatment that is best in producing immediate mastery is not necessarily best for producing retention, transfer, affective outcomes or other indices of achievement. Moreover, it is likely that different aptitude measures may be related to these various instructional outcomes (Koran, 1971; Koran, Snow & McDonald, 1971; Dunham & Bunderson, 1969). While immediate mastery and retention have probably been most commonly examined in ATI research, previous research and theory would appear to emphasize the usefulness of considering a broader range of dependent variables in studies of this type. Ultimately, differential assignment of subjects based upon ATI must consider which of many possible learning outcomes are to be obtained. Process analysis of aptitude and treatment variables may be considered only partial when confined to immediate mastery of the learning task, and thus directly related to some criterion variables while uncorrelated or only indirectly related to others. Transfer of training, for example, may be expected to be dependent upon initial learning. However, the extent to which ability-performance relationships may be expected to change from learning to transfer tasks cannot readily be determined. Previous research has indicated that such relationships may be exceedingly complex, and the


present state of theory and research does not provide a basis for sophisticated prediction. Only through careful exploration of the interrelationships among multiple independent measures can a theoretical basis for understanding and prediction be developed (Koran, 1973). It is important to recognize, in this regard, that differential effects of treatments may not be revealed in the form of an interaction if the criterion measures employed are always subject-matter achievement (Messick, 1970). Potentially important ATI effects may not become apparent until we broaden our base of achievement measurement to assess changes in variables such as problem-solving processes and strategies, characteristic modes of thinking, motivational variables and other outcome variables which may in turn affect response to future instructional treatment. An associated point is that it is unlikely that interaction effects of this nature will be manifest in laboratory-type experiments of short duration (Cronbach & Snow, 1969). It is important that investigators give such interactions a chance to operate. In doing so, it will be necessary for instructional procedures to be continued long enough to permit students to progress realistically through a body of material and become thoroughly familiar with the instructional style being considered (Koran, 1973).

It should be recognized that aptitude-achievement relationships may not only be affected by instructional method and learning outcome, but also by the nature of the specific subject matter and criterion measures selected. While changes in aptitude-achievement relationships in ATI may sometimes appear capricious, it is conceivable that they may be determined by highly specific task or criterion variables. Mischel (1968) has observed that we have built much of our study of aptitude in terms of generalized tendencies to the neglect of the interaction with situational variables.
As Glaser (1970) has suggested, ATI may be more highly task or subject-matter specific than is generally recognized, with style being determined in one sense by the kinds of processes required to operate in science, mathematics, art, and so forth. Thus, a student may be a convergent thinker operating in mathematics, and a divergent thinker in art; a person may be highly anxious in some situations or tasks, but not in others. Moreover, the design of the criterion measure itself may be an important factor in the aptitude-achievement relationships obtained. Murray (1949), for example, found that while a numerical factor best predicted course grades in geometry, a verbal factor best predicted an achievement test score in the subject. Similarly, it has been suggested (Fitts, 1967) that


the perception and encoding of stimuli depend at least partially on the specific responses to be made to the stimulus information. Criterion tests can be designed to balance various forms of content and process, as for example in the Hamilton (1969) study contrasting verbal and spatial treatments, in which a criterion measure was designed to include verbal items, spatial items, items with verbal stems and spatial alternatives, and items with spatial stems and verbal alternatives. The point to be made is that there is need not only for more careful analysis of the information processing requirements of instructional treatments, but also for corresponding analysis of criterion test items with which to identify differential treatment effects.

SUMMARY

In summary, to discover and demonstrate ATI requires a style of research that has only recently become the conscious concern of investigators. However, the experimental obligation to develop improved measures of both aptitude and achievement in connection with ATI research is clear. The selection of relevant aptitude variables in ATI research will undoubtedly differ across problem areas and individual researchers. In some cases task-specific capabilities may be of central importance while in other cases more general measures of abilities, problem-solving processes and non-cognitive dispositions within the learner may be fruitful interactive variables. In each case, however, identification and development of relevant aptitude measures should be guided by a theoretical conception of the ways in which aptitude enters into the instructional process. Analysis of the ways in which internal response processes work in relation to various stimulus and overt response variables will be required in guiding both the choice of aptitude and achievement measures, and ways in which contrasting treatments could profitably be formed. Hopefully, from such analyses will emerge new process-oriented conceptualization of both aptitudes and treatments leading to the development of mutually compatible aptitude measures, instructional treatments and achievement measures.


REFERENCES

Bunderson, C.V. (1970) Aptitude by treatment interactions: Of what use to the instructional designer. Paper presented at the American Psychological Association.
Cattell, R.B. (1963) Theory of fluid and crystallized intelligence: A critical experiment. J. educ. Psychol., 54, 1-22.
Cronbach, L.J. (1957) The two disciplines of scientific psychology. Amer. Psychol., 12, 671-84.
Cronbach, L.J. (1970) Mental tests and the creation of opportunity. Paper presented before the American Philosophical Society.
Cronbach, L.J., & Snow, R.E. (1969) Individual differences in learning ability as a function of instructional variables. Final report, U.S.O.E., Contract No. OEC4-6-061269-1217, School of Education, Stanford University.
Dunham, J.L., & Bunderson, C.V. (1969) Effect of decision rule instruction upon the relationship of cognitive abilities to performance in multiple-category concept problems. J. educ. Psychol., 60, 121-25.
Evans, G. (1970) Intelligence, transfer and problem-solving. In Dockrell, W.B. (ed.), On intelligence. London: Methuen and Co. Ltd.
Fitts, P.M. (1967) Perceptual experiments and process theory. In Gagné, R. (ed.), Learning and individual differences. Columbus, Ohio: Charles E. Merrill Books, Inc.
Glaser, R. (1970) In Wittrock, M., & Wiley, D. (eds.), The evaluation of instruction: Issues and problems. New York: Holt, Rinehart and Winston.
Glaser, R. (1972) Individuals and learning: The new aptitudes. Educ. Res., 1, 5-13.
Green, D.R. (ed.) (1973) The aptitude-achievement distinction. CTB-McGraw-Hill.
Hamilton, N.R. (1968) Differential response to instruction designed to call upon spatial and verbal aptitudes. Technical Report No. 5, Project on Individual Differences in Learning Ability as a Function of Instructional Variables. Stanford: Stanford University.
Jensen, A.R. (1970) Hierarchical theories of mental ability. In Dockrell, W.B. (ed.), On intelligence. London: Methuen Co., Ltd.
Kagan, J., & Kogan, N. (1970) Individual variation in cognitive processes. In Mussen, P.H. (ed.), Carmichael's manual of child psychology, Volume 1 (3rd ed.). New York: Wiley. Pp. 1273-1365.
Koran, M.L. (1971) Differential response to inductive and deductive sequences of programmed instruction. J. educ. Psychol., 62, 219-28.
Koran, M.L. (1973) Aptitude, achievement and the study of aptitude-treatment interactions. In Green, D.R. (ed.), The aptitude-achievement distinction. CTB-McGraw-Hill.
Koran, M.L., Snow, R.E., & McDonald, F.J. (1971) Teacher aptitude and observational learning of a teaching skill. J. educ. Psychol., 62, 219-28.
Melton, A. (1967) Individual differences and theoretical process variables: General comments on the conference. In Gagné, R. (ed.), Learning and individual differences. Columbus, Ohio: Charles E. Merrill.
Messick, S. (1970) The criterion problem in the evaluation of instruction: Assessing possible, not just probable, intended outcomes. In Wittrock, M., & Wiley, D. (eds.), The evaluation of instruction: Issues and problems. New York: Holt, Rinehart and Winston.
Mischel, W. (1968) Personality and assessment. New York: John Wiley.
Murray, J.E. (1949) Analysis of geometric ability. J. educ. Psychol., 40, 2.
Seiber, J.E. (1969) A paradigm for experimental modification of the effects of test anxiety on cognitive processes. Amer. educ. res. J., 6, 46-61.


Seibert, W., & Snow, R.E. (1965) Studies in cine-psychometry I: Preliminary factor analysis of visual cognition and memory. Final Report, U.S.O.E.G. No. 7-120280-184, Audio Visual Center, Purdue University, Lafayette.
Seibert, W.F., Reid, C., & Snow, R.E. (1967) Studies in cine-psychometry II. Final Report, U.S.O.E., 7-24-0880-257, Lafayette, Indiana: Purdue University, Audio-Visual Center.
Spielberger, C.D., O'Neil, H.F., & Hansen, D.N. (1970) Anxiety, drive theory and computer-assisted learning. CAI Center Technical Report No. 14, Tallahassee, Florida: Florida State University.
Stoddard, G. (1966) On the meaning of intelligence. Proceedings of the 1965 Invitational Conference on Testing Problems. Princeton, N.J.: Educational Testing Service.
Taylor, J.E., & Fox, W.L. (1967) Differential approaches to training. Alexandria, Va.: Human Resources Research Office.
Wilson, J.A.R., & Robeck, M. (1964) A comparison of the kindergarten evaluation of learning potential (KELP), readiness, mental maturity, achievement, and ratings of first-grade teachers. Educ. psychol. Meas., 24, 409-14.

W. GEORGE GAINES & EUGENE A. JONGSMA
Louisiana State University

20

Carroll's model of school learning as a basis for enlarging the aptitude-treatment interaction concept

INTRODUCTION

In his presidential address to the American Psychological Association, Cronbach (1957) characterized the two divergent disciplines of scientific psychology. Experimental psychologists, he claimed, had been interested only in variation which they could create, while correlational psychologists had found interest in already existing variation between individuals and social groups. To the experimental psychologist, individual differences were an annoyance that represented error variance. The correlational psychologist, on the other hand, welcomed individual and group variation but disdained treatment differences as a source of error variance. Cronbach made a plea for the federation of the two disciplines. In calling for a joint application of experimental and correlational methods, he suggested a redirection of research in applied psychology that would consider treatments and subjects simultaneously. The goal should be to identify aptitudes which interact with modifiable aspects of the treatment.

Glaser (1972), in his presidential address to the American Educational Research Association, enlarged upon Cronbach's original theme. He discussed the divergence between the two fields of psychology in terms of two contrasting educational environments - selective and adaptive modes of education - and their respective responses to individuality. Like Cronbach, Glaser held out the promise of aptitude-treatment interactions as a fruitful line of inquiry. In spite of the fact that few aptitude-treatment interactions had been solidly demonstrated, he was hopeful that the development of 'new aptitudes' would increase the efficacy of aptitude-treatment interaction research. The new aptitudes that Glaser referred to were


cognitive processes growing out of contemporary theories of learning, development, and human performance. John B. Carroll's model of school learning (1963) is a description of the economics of learning in the context of the school. The model hypothesizes five variables affecting success in school and how they interact. Although the five variables are conceptually independent, they are functionally interrelated. Thus the model may serve as a vehicle for the investigation of aptitude-treatment interactions. The purpose of this paper is threefold. First, an overview of Carroll's model of school learning will be given with attention directed to the interactive nature of the components of the model. Secondly, weaknesses identified in past aptitude-treatment interaction research are briefly reviewed. Thirdly, in response to the weaknesses identified, suggestions for future aptitude-treatment interaction research will be discussed in the context of the Carroll model.

CARROLL'S MODEL OF SCHOOL LEARNING

The basic formulation of the Carroll model is that a learner would reach mastery of a specific learning task provided that he spent the time he needed to learn that task. Carroll placed the five variables of the model into a formula expressing the degree of learning for the i-th individual for the t-th task as a function of the ratio of the amount of time actually spent in learning to the amount of time needed for learning:

    degree of learning = f (time actually spent / time needed).

The numerator of this fraction is always equal to the smallest of the following: (a) opportunity, (b) perseverance, or (c) time needed. The denominator, time needed, is equal to aptitude plus additional time as determined by the interaction of quality of instruction and ability to understand instruction whenever the former is less than optimal. The first three variables of the model, aptitude, ability to understand instruction, and quality of instruction, are determinants of time needed for learning.

Aptitude. In Carroll's model a learner's aptitude for a specific learning task is defined as the amount of time needed to master the task under optimal conditions. Optimal conditions are: (a) quality of instruction is


optimal for the learner; (b) perseverance, the time the learner is willing to spend, is equal to or greater than time needed; and (c) opportunity, the time the learner is allowed, is likewise equal to or greater than time needed.

Ability to understand instruction. Carroll proposed ability to understand instruction as a variable independent of aptitude. This variable is defined as the learner's ability to perceive the nature of the learning task and the steps to be followed in learning it from teachers and instructional materials. Since most instruction assumes a highly verbal form, some appropriate indexes of this variable are listening ability, verbal ability, and reading comprehension.

Quality of instruction. The third variable of the model, quality of instruction, is defined as the degree to which the organization, presentation and explanation of the learning task are suited for a learner to master a task as rapidly and efficiently as possible. Whenever quality of instruction is less than optimal for a learner, he will need additional time beyond that already required by his aptitude for the learning task.

The fourth and fifth variables of the model, opportunity and perseverance, are determinants of time spent in learning.

Opportunity. The amount of time allowed for learning is defined as opportunity. Whenever opportunity is less than time needed the degree of learning will be less than total mastery.

Perseverance. The fifth variable of the model is perseverance, or the time a learner is willing to spend in learning a given task. Carroll noted that there are many complex factors - motivational, emotional, and cultural - that enter into perseverance. Whenever perseverance is less than opportunity or time needed, it acts to reduce the degree of learning.

Interactive nature of the Carroll model. Carroll hypothesized an interaction between quality of instruction and ability to understand instruction.
The nature of this interaction is such that the degree of learning for learners low in ability to understand instruction will be more severely retarded than for learners high in ability to understand instruction when quality of instruction is low. Time needed can now be formulated as aptitude (time required under optimal conditions) plus the interaction of quality of instruction and ability to understand instruction (additional time required as determined by the degree of a learner's ability to understand instruction and the extent to which quality of instruction deviates from the optimum for a learner). Although not specifically hypothesized in the model, Carroll did raise


the question of how quality of instruction might affect perseverance. If by raising the quality of instruction we also increase perseverance, this would have the effect of lowering the ratio of time spent to time needed. An interesting question is whether the effect of quality of instruction on perseverance is consistent across levels of ability to understand instruction. It may be that quality of instruction interacts with ability to understand instruction on perseverance. Any interactive effect of quality of instruction and ability to understand instruction on the degree of learning would necessarily have to be interpreted in terms of the interactive effects on time needed and time spent. If indeed quality of instruction interacts with ability to understand instruction on both time needed and time spent in the manner previously described, then such an interactive effect on the degree of learning would be far more pronounced than if only one of these interactions were present. The interactions, both explicit and implicit in the Carroll model, discussed thus far have been concerned with quality of instruction and ability to understand instruction. Ability to understand instruction may be classified as a highly generalized trait which may mediate the acquisition of any learning task. Carroll's model, however, additionally permits the examination of more specific kinds of trait-treatment interactions. According to Carroll the quality of instruction for a learner may be determined by his special learning needs and characteristics.
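The arithmetic of the model described above can be expressed as a small computational sketch. Carroll states the model verbally; the function names and the example numbers below are ours, introduced only to make the time ratio concrete.

```python
# A minimal sketch of Carroll's formulation: degree of learning is
# the ratio of time actually spent to time needed. Numbers below
# are hypothetical hours, chosen only for illustration.

def time_needed(aptitude, extra_time_from_instruction):
    """Time needed = aptitude (time required under optimal
    conditions) plus additional time caused by suboptimal quality
    of instruction, moderated by ability to understand it."""
    return aptitude + extra_time_from_instruction

def time_spent(opportunity, perseverance, needed):
    """Time actually spent is the smallest of opportunity,
    perseverance, and time needed."""
    return min(opportunity, perseverance, needed)

def degree_of_learning(aptitude, extra, opportunity, perseverance):
    needed = time_needed(aptitude, extra)
    spent = time_spent(opportunity, perseverance, needed)
    return spent / needed  # 1.0 means full mastery

# A learner needing 10 hours (8 by aptitude, 2 from less-than-optimal
# instruction), allowed 12 hours, but willing to spend only 8:
print(degree_of_learning(aptitude=8, extra=2,
                         opportunity=12, perseverance=8))  # 0.8
```

Raising perseverance to 10 or more in this sketch brings the ratio to 1.0, which is exactly the effect the text attributes to improved quality of instruction acting on perseverance.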

WEAKNESSES IN PAST APTITUDE-TREATMENT INTERACTION RESEARCH

In this section let us turn our attention to an examination of some of the major problems and weaknesses that have been identified in past research on aptitude-treatment interactions.

Treatment definitions. One of the reasons past research has not uncovered more aptitude-treatment interactions is that the treatments have been defined broadly and conceived without regard for personological or aptitude variables. Bracht (1970) reviewed nearly 100 studies which were designed to permit a test of aptitude-treatment interactions. He classified treatments as controlled or uncontrolled depending upon the extent to which they were influenced by external conditions. He found that only five studies, of all those reviewed, produced disordinal interactions. Four of those five studies involved controlled treatments. Bracht concluded that it is unlikely that uncontrolled treatments, which include a variety of tasks, will interact with aptitude variables.

Aptitude definitions. The problem of aptitude definitions is, in many respects, parallel to the problem of treatment definitions. That is, aptitudes have traditionally been defined rather globally and not in relation to specific instructional treatments. In the Bracht review (1970) cited previously, personological variables, or aptitudes, were classified as factorially simple versus factorially complex. Factorially simple variables were related to specific abilities, interests, attitudes, or traits, whereas factorially complex variables represented measures of general ability and achievement. Bracht found that studies which defined aptitudes in specific terms were much more likely to discover significant interactions between such aptitudes and instructional treatments. Bracht's findings should not necessarily be interpreted to mean that more general aptitudes do not interact with some treatments. As Salomon (1972) attempts to explain, the reason for the failure to detect treatment by general aptitude interactions may be attributable to the failure of the treatments to be meaningfully related to the general aptitudes. Irrespective of whether aptitudes are defined as simple or complex, their relation to the treatment definition is paramount.

Interaction definitions. Most researchers have explored aptitude-treatment interactions through analysis of variance procedures. Rejection of a null hypothesis of interaction has customarily been followed by plotting the cell means involved in the interaction. Lubin (1961) has elaborated on the distinction between two types of interaction, ordinal and disordinal. Ordinal interactions occur when the rank order of the treatments is constant but the quantitative effect varies. Disordinal interactions occur when the rank order of the treatments changes with the value of an aptitude variable.
The primary concern of aptitude-treatment interaction researchers has been the detection of disordinal interactions because of their implications for differential assignment of learners to treatments. Some problems have arisen in plotting and testing significant aptitude-treatment interactions. The exact procedures for plotting interactions have not been clearly defined and may vary from study to study. For example, some researchers have plotted the aptitude variable on the abscissa, while others have plotted the treatment variable on the abscissa. Glass and Stanley (1970) warn that the decision as to which variable is plotted on the abscissa may greatly influence the interpretation of the


interaction. Depending upon the researcher's decision, the interaction may be either ordinal or disordinal in appearance. Furthermore, merely plotting the interaction, as is often suggested, is not adequate. Marascuilo and Levin (1970) have carefully delineated proper post hoc comparisons for interactions in analysis of variance designs.

Another shortcoming of aptitude-treatment interaction research has been the lack of attention given to the detection of second-order or three-way interactions. Mitchell (1969) has suggested that a given interaction may be more closely connected to a particular level of learner development than most researchers realize. This could mean that an interaction may not be generalizable across developmental stages. Mitchell's suggestion that the temporal dimension be included as a third factor in aptitude-treatment interaction studies might be expanded to also include other relevant variables as third factors. As a two-way interaction can obfuscate main effects, so may a three-way interaction obfuscate a two-way interactive effect. This could account for many of the seemingly conflicting findings in aptitude-treatment interaction research.

Dependent variable definitions. Past aptitude-treatment interaction research has typically focused on single dependent variables. The importance of considering the multivariate outcomes of aptitude-treatment interactions has been succinctly pointed out by Mitchell (1969, p. 698):

    If the person-environment interaction is critical for understanding and predicting human behavior, it is equally apparent that this interaction can only be defined effectively in multivariate terms. We are multi-trait individuals responding to multi-characteristic environments, and the total pattern of these interactions determines the direction of our behavior.

The multivariate aspects of performance have also been alluded to by Cronbach and Snow (1969).
The focus of researchers on single dependent variables in aptitude-treatment interaction studies may be too simplistic an approach in view of the complexity of performance.
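The ordinal/disordinal distinction drawn by Lubin can be made concrete with a small sketch: given each treatment's regression line of outcome on aptitude, the interaction is disordinal when the lines cross inside the observed aptitude range, so that the better treatment depends on the learner's aptitude. The slopes, intercepts, and range below are hypothetical.

```python
# A sketch of the ordinal vs. disordinal classification for two
# treatments, each summarized by a regression line of outcome on
# aptitude. All numbers are hypothetical.

def classify_interaction(line1, line2, apt_range):
    """Each line is (slope, intercept); apt_range is (lo, hi)."""
    (s1, b1), (s2, b2) = line1, line2
    if s1 == s2:
        return "parallel (no interaction)"
    crossover = (b2 - b1) / (s1 - s2)  # aptitude where the lines meet
    lo, hi = apt_range
    return "disordinal" if lo < crossover < hi else "ordinal"

# The lines cross at aptitude = 4. Whether the interaction looks
# disordinal depends on whether 4 lies inside the observed range.
print(classify_interaction((2.0, 0.0), (0.5, 6.0), (0, 10)))  # disordinal
print(classify_interaction((2.0, 0.0), (0.5, 6.0), (5, 10)))  # ordinal
```

The second call illustrates why the distinction is sample-dependent: the same pair of regression lines is ordinal over one aptitude range and disordinal over another.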

CARROLL'S MODEL AS A HEURISTIC FOR APTITUDE-TREATMENT INTERACTION RESEARCH

In response to the problems identified in the previous section, attention will now be focused on suggestions for overcoming these problems. Particular attention will be directed toward the use of the Carroll model as a paradigm for investigating aptitude-treatment interactions.


Improving treatment definitions. In terms of the Carroll model, treatment is synonymous with quality of instruction. The concept of quality of instruction is extremely broad, encompassing variables such as teacher performance and characteristics of instructional materials. Within this general framework, however, the researcher has the freedom to select specific treatment variables of interest. Carroll suggests that treatments may be profitably defined along two dimensions. The first dimension includes those factors that would be expected to promote a high quality of instruction for all learners. For example, high quality of instruction would tend to be associated with the following: (a) the learner's recognition of the task to be learned and how he is to learn it; (b) the learner being placed in adequate sensory contact with the instructional means; and (c) the learner being prepared for every step in the learning sequence. The second dimension includes those factors that would be expected to promote a high quality of instruction for the individual learner. For example, high quality of instruction would tend to be associated with those factors that accommodate the individual learner's special needs and characteristics.

A number of strategies for conceptualizing treatment variables have been reported in the literature. Cronbach and Snow (1969) suggested that the processes performed by subjects in learning certain tasks be carefully analyzed so that treatments which capitalized on their abilities might be developed. In a similar vein Glaser and Nitko (1971) described component task analysis as a scheme which includes the structure of the subject matter and the psychological structure of the learner in determining a treatment variable.

Improving aptitude definitions. Carroll's model accounts for both simple and complex aptitudes.
In the phraseology of the model, aptitude is synonymous with what others refer to as simple or task-specific aptitude, whereas ability to understand instruction is synonymous with terms such as complex or generalized aptitude. To improve aptitude definitions does not necessitate a choice between simple or complex aptitudes. Rather, improvement will occur when aptitudes - simple and complex - are defined in relation to some meaningful treatment dimensions. Carroll's notion of aptitude (that is, task specific aptitude) is conceptually related to the dimension of quality of instruction that accommodates the individual learner's special needs and characteristics. On the other hand, Carroll's notion of ability to understand instruction is conceptually related to the dimension of quality of


W. George Gaines and Eugene A. Jongsma

instruction that involves factors which are generalizable across learners. While researchers have tended to investigate one or the other of these aptitudes, the Carroll model suggests that both should be considered simultaneously.

Improving interaction definitions. Carroll explicitly hypothesized an ordinal interaction between quality of instruction and ability to understand instruction on time needed. With a small degree of extrapolation one may say that quality of instruction and ability to understand instruction also interact in an ordinal fashion on perseverance. (Perseverance and time spent are equal when opportunity is not an influencing factor.) If either or both of these interactions occur, then quality of instruction and ability to understand instruction will also interact in an ordinal manner on the degree of learning. Also implicit in the model is the provision for the examination of disordinal interactions between task-specific aptitudes and those factors in quality of instruction that are conceptually related to that aptitude. Although the ordinal and disordinal interactions in the Carroll model are conceptually independent, it is impossible to assess their separate effects simultaneously in a two-way ANOVA design. It appears that utilization of a three-way analysis of variance design will provide more powerful tests of aptitude-treatment interaction hypotheses.

Improving dependent variable definitions. Carroll's model suggests two possible dependent variables - perseverance and degree of learning. Most past aptitude-treatment interaction research has considered only degree of learning as a dependent variable and has usually assessed it in a general way, such as overall achievement. However, perseverance, or the amount of time the learner willingly spends engaged in learning the task, has largely been ignored as a dependent variable in aptitude-treatment interaction research.
It would be possible, and even desirable, to view degree of learning and perseverance simultaneously, in a multivariate sense, in the assessment of performance. Another important consideration is the relationship between the measure of the dependent variable and the aptitude-treatment interaction. In some studies researchers have used general achievement measures to assess the effect of specific treatments. In order to be maximally sensitive, however, dependent variable measures must be conceptually related to the treatment and aptitude dimensions. As researchers begin to apply more rigorous techniques, such as component task analysis, in describing and developing treatments, they may begin to see a greater need for multiple dependent variables. Instead of a single terminal behavior, the learning task would consist of a set of specific interrelated behaviors.

We would like to conclude this paper with the following statement of Shulman (1970, p. 374): 'Aptitude-treatment interaction will likely remain an empty phrase as long as aptitudes are measured by micrometer and environments are measured by divining rod.'
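The time relationships at the heart of Carroll's model, which underlie the interaction hypotheses discussed above, can be sketched in a few lines. The capped-ratio form below is our simplification for illustration, not Carroll's exact formulation:

```python
def degree_of_learning(time_needed, perseverance, opportunity):
    """Sketch of Carroll (1963): degree of learning is a function of the
    ratio of time actually spent to time needed. Time spent is limited
    both by the learner's perseverance and by the opportunity allowed;
    when opportunity is not an influencing factor, time spent equals
    perseverance (up to the point of mastery)."""
    time_spent = min(perseverance, opportunity, time_needed)
    return time_spent / time_needed
```

For example, a learner who needs ten hours, perseveres for five, and has ample opportunity reaches a degree of learning of 0.5; once perseverance and opportunity both exceed the time needed, the ratio is capped at complete mastery.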

REFERENCES

Bracht, G.H. (1970) Experimental factors related to aptitude-treatment interactions. Rev. educ. Res., 40, 627-45.
Carroll, J.B. (1963) A model of school learning. Teach. Coll. Rec., 64, 723-33.
Cronbach, L.J. (1957) The two disciplines of scientific psychology. Amer. Psychol., 12, 671-84.
Cronbach, L.J., & Snow, R.E. (1969) Final report: Individual differences in learning ability as a function of instructional variables. Stanford, California: Stanford University.
Glaser, R. (1972) Individuals and learning: The new aptitudes. Educ. Res., 1, 5-13.
Glaser, R., & Nitko, A. (1971) Measurement in learning and instruction. In Thorndike, R.L. (ed.), Educational measurement. Washington, D.C.: American Council on Education.
Glass, G.V., & Stanley, J.C. (1970) Statistical methods in education and psychology. Englewood Cliffs, New Jersey: Prentice-Hall.
Lubin, A. (1961) The interpretation of significant interaction. Educ. psychol. Meas., 21, 807-17.
Marascuilo, L.A., & Levin, J.R. (1970) Appropriate post hoc comparisons for interaction and nested hypotheses in analysis of variance designs: The elimination of type IV errors. Amer. educ. res. J., 7, 397-421.
Mitchell, J.V. (1969) Education's challenge to psychology: The prediction of behavior from person-environment interactions. Rev. educ. Res., 39, 695-721.
Salomon, G. (1972) Heuristic models for the generation of aptitude-treatment interaction hypotheses. Rev. educ. Res., 42, 327-43.
Shulman, L.S. (1970) Reconstruction of educational research. Rev. educ. Res., 40, 371-96.

21

MICHAEL W. ALLEN
The Ohio State University

Student initiated testing: The base for a new educational system

Probably the most widely used instructional method at every level of institutionalized education is the instructor's presentation of prepared academic statements to a captive audience of from 25 to 500 students. At most levels, the presentation is correlated with a published text and may be accompanied by blackboard illustrations or audio-visual aids. Presentations are fitted into a fixed time period of approximately 50 minutes, according to logistical decisions for managing classroom use around the needs of the average student's curriculum. The number of weeks through which a course of instruction runs is typically standardized - semesters or quarters, for example - and at pre-established points during the term (and nearly always at the end), examinations are given to assess the students' progress. Since the instructor's work day is devoted to the preparation and delivery of daily presentations, there is little possibility of any significant interaction with individual students or even with small groups working on special projects. Even when preparations from a previous term are reused, an informal analysis of an instructor's day reveals many hours spent on administrative and clerical tasks (attendance, enrollment, and grade reports, withdrawal permissions, scheduling changes, examination scoring, grade assignment, test writing, etc.). The demand for attention to these tasks often forces instructors, though cognizant of the need to update and revise class presentations, to reuse outdated preparations and to base instruction on a single, and perhaps outdated, text.

The students' learning environment produced by this educational structure is far from adequate. Since presentations are designed to meet the needs and abilities of the average student, many students are either under- or over-challenged by the learning task. The highly motivated, high-ability student must suffer slow progression through the topic, take the full term to cover the domain of the course established for less able students, and receive only limited attention to his 'too advanced' questions. The slower student suffers the frustration of increasingly difficult progression because of a poor grasp of basic concepts which were covered too quickly. Both upper and lower ability students are likely to transfer the negative attitudes developed toward the learning experience to the subject matter, since they frequently do not dichotomize the two. A dislike of school often accompanies a dislike of the topics taught in school, and vice versa.

THE TESTING GAME

Educational testing is a game teachers and students play. Although both recognize the value of test taking as a learning experience, tests are rarely used primarily for the purpose of learning. Except in extreme cases, individual selection of appropriate learning activities based upon test results does not occur. Occasionally a student will interpret test results clearly enough to undertake appropriate action on his own. If his decisions are correct and his activity effective, he may produce relatively higher scores on the next examination, although this subsequent test is likely to cover new material for which he has (appropriately) not had time to study, will probably not cover to any significant extent the topics he now understands, and will yield a score to be averaged or otherwise combined with his previous score.

Part of the testing game, at least prior to increasingly popular reform, has involved the construction of the tests themselves. In an attempt to motivate students to learn everything presented to them,1 the specific knowledge or skills to be tested were unknown to the student (if not also to the teacher until a day or so before the test). The testing game players try to out-guess each other. The instructor, with little or no risk depending upon his conscience, tries to ask questions about points he feels to be important but perhaps not directly conveyed to the student audience. The student tries to decipher the instructor's values from notes taken at information and value gathering sessions (classes), combine these data with any acquired knowledge of the instructor's test construction style, weigh each content topic for its probability of test coverage, and then study accordingly.

Stress now being placed on the use of behavioral objectives is sure to reform the rules of the game. Students will know exactly what to study and what constitutes an appropriate test. Defense of test items will necessarily be based on stated objectives rather than the whim of the instructor. Selection of a few test items to test an ambiguously defined area of learning will be replaced with tests designed to measure the accomplishment of each and every objective.2

1. Perhaps I am too generous in even suggesting credibility to the overworked rationalization.

ALTERNATIVES

It is the author's hypothesis that the troubles facing current educational practice, briefly described above and upon which the reader has surely extrapolated by now, are rooted in the demands made by overstructured, time-oriented administration of educational resources. It is incredible that the community of educational professionals has allowed the needs of educational administration for efficient clerical and financial systems to so bias the values of the system itself. But then, given the poor instructor-to-student ratio commonly existing and the shortage of clerical assistance, one might very well reply that there was very little possibility of adopting better instructional strategies. Although some might argue the extent to which administrative demands for management efficiency have established instructional strategies, it is certain that these demands tend to lock us into static systems. Just imagine the response one would get when suggesting that new students might be admitted into a program on any day of the year, depending upon when currently enrolled students finish, and that it is not known just how many students will have finished on a given day, since completion time is dependent upon student performance. Add to this the reporting of an 'A' grade for every student completing the course!

There is no question that truly individualized educational systems create management problems of extreme complexity and that perhaps before the computer age such systems of any significant size were impossible. But there is also no question that such systems are feasible using widely available computer hardware and software systems.

2. Although the use of such objectively constructed tests is essential to the educational system to be described, arguments supporting behavioral objectives are not included here, since they are well rehearsed in much of the contemporary educational literature.

COMPUTER MANAGED INSTRUCTION

To prove this point, a sophisticated computer program was designed and implemented using the widely available Coursewriter III program product by IBM. Coursewriter III provides an author language, telecommunications support for remotely located computer terminals, and utility programs for the collection and storage of student performance data. The basic system was slightly modified at OSU to support Hazeltine 2000 cathode ray tube terminals, which are able to present elaborate displays at high speed, make no noise, and do not make printed copies of tests presented to students. Use of Coursewriter III ensures transferability of the program from one institution to another at low cost and with comparative ease.

The program is generally classified as a Computer Managed Instruction (CMI) program, since its function is that of test generation, data recording, and the prescription of learning activities best suited to the needs of individual students as determined through repeated diagnostic testing. The program was designed by Allen, Meleca, and Myers (1972) to test students on their achievement of established behavioral objectives upon command. Test questions are selected from test-item banks provided for each behavioral objective and divided into three cognitive levels (reduced from Bloom's six; Bloom, 1956): (a) knowledge-comprehension, (b) application-analysis, and (c) synthesis-evaluation. Within the restrictions of the number of questions to be presented for each objective and the percentage of questions to come from each of the three levels, items are chosen randomly by the computer.

Since the program is designed to manage modularized instruction (maximum of 32 objectives per module), decisions must be made concerning student progression from one module to another. Rather than build such decisions into the program, Allen and Philabaum3 have extended the CMI program to allow instructors to make on-line decisions which determine the specific course flow. Answers made by instructors to on-line computer-presented questions determine the management strategy to be used for all or some of the students. Up to 100 management strategies may be active at one time, each using up to 30 of the developed modules. Table 1 lists the questions presented to the instructor during the development of a management strategy.

Table 1. Management strategy questions answered by instructors
(a) Enter list of required modules or EOB if none. If they are to be taken sequentially, then enter the code letters in order.
(b) Must these modules be taken in fixed sequence as shown? (yes-no).
(c) Enter list of optional modules or EOB if none.
(d) Are students to be prevented from attempting another module when they are having difficulty obtaining mastery? (yes-no).
(e) How many of the optional modules should students select?
(f) Are there prerequisites for any of the modules? (yes-no).
(g) Enter module letter followed by one or more prerequisite module letters.

3. Carl Philabaum, Coursewriter III Programmer, Computer Assisted Instruction, The Ohio State University.

CMI does not attempt to present instruction to the student. Its function is to provide meaningful testing to guide students in their use of such learning resources as books, films, laboratories, and conferences with instructors. It must be emphasized that tests are generated for student guidance and not for performance grading. Tests are given in a 'no risk' situation in that they may be taken as often as results are useful and as often as is needed for the achievement of mastery. Other features of the system allow student review of behavioral objectives not mastered within any module, access to extended descriptions of the objectives, generation of tests in modules not required for the course (with a 'no record' option), retaking of previously mastered tests with a no-risk option (results are not recorded if inferior to previous performance), and access to all performance data accumulated by the computer. Rather than further detailing specific system components, which are described elsewhere, the remainder of this paper discusses the needs and effects of implementing an educational system based upon student initiated testing and relatively unlimited use of the computer for data handling.
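The random item-selection step described above can be sketched as follows. This is an illustrative sketch only: the function and data-structure names are hypothetical and are not taken from the actual Coursewriter III program of Allen, Meleca, and Myers (1972).

```python
import random

# The three cognitive levels, reduced from Bloom's six.
LEVELS = ("knowledge-comprehension", "application-analysis", "synthesis-evaluation")

def generate_test(item_banks, objectives, items_per_objective, level_shares, seed=None):
    """Randomly draw items for each behavioral objective, honoring the
    share of questions to come from each of the three cognitive levels.

    item_banks maps (objective, level) pairs to lists of test items;
    level_shares gives the fraction of items to draw from each level."""
    rng = random.Random(seed)
    test = []
    for objective in objectives:
        for level, share in zip(LEVELS, level_shares):
            wanted = round(items_per_objective * share)
            bank = item_banks.get((objective, level), [])
            # Sample without replacement, never exceeding the bank size.
            test.extend(rng.sample(bank, min(wanted, len(bank))))
    return test
```

With four items per objective and shares of 50/25/25 percent, each objective contributes two knowledge-comprehension items and one item from each of the two higher levels.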

THE COST OF REFORMATION: CMI IMPLEMENTATION

Careful specification of educational objectives is necessary for the definition of modules of instruction and for the construction or selection of test items. Only questions relating directly to behavioral objectives can be used. Such clear-cut requirements are a tremendous aid to instructors in organizing their courses. The reduction in ambiguity of course goals not only assists the instructor in the continuing process of course improvement, but also increases the possibility of close cooperation between instructors and students in course revision. The benefits to the student of clear definition of course objectives and involvement in course development need not be described here, although it is difficult to overestimate their significance. The use of behavioral objectives is extremely beneficial and costs only instructor time.

The switch to fixed learning criteria and variable learning time is a more difficult proposition, not because its benefits are less significant, but because of its incompatibility with current administrative systems. The neatly organized system of beginning and ending all courses of instruction on pre-established dates, calculating the number of students to be admitted by a count of classroom chairs, and charging all students equal fees based on the number of hours of direct instruction must be replaced by an accounting system which can allow students to complete the course at any time performance criteria are reached, schedule students to begin the course soon after others have finished, and tolerate large variances in the amount of time students remain enrolled in the course. Computer programs can undoubtedly manage the task once administrative personnel are convinced of the value of the reform.

Not to be ignored are the faculty, who must abandon the traditional role of preparing group presentations in favor of preparing behavioral objectives, related test items, and individual learning resources. They must also learn to face students on a one-to-one basis, since the role of the instructor will change from that of a stage performer to a learning counselor able to help the student cope with the new responsibility of individualized education.
Special facilities are required, including computer hardware, telecommunication lines, reading and reference rooms, and audio-visual devices for individual use. Special computer programs are needed. Special printed materials, slides, video tapes and movies may also be needed to meet the individual needs of students. In short, not only are many opposed to revising educational methods in favor of adapting to individual needs, but it is also a very expensive and time-consuming chore to move to a fixed-learning, variable-time system.

But the motive to proceed with reform is not slight. Many institutions of every size are making tremendous advances in Computer Managed Instruction. At The Ohio State University, the CMI model is being applied to introductory biology, child development, and industrial education instruction. The general applicability of the computer software and the use of standard computer hardware have gone a long way toward overcoming some of the traditional barriers.

CONCLUSION

The potential value of an individualized system of instruction based on the mastery of behavioral objectives over a variable period of time seems almost absurdly obvious. In fact, this paper makes no effort to elaborate the increased effectiveness such a system would have; rather, it points out the many and immediate changes needed at nearly every level of educational institutions to accommodate the change to a fixed-learning, variable-time system. It appears that even with the power of the computer and the general availability of extremely flexible management programs - designed in a 'text-free' format such that insertion of specific content-related text and test questions can be accomplished rapidly - resistance to this change may still be overwhelming.

At The Ohio State University, several courses are adopting CMI in an attempt to prove the benefits of the new educational system. Funding for full-scale operation has not been found, and the system has been compromised for compatibility with the quarter system and the standard A, B, C, D, and F grading system. Nevertheless, enthusiasm for the system is incredibly high among all involved with CMI (including students), students and instructors are working very closely together, and evidence points to very high learning achievement throughout.

REFERENCES

Allen, M.W., Meleca, C.B., & Myers, J.A. (1972) A model for the computer management of modular, individualized instruction. Columbus, Ohio: Ohio State University.
Bloom, B.S. (1956) Taxonomy of educational objectives: Cognitive domain. New York: David McKay.

APPENDIX 1

Abstracts of submitted contributions not published in full

Trait x treatment interactions with personalized/unitized and lecture/midterm methods of instruction
by ROBERT D. ABBOTT and PAULINE M. FALSTROM
California State University, Fullerton

This study investigated relationships between student characteristics and performance in two statistics classes, one taught by a 'traditional' lecture/midterm instructional method and the other by a personalized/unitized system of instruction (PSI) following Keller. Measured student characteristics included (a) aptitudes such as grade point average and scores on subtests of the Differential Aptitude Test, and (b) attributes such as biographical data, preferences for particular teaching methods, and personality traits. Student learning was measured on two criteria: (a) score on a common comprehensive final examination, and (b) total points earned in the course.

In contrast to other studies comparing PSI with lecture/midterm classes, there were no significant differences between classes in mean performance on either criterion. The design of this study, however, allowed the investigation of interactions between student characteristics and teaching methods. Investigation of these trait x treatment interactions by means of linear regression analysis showed that performance under PSI was better than performance under the lecture/midterm method for students characterized by the following: (a) low aptitude for math, (b) low grade point average, (c) worked, and (d) personality traits such as plans work inefficiently, needs sympathy, conceals feelings, or avoids problems. Students with these characteristics performed significantly better under PSI than students with the same characteristics taught by the lecture/midterm method. Other student characteristics were positively related to performance under both teaching methods.


In general, these results support the theory that students characterized by behaviors maladaptive to the educational process do better under PSI than in a traditional lecture/midterm class. This study also provides positive confirmation for the existence of attribute x treatment and aptitude x treatment interaction in the teaching of elementary statistics.
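The linear regression approach to trait x treatment interactions reported here can be sketched with simulated data. Everything below is an illustrative sketch: the variable names, effect sizes, and data are hypothetical, not Abbott and Falstrom's.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
aptitude = rng.normal(0.0, 1.0, n)           # e.g. standardized math aptitude
psi = rng.integers(0, 2, n).astype(float)    # 1 = PSI section, 0 = lecture/midterm

# Simulate PSI helping low-aptitude students most (a disordinal pattern).
score = 70 + 5 * aptitude + 2 * psi - 4 * aptitude * psi + rng.normal(0, 3, n)

# Ordinary least squares with an aptitude-by-method product term.
X = np.column_stack([np.ones(n), aptitude, psi, aptitude * psi])
beta, *_ = np.linalg.lstsq(X, score, rcond=None)
```

A reliably nonzero coefficient on the product term (beta[3] here) indicates that the aptitude-performance regression slope differs between the PSI and lecture/midterm groups, which is exactly the trait x treatment interaction the abstract describes.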

Piaget-tasks for formative evaluation and prescription in schools
by PATRICIA ARLIN
University of North Carolina, Greensboro

The stated objective of this paper is to expand the use of Piaget-type tasks from a strictly research or diagnostic mode to one of prescription and formative evaluation. The paper attempts first to extend the context and function of Piaget-inspired tests to prescription and formative evaluation. Secondly, it focuses upon the interaction of Piagetian-type assessments with the interventions in schools prescribed by those assessments. Finally, it treats the interplay between such assessments and interventions in the context of the report of an experiment to determine the relationship of relational thinking competence to reading comprehension.

In developing these three points it proposes a taxonomy of specific types of thinking ordered along two dimensions, stage-appropriate and category-component. The taxonomy represents both the order of appearance of particular types of intellectual operations and the degree of dependence of a particular operation upon mastery of previous thought structures. It provides a working structure both for curriculum decision making at the systems level and for the choice of specific types of formative evaluation and intervention at the teaching or classroom level.

The Rasch model and its applications to test equating
by W.L. BASHAW, R.R. RENTZ, SARAH LEELEN BRIGMAN and CAROLYN GREEN
University of Georgia, Athens

No discussion of national evaluation or test construction is complete without reference to Rasch's contributions to measurement. His model is now receiving extensive investigation in Europe and in the USA. This paper reviews specific objectivity, Rasch's item model, and our use of the model in test equating.

Test equating is essential in national evaluation. USOE funded a massive equating study of American standardized reading tests that was completed by Educational Testing Service this year. The project reported here is a USOE-supported re-analysis of the ETS data using Rasch theory.

Rasch's work is based on his concept of 'specific objectivity'. A measure is 'specifically objective' if it is invariant with respect to the ability of the calibration sample and with respect to the particular sample of questions chosen from the item domain. This characteristic frees test constructors from the need to have highly representative samples of persons in test development and provides norm-free test interpretation. Other advantages will be discussed. In order to obtain this characteristic, it is necessary and sufficient that Rasch's model reasonably describe the test data. The model is a simple logistic function for an item characteristic curve (ICC) having one item parameter and one person parameter. Procedures for constructing such measures and analyzing data are available (Wright, 1968; Wright & Panchapakesan, 1969; Panchapakesan, 1969).

Although American reading tests were not constructed to fit Rasch's model, we find excellent stability of results over multiple samples of varying sizes and compositions. We will present samples of equating tables for the reading tests. The results include a reasonable scale for reading that can replace grade equivalents and is used to reference all of seven major reading tests and their parallel forms. Implications for cost efficiency in equating studies are also impressive.
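The item model referred to above can be written directly. This is the standard one-parameter logistic form of the Rasch model as a generic illustration, not the authors' calibration code:

```python
import math

def rasch_prob(ability, difficulty):
    """Rasch item characteristic curve: the probability of a correct
    response is a simple logistic function of the difference between
    one person parameter (ability) and one item parameter (difficulty)."""
    return 1.0 / (1.0 + math.exp(difficulty - ability))
```

When ability equals difficulty the probability is exactly 0.5, and on the log-odds scale the model is additive (ability minus difficulty). That additivity is what makes item calibrations invariant across samples and test equating tractable.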

The evaluation of six new educational programs
by JANE A. BONNELL
Grand Rapids Public Schools, Grand Rapids

An evaluation method has been developed and used in the study of six school programs. Four of the programs studied employ a systems approach, one is a kind of individualized instruction, and one a senior high school reading center. The six programs are:

1. An Alternative Education Program serving 176 students, ages 12 to 19. Alpha Corporation contracts to provide a system, materials and consultant help. The contract learning package provides remedial work in reading and math.
2. A Reading and Math Program in an elementary school. Alpha Corporation provides professional training and technical support services for 300 pupils, grades 1 to 3, in an inner-city, black, low-income school.


3. A Reading Program in a junior high school serving 85 seventh grade students reading below fifth grade level. Alpha Corporation contracts to provide a systematic approach of objectives for reading skills, diagnosis, and prescription.
4. A Learning Management Program in an elementary school serving 148 pupils in a low-income area. Learning Unlimited Corporation contracts to provide a delivery system of individualized units of study, multi-source self-instruction learning materials, a system of behavior management, and a data processing system providing reportage and evaluation.
5. A Reading Program in a parochial elementary school. The program is a tutorial one to improve reading performance. Thirty-five children, levels 2 through 4, are served.
6. A senior high school Reading Center for diagnosis and development of individualized programs for students. The program serves 154 students per semester, grades 9 to 12, in an integrated inner-city high school.

Academic growth in secondary school black and white pupils from a diverse socio-economic community
by FRANK E. BOXWILL
Amityville Public Schools, Amityville, New York

BACKGROUND

Recent events have marshalled many forces for and against testing with educational, aptitude, and psychological measures for academic or vocational placement. The criticisms of educational testing using standardized achievement measures are not new; they have given rise to more precise measures of academic achievement using criterion-referenced testing. Since the process of education and evaluation must continue while new means of determining educational effectiveness are being designed and validated, more refined use of the standardized achievement measures available to us seems appropriate and relevant in this time of search for precision and pertinence in educational measurement.

This study reports on the comparison of scores of 300 junior high school pupils over a two-year period using the Metropolitan Achievement Tests, Advanced Form G, to determine academic achievement and growth patterns. Attention was given to such factors as:
(a) The change in grade equivalent scores from eighth grade Metropolitan Achievement Test measures, on replication of the MAT one year later.


(b) Comparison of the change scores of ethnic or racial groups during the same period.
(c) Comparison of the change scores of pupils in different streams or ability groups over the same period.
(d) Correlation of scores in total reading and total mathematical ability for the group at large, and for subgroups of pupils based on racial identification and ability group assignment.

METHODS AND PROCEDURES

MAT grade equivalent scores of 300 ninth grade pupils in 1971 were compared with their grade equivalent scores of the previous year. A correlated t-test was used to determine the significance of academic achievement and growth.
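A correlated (paired) t statistic of the kind used here can be computed with standard-library Python. This is a generic sketch; the data names are illustrative, not the study's:

```python
import math
from statistics import mean, stdev

def correlated_t(pre, post):
    """Correlated (paired) t statistic on matched pre/post scores:
    the mean of the paired differences divided by the standard error
    of those differences (sample sd over the square root of n)."""
    diffs = [b - a for a, b in zip(pre, post)]
    n = len(diffs)
    return mean(diffs) / (stdev(diffs) / math.sqrt(n))
```

For instance, pre scores [1, 2, 3] and post scores [2, 4, 6] give differences [1, 2, 3], so t = 2 / (1 / sqrt(3)), about 3.46, on n - 1 = 2 degrees of freedom.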

All academic change scores were significant at the 0.01 or 0.05 level, except in the areas of mathematical problem solving and total math for pupils in the top ability group, and mathematical problem solving for pupils in the low ability group. The most significant change was manifest in science scores. Growth in academic achievement is validated, and interventions to channel these growth patterns through curricular restructuring are discerned. Discussion will focus on the elaboration of the findings and their relevance to effective educational programming.

A systems approach to determining criterion levels of performance: Operationalizing the model
by JOE H. BROWN
University of Kentucky, Lexington

The present paper attempts to identify systems components, as well as formative and summative evaluation techniques, for linking differing performance criterion levels of teachers with pupil learning. A recent review of the research on performance-based education indicates a minimum standard or criterion of performance: an individual is not considered competent until he can perform certain skills at a specified level of proficiency. While performance-based education focuses on teacher skills which are related to pupil learning, the critical question seems to be: What minimal performance levels should the teacher reach in order to effect optimal pupil learning? This question is based on the assumption that there are a number of performance factors which relate in a hierarchical arrangement to produce a desired pupil outcome. Furthermore, each teacher performance factor has an optimal criterion level; however, each depends on a certain minimum criterion level in each of the teaching skills below it for a significant and desired output. For instance, one often considers a pupil's ability to make inferences (outcome variable) apart from the prerequisite skills for doing so (process variables), e.g., classifying objects and identifying cause-and-effect relationships. In this case, it is important to determine the necessary teacher behavior and minimal criterion levels for helping students reach criterion levels on both the process and outcome variables. Central to the systems model is the assumption that teacher skills are interdependent and should be regarded as such in the measurement of pupil outcomes. Once the teacher skills and criterion levels are identified, the cost and time to reach the given level of performance are recorded. Finally, the amount of reinforcement required to maintain the desired performance level is identified.

A model for the integration of evaluation in teaching and learning by NINA W. BROWN Old Dominion University, Norfolk

Many teacher education programs teach methods and techniques of evaluation apart from both the subject-matter material and the methods and materials for teaching. Instead of the evaluative aspect of learning being an integral and constant part of the teaching-learning process, it is presented to prospective teachers as an end product, i.e., how well students perform on tests determines how much has been learned in the unit or course. If the process of evaluation were approached with the attitude that it is ongoing rather than a product, much could be done by the classroom teacher to integrate evaluation - not just measurement - with the teaching-learning process. Evaluation by classroom teachers is generally confined to measurement of cognitive learning assessed by paper-and-pencil tests, many of which are improperly constructed and ineffectively measure achievement. Although teacher education gives lip service to individual differences, maturation levels and the 'whole' student, little of this is practiced in the evaluation of progress in school. To be utilized effectively, evaluation should be taught to teachers as an integral part of their teaching, along with methods of adapting materials and techniques to the individual student and to the various subject matters.


A model for the integration of teaching, learning and evaluation will offer methods for making evaluation an integral part of the education of the classroom teacher, including assessment and evaluation of cognitive, perceptual and affective learning.

Achievement motivation, values and performance of lower-class boys by LARRY W. DEBORD University of Mississippi, Mississippi Much recent social science research is concerned directly or indirectly with various forms of achievement. Indicators of achievement range from the achievement values and performance of school children to the job placement and success of adults. Motivation is central to most explanations of achievement within a role at any age or career level. Social researchers have operationalized motivation in numerous ways. Though sociologists have often employed direct verbal measures of motivation appropriate to the achievement setting, projective measures of motives are often used. Motives have been tapped with a variety of devices; however, the most widely used projective measure of achievement motivation (n-Ach) is the Thematic Apperception Test. The numerous studies which have explored the relationships between n-Ach and various measures of achievement performance have produced contradictory findings. Such findings have been explained theoretically by Atkinson's (1957) interactive model of motivated behavior. Others have argued that achievement orientations are not relevant to status attainment processes (Featherman, 1972), or that fantasy-based measures are so unreliable, and predictive studies so seldom control important variables, that little is known about the role of motivation in achievement (Entwisle, 1972). Among the limitations of previous studies of n-Ach are their small samples of predominantly white, middle-class subjects and the treatment of achievement orientation as a unidimensional concept (Kahl, 1965). This study attempts to address some of these issues by focusing on intraclass variation in motivation and its relation to other elements of the achievement syndrome in a lower-class sample (N = 93) of Negro and white elementary school boys. The measure of n-Ach is based upon orally reported story responses to four TAT pictures.
Race differences in achievement imagery by cue strength of picture are explored, as are the relationships between n-Ach and achievement values, interest in education, aspirations, and school achievement. IQ is controlled through sampling.


Generally, the orally reported stories seem to result in greater intraclass variation in achievement imagery than usually reported. Crystallization of elements of the achievement syndrome varies by race.

A scheme for measuring educational outcomes of high-risk programs (A prototypal model at Federal City College) by ANNIE R. DIAZ Federal City College, Washington D. C.

Identification and measurement of factors related to students' educational growth resulting from college experience require a comprehensive research plan. The difficulty of assessing educational outcomes can partly be explained by the nature of education as a social process which culminates in intangible end-products. The issue of inferring an institution's effectiveness through student performance is far more complex on campuses where an 'open admission' policy is practiced. It is for this new breed of colleges that this scheme has been designed. The study is to assist policy-makers in the development of a planning tool that could be effectively utilized in allocating limited resources among competing programs and in responding to issues of accountability and budget justification. The longitudinal study method will be applied to evaluate student performance. Survey instruments such as questionnaires and standardized tests will be administered to cohorts of sample students. Patterns of academic growth among students will be examined as initial measures, short-term measures and long-term measures. The plan involves three phases: (1) problem refinement; (2) data base development and analyses; and (3) initial stages of the longitudinal study. Participation of other colleges practicing 'open admission' will be solicited to provide the control group of students. The independent variables are students' characteristics (i.e., personal, socioeconomic and academic) and school characteristics by department. The dependent variables will encompass the cognitive, affective and behavioral changes in students. Generalized regression methods will be utilized to discover the basic underlying dimensions related to achievement and non-achievement of goals. The applicability of the evaluation model to urban institutions will be defined in the research outputs.
This investigation is therefore expected to reduce further investment (i.e., time, money and manpower earmarked for planning) of other institutions in developing their assessment models.


Testing for cognitive development: Intervention and acceleration by A.L. EGAN State University College, Buffalo Instructional practices in higher education reflect the assumption that students who pursue post-secondary education enter with the skills that Jean Piaget has described in the stage of Formal Operations; some evidence exists that this is not necessarily the case. Many post-secondary institutions now attract a variety of clienteles and operate under an open admissions policy, or a dual admissions policy, to admit a significant number of culturally distinct and 'high risk' students. By definition, these students will present lower scores on standard achievement tests. Frequently, such students are also given diagnostic testing to determine specific subject-area weaknesses. On the basis of diagnostic testing, students are referred for remedial work or channeled into the regular academic stream; they may or may not be referred for tutorial assistance. Tutorial assistance is usually course-specific, with the focus on material presented by a particular instructor. At no time are these, or other, students tested to determine their level of cognitive development or exposed to learning experiences designed to facilitate their progress from one stage of cognitive development to the next. It is proposed to test all entering students for level of cognitive development. Testing results will then allow initial grouping to be done on the basis of level of cognitive development rather than achievement or diagnostic scores. Intervention strategies will be suggested for those students who have not yet achieved the level of Formal Operations, to accelerate development to that level.

Diagnosis in a computer managed instructional system by R. GOBITS Eindhoven University of Technology, Eindhoven When students respond to fixed-answer format questions, the purpose of diagnosis is not to give feedback about the individual errors they may have made. Feedback should be given about error-factors. An error-factor is a shortcoming in the learning process which generates a class of errors. Errors belonging to the same error-factor may be made in responding to different questions. The purpose of diagnosis is to infer the error-factor from a certain pattern of errors.


If the diagnosis is the basis for feedback within a CMI system, feedback should be given by the computer immediately after the student has taken a test. These two conditions, detecting error-factors and giving feedback automatically, are the main issues of the research being done at the moment. Our aim is to construct structural and syntactical rules, to be used by the teacher when making a test, which govern the relations between the questions to be responded to by the students. From a priori knowledge of the relations between questions it will be possible to identify, in advance, patterns of wrong answers due to a certain error-factor. The means for formulating these structural and syntactical rules are given by propositional logic and the mathematical theory of lattices. The research into the possibilities offered by these theories is still at an early stage.
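
If each question is tagged in advance with the error-factors its wrong answers can indicate, the inference step might be sketched as follows. This is a Python illustration only; the question tags and the two-error threshold are invented, not the Eindhoven system's actual structural rules.

```python
# Each question is tagged a priori with the error-factors that a wrong
# answer to it can indicate (the relations between questions).
QUESTION_FACTORS = {
    "q1": {"sign-error"},
    "q2": {"sign-error", "order-of-operations"},
    "q3": {"order-of-operations"},
    "q4": {"units"},
}

def diagnose(wrong_answers, threshold=2):
    """Infer error-factors from a pattern of wrong answers: a factor is
    reported when at least `threshold` questions tagged with it were
    answered incorrectly."""
    counts = {}
    for q in wrong_answers:
        for factor in QUESTION_FACTORS.get(q, ()):
            counts[factor] = counts.get(factor, 0) + 1
    return {f for f, c in counts.items() if c >= threshold}
```

For example, missing q2 and q3 implicates the order-of-operations factor, while a single miss on q4 is not enough evidence to report the units factor.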

Change factors in educational organizations which necessitate new approaches to measuring school climate by THOMAS M. GOOLSBY University of Georgia, Athens This paper attempts to show that schools are rapidly changing, and examples of these changes are given. Particular attention is given to viable 'alternative schools'. The factors of change are discussed in the light of research on organizational change, and the traditional process of change is presented as an integral part of the presentation. There is a need for organizational change to proceed from a theory base rather than by the traditional process of moving from innovation to implementation to incorporation as a stable part of the organizational structure. A theory-based model of organization is proposed and discussed, one in which both leaders and others who live and work in the organization gain by being treated in ways which make them feel important while contributing to the achievement of objectives.

Continuous objective-based trend testing by SHELLEY A. HARRISON State University of New York, Stony Brook Continuous Trend Testing is a systematic computer-supported procedure for keeping track of each student's progress on every objective of a course at


frequent intervals throughout the year. It enumerates for the student the goals he is striving for and the progress he is making towards achieving them. It tells him which objectives he knew before instruction and which he still must learn; which objectives he achieved after being instructed and which he must study some more; which achieved objectives he remembers later in the course and which ones he must review. It gives the classroom teacher a wealth of trend data - for each student, for different student groups, and for the class as a whole. It reports such phenomena as non-learning of objectives, learning and retaining, learning and forgetting, gradual learning, delayed learning, interaction between objectives, and high performance even before instruction. It helps the teacher make decisions, before and during instruction, on what to reteach, omit, review, condense, and resequence. It allows administrators to make informed decisions on course weaknesses, course materials, objectives, test items, sequencing, curriculum, testing and instructional methods.
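
The trend phenomena listed above amount to classifying, for each objective, a student's sequence of pass/fail results across successive test occasions. A minimal sketch follows; the classification rules are illustrative, not the system's actual ones.

```python
def classify_trend(results):
    """Classify one student's per-objective sequence of pass (True) /
    fail (False) results across test occasions, the first occasion
    being the pretest before instruction."""
    if all(results):
        return "knew before instruction"
    if not any(results):
        return "non-learning"
    first_pass = results.index(True)
    if all(results[first_pass:]):
        # Passed at some point and kept passing thereafter
        return "learning and retaining"
    # Passed at some point but failed on a later occasion
    return "learning and forgetting"
```

A fuller rule set would also separate gradual from delayed learning by how late the first pass occurs, and would compare trends across objectives to surface interactions.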

Using local test data to specify skill hierarchies by GEORGE M. HUNTLEY Sir George Williams University, Montreal Local educational planning requires hierarchies of skills which are uniquely suited to local students. Thus, a flexible strategy is needed for providing detailed information about what instructional hierarchies appear optimal for local students as a group, and what each student's learning profile is within a particular hierarchy of skills. At the heart of such a strategy would be a method for specifying a learning hierarchy which reflects local goals and current student abilities. A sample of students, stratified on ability, would be administered a battery of related skill tests based upon local goals. The results would then be used to enumerate one or more learning hierarchies that are 'tailor-made' for the local population of students. Of the several empirical techniques available for specifying a skill hierarchy, two particularly useful ones are described in detail: (1) difficulty scaling followed by a partial-correlational analysis of possible triplet sequences of skills; and (2) a computer-based method for specifying hierarchies that maximize a measure of positive transfer among the skills while minimizing evidence against transfer (i.e., 'reversals'). While the former method is a useful heuristic, the latter is more closely in accord with learning theory. The increasing availability of computing facilities is putting both methods within the reach of more and more schools.


As an illustration, both methods are applied to the results of a battery of writing tests given to junior-high school students. Finally, a nine-point program of testing and teaching is described for using either method of specifying learning hierarchies.
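
The first steps of the two techniques can be illustrated briefly: difficulty scaling orders skills by the proportion of local students passing each, and 'reversals' count students who pass a putatively harder skill while failing an easier one. A Python sketch with hypothetical local test data (the skill names and results are invented):

```python
def difficulty_order(pass_matrix):
    """Order skills from easiest to hardest by the proportion of the
    local sample passing each (the first step of difficulty scaling)."""
    return sorted(pass_matrix,
                  key=lambda s: -sum(pass_matrix[s]) / len(pass_matrix[s]))

def reversals(pass_matrix, easier, harder):
    """Count students who pass the harder skill while failing the
    easier one - evidence against a prerequisite (transfer) relation."""
    return sum(1 for e, h in zip(pass_matrix[easier], pass_matrix[harder])
               if h and not e)

# Hypothetical local results: 1 = pass, 0 = fail, one entry per student
data = {
    "spelling":   [1, 1, 1, 1, 0],
    "sentences":  [1, 1, 1, 0, 0],
    "paragraphs": [1, 1, 0, 0, 0],
}
order = difficulty_order(data)  # easiest to hardest
```

Method (2) would go further, searching over candidate orderings for the one that maximizes transfer while minimizing the reversal counts computed above.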

Sex bias in educational tests: A sociologist's perspective by MARLAINE LOCKHEED KATZ Educational Testing Service, Princeton The purpose of education, and hence of educational testing, should be to expand the life options of individuals. In too many cases, however, the life options available to men and women are rigidly defined, and these definitions are reflected within educational tests. Constructors of new tests should avoid building tests and items which may reflect preconceptions about the appropriate life options of men and women. The purpose of this paper is to present criteria for the construction and evaluation of tests which expand the options of men and women. These criteria may be applied to evaluate tests administered to any heterogeneous population. Data will be presented dealing with the occurrence of sex bias in educational tests. Sex bias in tests may be identified according to seven different criteria: (1) the actual distribution of test items dealing with male and female actors; (2) the content of items and how male and female actors are portrayed; (3) the content of items relative to traditional or stereotyped male or female interests or skills; (4) the effects of (1), (2), or (3) above on male or female success on any item or items or on the test as a whole; (5) the overall predictive validity of the test for males and females with respect to some criterion such as future grades; (6) the use of separate norms for evaluating the test performance of males and females; (7) the uses made by counselors and others of test performance to predict the future occupations, interests or skills of males and females, when such predictions separate male futures from female futures.
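
Criterion (4), differential item success, lends itself to a simple screening computation. The sketch below uses hypothetical proportions correct and an arbitrary flagging threshold; it does not condition on overall ability, as fuller differential-item-functioning analyses do.

```python
def flag_biased_items(male_correct, female_correct, threshold=0.15):
    """Flag items whose male/female proportions correct differ by more
    than `threshold` - a rough screen for criterion (4).
    The threshold value is illustrative, not a recommended standard."""
    flagged = []
    for item, m in male_correct.items():
        f = female_correct[item]
        if abs(m - f) > threshold:
            flagged.append(item)
    return flagged

# Hypothetical proportions correct per item, by group
males   = {"item1": 0.80, "item2": 0.55, "item3": 0.70}
females = {"item1": 0.78, "item2": 0.75, "item3": 0.69}
```

A flagged item is then inspected against criteria (1) to (3) to judge whether the disparity traces to stereotyped content rather than the skill being measured.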


Evaluation of an experimental oral training program for deaf students by RICHARD KEENE Utah State Board of Education, Salt Lake City Two hundred sixteen students with severe hearing handicaps were subjected to an educational program designed to improve their verbal comprehension and speaking capacities. (Manual signing was prohibited.) Three levels of treatment were considered: (1) those with no treatment; (2) those with one or two years' special treatment; and (3) those in the program from three to nine years. Sex and location of school were other dimensions in the three-way analyses. Concomitant variables used as covariates were: (1) age; (2) intelligence; (3) degree of perceptual-motor handicap; and (4) degree of hearing handicap. Three-way multivariate analyses of variance were employed as the principal statistical technique. These were augmented by Scheffé tests of differences between means, partial correlations, and regression analyses. Substantial successes of the experimental treatment were found for all three groups of criteria. As might be expected, when the effects of social adjustment and attitude toward school were partialled from the data, the effectiveness of the treatment vanished; that is, all criteria were closely related to attitude toward school and social adjustment. It was recommended that the study of the effects of varied educational treatments be continued and that the results of the study be disseminated with recommendations to develop similar programs for other hearing handicapped children.

The classroom behavior task: strength and sensitivity by ROBERT J. LESNIAK Pennsylvania State University, Middletown

The Classroom Behavior Task is an example of a simulation utilized to assess strength and sensitivity criteria in teaching candidates. The task involves four role players as students who provide cues to which the teaching candidate responds; the candidate is then rated by two observers. The strength criteria are based on the ability to initiate structure, to remain consistent, to organize ideas and to maintain self-control. The sensitivity criteria are based on the ability to seek and utilize pupil feedback, to exhibit empathy and commendation, to use an appropriate language level, and to demonstrate an attitude of warmth. The Classroom Behavior Task simulates in 10 minutes the problems a teacher may face during several days of inner-city teaching. The correlations between the simulation and actual classroom teaching range from a rho of 0.33 to 0.75, depending on the criteria. There is an underlying assumption that values affect one's performance on the task more than skills do, and the task can be adapted to cultural situations such as the differences between a classroom scene in Philadelphia, Pennsylvania, and one in Los Angeles, California. Early research findings support use of the task as a screening instrument or a diagnostic tool. The observation procedure has been utilized to determine the competence of urban teaching interns during their professional program of development. Attempts are also underway to develop a paper-and-pencil task which would correlate closely with the results of the Classroom Behavior Task.

Learning: Enhanced through ongoing assessment and evaluation by LESLIE LEWIS Action Computer Associates, Oklahoma City During recent years several evaluation models have been developed, e.g., Stufflebeam, Scriven, SWCEL. However, the details of implementation and application to specific situations are seldom supplied. The purpose of the proposed paper is to outline the evaluation techniques and to describe how they are used to evaluate a kindergarten individualized instructional program in special education in the areas of mathematics, language, reading, and psychomotor skills. All of the evaluative decisions relating to individuals and curricula are made on the basis of data provided by the administration of sets of criterion-referenced tests. The paper will describe the development of behaviorally stated objectives and their corresponding criterion-referenced tests, and will also describe the administration of the sets of tests at weekly intervals and nine-week intervals during the length of the instructional program. The testing is arranged in such a way as to permit each student to progress sequentially through his curricula. The scoring, analyses, and reporting of the test results by computer to teachers and other program personnel also will be described. The paper will also describe how curricula and instructional activities are modified on the basis of the test results and how each student is recycled within his program based on need and analyses of these results. There will also be a brief discussion of the use of time sharing via a remote access terminal as a cost-effective way of using computer hardware and providing more rapid analyses of test data. In summary, the paper will describe an application of criterion-referenced testing to the evaluation of classroom instruction through the use of time-sharing techniques with a remote access terminal. It will outline a practical set of formative evaluation techniques for classroom instruction and illustrate how they were used to evaluate the performance of special education students in kindergarten.
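
The recycle/advance decision described above reduces, for each objective, to comparing the pupil's criterion-referenced test score with a mastery standard. A minimal sketch follows; the 80% cutoff and the objective names are invented for illustration, not the program's documented standards.

```python
MASTERY_CUTOFF = 0.80  # assumed mastery standard: 80% of items correct

def placement(objective_scores):
    """For each behaviorally stated objective, decide from its
    criterion-referenced test score whether the pupil advances in the
    sequence or is recycled through the instructional activities."""
    return {obj: ("advance" if score >= MASTERY_CUTOFF else "recycle")
            for obj, score in objective_scores.items()}

# Hypothetical weekly test scores for one pupil
decisions = placement({"counting to 20": 0.9, "number words": 0.6})
```

In a time-sharing arrangement the same routine would run centrally after each weekly administration, with the decisions reported back to the teacher through the remote terminal.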

Formative evaluation through written simulation by JEAN E. LOKERSON and M. ELISE BLANKENSHIP Northern Illinois University, DeKalb Evaluative procedures utilized in the educational process include an ever-increasing diversity of techniques and materials. Some of the most innovative of these methods seek to extend evaluation beyond mere summative data which determine the final acquisition of information. The technique of written simulation described and demonstrated here provides an excellent tool for the extension of evaluation to the formative or process aspect. More specifically, the written simulation utilizes a simple and relatively inexpensive, yet versatile, method for simulating a variety of problem-solving situations. As an innovative method with broad application to the field of education, the written simulation technique includes the following new elements: (1) Application to evaluative measurement of the formative processes in education through the use of simulations which place the participant in realistic problem-solving and decision-making situations in which the approach used can be determined, examined, and evaluated. (2) Utilization of an inexpensive paper-and-pen, latent-image format which can be produced on a standard office duplicating machine as well as by commercial printers, and which avoids the complexities and expense of computer hardware. (3) Availability of simulation programs of a linear and/or branching nature which are highly comparable to advanced computer software. For example, depending upon the construction of a specific problem, the participant's actions not only provide continual feedback information and modifications in the setting, but also preclude the retraction of previous decisions. The basic technique, which has been under development and refinement by the medical profession, appears to present a very real and practical opportunity for adaptation and further development in the field of education.
By the very nature of education as a sequential, highly interrelated, and decision-oriented process, the written simulation, through the evaluatable exposure of latent images, opens new and innovative opportunities in formative evaluation. Presentation of some recently developed examples of these written simulations, using a latent-image method, will seek to stimulate the audience to further extend and develop this promising evaluative tool in the field of education.
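
A branching written simulation is, in essence, a decision tree in which each revealed (latent-image) choice fixes the path taken and cannot be retracted. The sketch below illustrates that structure with an invented classroom scenario; it is a structural analogy, not a transcription of any actual simulation.

```python
# Each node offers options; a chosen option reveals feedback and the
# next node, and - as with latent images - cannot be retracted.
# The scenario content is invented for illustration.
SIMULATION = {
    "start": {"prompt": "A pupil cannot decode a new word.",
              "options": {"sound it out together": ("feedback A", "next"),
                          "supply the word":       ("feedback B", "end")}},
    "next":  {"prompt": "The pupil blends the sounds.",
              "options": {"praise and continue": ("feedback C", "end")}},
    "end":   {"prompt": "Episode complete.", "options": {}},
}

def run(choices, sim=SIMULATION):
    """Follow a participant's sequence of choices, recording the path
    taken - the record a formative evaluator would examine."""
    node, path = "start", []
    for choice in choices:
        feedback, node = sim[node]["options"][choice]
        path.append((choice, feedback))
    return path, node

path, node = run(["sound it out together", "praise and continue"])
```

The evaluator reads the recorded path, not merely the terminal outcome, which is what makes the instrument formative rather than summative.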

Monitoring the successful teaching of reading in open classroom situations by NITA LOUGHER Goldsmiths' College, London Theory: Systems theory, information processing at many levels. 'Reading' is social communication and involves the writer, the message and the reader who interprets the message. The teacher is part of the system, a link in the chain of transmitted and returned information. 'Reading' is an integrated system composed of the semantic, syntactic, graphic and phonic systems. Practice: The application of Ergonomics. Method: Problem solving and the development of aids and visual material. Monitoring devices: (1) Flowplans - to plan and control classroom organisation, preparation of resources and possible development, and to record actual development. Flowplans are open-ended models. (2) Reading schedules - master plans to monitor the teaching of reading. They are models for teachers and pupils to use as a basis for record construction.

BACKGROUND

This system of monitoring devices and records was developed with students over a period of five years. It was designed to be open-ended and aimed at producing teachers who are creative in their approach and are capable of taking full responsibility for the complex organisation and the development of the curriculum in the open classroom. The system is planned to achieve individualised teaching in a carefully controlled environment. Testing is an integral part of the teaching. The pupils are also encouraged to take an active part in keeping their own records of reading progress. The system is based on theoretical knowledge and has been tested in schools. It helps the student and teacher assess the value of the many reading schemes, reading programmes and material available, and to choose material most suitable for their school situation.


ILLUSTRATIVE MATERIAL

Examples of flowplans, reading schedules and students' and teachers' records. Material used for teaching and testing.

Creative aptitude and instructional methods by MERVIN D. LYNCH Northeastern University, Boston RICHARD L. SCOTTI Boston University, Boston The literature on creative aptitude is reviewed, some attributes which distinguish between high and low creative individuals are identified, and implications are drawn for differential needs in terms of educational environment, teaching methods and evaluation of learning outcomes. Compared with low creatives, high creatives: (1) exhibit higher levels of anxiety and self-esteem; (2) have more freedom of choice and more need for exercising this freedom; (3) spend more time in the periphery than in the center of attention in processing information; (4) tend to process and store information according to generative associative grammatical rules as opposed to list structures of detail; (5) tend to lag behind others in the development of psychomotor skills and attention to detail; (6) have greater need for social acceptance and demonstrate greater empathy with others; (7) tend more to open themselves to encounter; and (8) exhibit more rebellion and less effortful behavior. Educational environments which provide an open classroom with individualized programmed instruction and emphasize verbal and pictorial associational content seem especially suitable for facilitating learning for the creative individual. But evaluation of learning outcomes in these teaching environments, especially at the elementary grade levels, will need to focus more on testing for structural development than for content detail; to place less emphasis on written encoding, which requires the development of manual dexterity, and more on verbal encoding of test responses; to provide more time in evaluation for retrieving, generating and organizing responses; and to allow greater freedom of choice in the content, locus and order of response. These and other problems of testing in open educational environments are considered.


Narrative testing reports: The state of the art by WALTER MATHEWS University of Mississippi Much progress has been made in the psychometric aspects of the testing process; unfortunately, equal progress has not been achieved in the reporting procedures that are used to communicate the results of the testing. The purpose of testing is to provide information for decision-making. Certainly the available information is not 'provided' unless it is understandable to the receiver of the report. As long as reports are received by people who have some psychometric sophistication and an understanding of the test in question, there are few problems in interpretation - whatever the format of the report. But if others are to be users of the information provided by tests (e.g., the testee - a pupil, his parents, his teacher whose speciality is social studies, not educational measurement), is the information that they need to make informed decisions provided to them in a usable form? In 1971 the author designed, developed, implemented and evaluated a computer-based test reporting system for the Iowa Tests of Basic Skills (ITBS) that generated testing reports for grade four pupils of the public schools of Madison, Wisconsin. Individual reports were generated for the teacher and the parents of each child, and the reports were in a narrative format which virtually eliminated the use of the quantitative concepts of grade-equivalent scores and percentile ranks. The reports were well received, with significant impacts measured when they were compared with the traditional columns-of-numbers reports. For the past two summers, the author has been working with William E. Coffman, the Director of the Iowa Testing Programs at the University of Iowa, and with other members of his staff in order to design a more advanced and generalizable narrative reporting system for the Iowa Tests of Basic Skills. The progress of this project will be reported and its future plans will be shared.
This paper, then, will sketch the history of narrative testing reports and will present the current state of the art. In addition, a projection of the potential impacts and applications of this technique in reporting the results of educational measurement will be offered.
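
At its core, generating a narrative report means mapping each quantitative score onto a sentence. The bands and wording below are invented for illustration and are not the Iowa reporting system's actual text.

```python
def narrative(pupil, subtest, percentile):
    """Render one score as a narrative sentence instead of a number.
    The percentile bands and phrasing here are illustrative only."""
    if percentile >= 75:
        band = "is doing very well in"
    elif percentile >= 25:
        band = "is making typical progress in"
    else:
        band = "needs extra help with"
    return f"{pupil} {band} {subtest}."

line = narrative("Kim", "reading comprehension", 82)
```

A production system would vary the phrasing, combine subtests into paragraphs, and tailor the wording to the reader (teacher versus parent), but the underlying score-to-sentence mapping is the same.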


Some implications of testing bicultural-bilingual children within a monolithic educational system by PHILIP MONTEZ United States Commission on Civil Rights, Los Angeles The American education system has always been based on a monolithic assumption that all students attending public shools are 'Anglo urban middle-class, English speaking' individuals. This narrow point of view is an exclusive mandate which allows that success or failure must be based on the premise that all students fit in this category. The above definition of individuals seeking education is not only exclusive of those individuals who lack English speaking skills but makes no allowance to include the cultural realities which the non-Anglo student brings to the school. The cultural background of the individual is important to him because this is the 'self or the personality that he brings to education. American education has ignored the fact that bicultural-bilingual individuals exist who under the law need and deserve an educational program which is responsive to their way of life. In the United States there is a large Spanish speaking population. This group is made up of Puerto Ricans, Cubans and other persons of Latin extraction. In the Southwestern region of the United States resides the largest of this group, the Mexican American. His bicultural-bilingual antecedents do exist and his educational problems are compounded by the fact that he resides in a cultural buffer zone, the area between Mexico and the rest of the United States. He finds his cultural and linguistic background constantly reinforced by the close proximity to Mexico. This closeness is merely separated by an artificial line called the border. Culture has no respect for political boundaries. Human interaction, an important aspect of cultural exchange goes on daily. Cultural differences have not been taken into account sufficiently by those individuals who develop and design educational tests. 
They have not adequately considered or researched those variables which the Spanish speaking bring to the schools. The consequence is that the bicultural-bilingual child is more often than not misplaced, misevaluated and consequently misguided throughout his educational experience. His frustration, anxiety and disillusionment within the educational process are insurmountable. The result is that his group has one of the highest drop-out rates of any ethnic group. If he remains in education, he is generally at the low end of the achievement level, misplaced in classes for slow learners, as well as being over-represented in classes for the educable mentally retarded. The student develops and maintains a negative image of himself and vents his hostility upon the system which fostered the attitude. In conclusion, this paper will strive to encourage research and development of testing that will bring about a more equitable instrument to evaluate the true potential of bicultural-bilingual children. The implications for world-wide education are many. The process will move us closer to the cultural pluralism which must develop to ensure a truly global community.

Evaluation in instructional programming systems by ROBERT L. MORASKY State University of New York, Plattsburgh This paper attempts to illustrate the difference in the sequential placement, function and structure of evaluative instruments within the various suggested systems of programming instruction. In addition, the relationship of such considerations as simulation and taxonomic categorization to different programming procedures is discussed. A survey of major instructional programming systems reveals that the preparation of evaluative instruments can occur at different times in the programming process and that the sequential placement is indicative of the function and structure of the instruments. In the first system type, characterized as open and coupled, task analysis generally precedes objective writing and evaluative instrument preparation; hence, simulation of supra-system task conditions is of paramount importance in the stimulus and response components of the measurement instrument, and objectives serve only a communicative role. In the second type, characterized as a closed system, task analysis is either non-existent or plays a secondary function to objective writing; therefore, evaluative instruments reflect a relationship to objectives, but supra-system task simulation is of less importance than taxonomic considerations. In the open, coupled systems a taxonomy of educational objectives is useful primarily as a means for categorizing evaluative behaviors, whereas in a closed system it can be the basis for selecting evaluative items.

Determining and implementing norms for the adjustment of examination marks and their implication for university entrance examinations by J.C. MULDER University of South Africa, Pretoria Students' knowledge is not the only factor contributing to examination results; to a great extent the marks obtained are determined by the way in which the papers are set and the marks are awarded. Whenever an examination is written by a relatively large number of students, only small fluctuations of marks from year to year may be expected for the same subject at the same level. These marks should also show a more or less normal distribution. By setting norms conforming to certain requirements and adjusting examination marks by computer according to these norms, a method was found whereby fluctuation of marks from year to year may be eliminated. By empirical investigation, norms have been found for the different subjects that are acceptable to the authorities (in this case the Transvaal Education Department as well as the Joint Matriculation Board of South Africa). These norms were applied to the marks obtained in the matriculation examination of 1972 and resulted in a better distribution of marks and the speeding up of publication of examination results by a few days. An examiner is expected to set a paper by means of which the rank order of all the examinees may be determined as accurately as possible. To achieve this, either questions of the essay type or the objective type or both may be included. The computer may be programmed to adjust these marks according to a normal distribution or any other distribution one wishes.
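A rank-based adjustment of the kind Mulder describes can be sketched in a few lines of Python. This is a minimal illustration, not the actual Transvaal procedure; the function name and the target mean and standard deviation are assumptions chosen for the example.

```python
from statistics import NormalDist

def adjust_marks(raw_marks, target_mean=50.0, target_sd=15.0):
    """Map raw examination marks onto a normal distribution with the
    given mean and standard deviation, preserving rank order."""
    n = len(raw_marks)
    # Rank each mark, giving tied marks the average rank of their group.
    order = sorted(range(n), key=lambda i: raw_marks[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and raw_marks[order[j + 1]] == raw_marks[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    nd = NormalDist(target_mean, target_sd)
    # (rank - 0.5)/n keeps percentiles strictly inside (0, 1).
    return [nd.inv_cdf((r - 0.5) / n) for r in ranks]
```

Because the mapping works on ranks, it preserves the order of the candidates while forcing the adjusted marks into the desired distribution; tied raw marks receive identical adjusted marks.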

Increased organizational effectiveness through the use of the school organizational development questionnaire by DAVID J. MULLEN and THOMAS M. GOOLSBY University of Georgia, Athens Over a period of many years Likert has been conducting research in industry about factors in the structural, psycho-social, and managerial subsystems which contribute to increased organizational effectiveness. He describes this research and some of the results obtained in two books, New patterns of management (1961) and The human organization (1967). What Likert and others have been discovering through research studies is that the supportive-participative management system achieves higher, or at least equal, productivity levels with fewer of the resentments, hostilities, grievances and breakdowns inherent in management systems using the traditional principles of administration. In light of these findings, Likert (1961, p. 61) raises an important question: If this pattern is so consistent, why is it that the majority of supervisors, managers, and top company officers have not arrived at these same conclusions based upon their own experiences? His answer is that most organizations work with inadequate measurement processes. Organizations too often secure measurements dealing only with end-result variables such as production, sales, profits and percentages of net earnings to sales. Likert states that there is another class of variables which significantly influence the end-result ones, and that these other variables are seriously neglected in present measurement practices. The organizational variables are defined by Likert (1967, p. 29) in the following manner: Causal variables include the structure of the organization and management's policies, decisions, business and leadership strategies, skills and behavior. The 'intervening' variables reflect the internal state and health of the organization, e.g. the loyalties, attitudes, motivations, performance goals, and perceptions of all members and their collective capacity for effective interaction, communication, and decision making. The 'end-result' variables are dependent variables which reflect the achievements of the organization, such as its productivity, costs, scrap loss, and earnings. Developmental studies have been conducted in several school systems in Georgia to tap the causal and intervening variables in a systems approach to school organizational achievement. Over 5,000 students, teachers, principals and other certified school staff have taken part in these developmental studies. Appropriate statistical procedures have been applied, and the instrument, the School Organizational Development Questionnaire (SODQ), is now ready for a national study. The National Institute for Educational Research is currently considering a proposal to conduct such a study. It is felt that any consideration of end-result variables through educational achievement testing is incomplete without some consideration of causal and intervening variables. The School Organizational Development Questionnaire purports to give meaningful consideration to these causal and intervening variables.

Testing in the informal educational setting by DAVID O. ONGIRI Pennsylvania State University, Middletown This paper will review recent and current means of evaluating and testing student progress and performance in informal educational systems, with special emphasis on the United States. Special consideration will be devoted to the problem of how an educational system which places primary importance on the needs of the individual child can enhance its instructional effectiveness by an individualized testing program geared to the needs of each child. The following questions will be discussed in relationship to this subject: (1) When does testing fail in the informal school setting? (2) What type of test works best in the informal school? (3) Can the testing needs of the informal school be adequately met by currently available tests, or must new tests be developed specifically for informal education? (4) Can a mass testing program advance the goal of individualization? (5) How do the testing needs of the informal school organization differ from those of the conventional school? (6) Do the results of testing in the informal classroom vary with the cultural background of the students?

Educational anthropology and educational testing by ANNETTE ROSENSTIEL William Paterson College, Wayne Educational tests have frequently been attacked as culturally biased instruments which measure neither demonstrated nor potential ability. For example, the original intelligence tests devised by Binet were geared to French cultural norms and had to be revised for use in America before their results could be accepted as valid. The current, controversial Jensen theory of genetically inherited intelligence can be completely refuted only when intelligence and achievement tests can be devised that are either cross-culturally valid or have been adapted to fit the norms of the culture of the individual being tested. Criterion-referenced tests are also by their very nature norm-referenced tests, since the criteria they utilize are culturally determined. This paper will examine in detail some of the problems that have arisen in the construction of previous educational tests. It will deal with a broad range of cultural problems resulting from the administration of instruments which either failed to conform or were in direct conflict with the prevailing norms of the culture in which they were administered. The experience of the anthropologist with education in primitive societies has given him considerable insight into the cultural factors operative in the educational process. The educational anthropologist analyses the interaction of culture with the various aspects of the educational process. Working in cooperation with the educator and the psychologist, he can help to devise instruments which will provide a fairer assessment of the intelligence, achievement and creativity of a given individual, measured in terms of the norms of his own culture and of criteria that are culturally significant.


Measuring understanding by young children of oral directions

by NANCY ROSER University of Texas, Austin The performances of 225 Black, Anglo, and Mexican-American kindergarten boys and girls are compared on a test of following oral directions, containing key words and concepts most frequently occurring in the directions for standardized reading readiness tests. Children were first assessed individually as they manipulated small objects to demonstrate understanding of spatial concepts, similarities/differences, and left-to-right directionality. Group testing followed, and objects were replaced by pictorial representations, with similar operations being required but in a two-dimensional space. Comparisons among ethnic groups and between sexes were made using a two-way analysis of variance and Scheffé's test for post-hoc multiple comparisons when appropriate. Anglo children scored significantly higher on both sections of the test; Black children scored significantly higher than did Mexican-Americans on the individual portion. There were no significant differences between Blacks and Mexican-Americans on the group test. Implications of the varying levels of difficulty of key words and concepts for different racial-ethnic groups are discussed.

The impact and change of diagnostic testing in the context of educational evaluation by ERICH SCHOTT University of Ulm In the theory and practice of 'educational measurement' two divergent approaches to evaluation can be distinguished: an individual-centered approach, used primarily to discriminate among individuals for the purpose of selection, and an institutional and curricular approach, used primarily to evaluate educational institutions, programs and methods. The two directions of evaluation are based on different value systems, ideologies and educational ideas. They differ especially in the extent to which they express a belief in the necessity and the possibility of changes concerning individual and institutional determinants of behavior. Because of their different aims, each approach needs its own theoretical foundation implying special principles of construction. The problems of educational testing partly derive from the fact that often no distinction is made between the two approaches and their different implications.

Evaluation and testing in educational accountability by MAUREEN SIE Michigan Department of Education, Lansing 'The first results from Michigan's $22.5 million "accountability model" for compensatory education appear to dispute the contention that these programs can't succeed. Under the state-funded "compiled" program, schools establish performance objectives, representing at least one grade level gain, for participating students. The program now reaches 112,000 elementary school children who rank in the bottom 15th percentile in math and reading in 67 school districts. In order for the school district to receive a full $200 per pupil grant in subsequent years, each student must achieve at least 75 percent of the specified objectives.' (Education USA, December 18, 1972). The state of Michigan is conducting a continuing large-scale social action experiment through its compensatory education program. Two hypotheses are being tested: (1) whether school districts can be held accountable for educating the disadvantaged low-achieving pupils who primarily reside in the cities; and (2) what effect money has on educational programs. Norm-referenced and criterion-referenced tests were used in the evaluation and testing of Michigan compensatory education programs. Achievement test results and related information on instructional programs and financial expenditures yielded a massive array of statistics for school year 1971-1972. Program components were delineated to differentiate high- and low-achieving school districts. Demographic variables such as percent of minority students and socioeconomic status were also investigated in this study. This evaluation study demonstrated that a large-scale and systematic approach can be effectively applied to pin-point program effectiveness, delineate costs, measure pupil performance and test the basic assumptions underlying educational accountability.


Formulation and evaluation of an educational measurement model by RICHARD SMITH Northern Illinois University, DeKalb The project represents an attempt to formulate and evaluate an educational measurement model. More specifically, the model represents an attempt to incorporate Gagne's learning taxonomy into Tyler's model of the 'educational act'. The project activities were concerned with: (1) Structuring a unit of beginning junior high school chemistry, starting with the most basic statement possible and then proceeding by adding new statements in such a way that no more than one new concept was added with each statement or rule. (2) Constructing a concept test for each of the key concepts within the statement or rule (the concept items consisted of questions which required the learners to identify examples and non-examples of the concept). (3) Constructing a criterion test in which the correct answers involved the positive transfer of the rule or statement. (4) Administering the tests after normal classroom instruction. (5) Pretesting to determine the extent to which the learners had attained the concepts, as well as the statements and rules. (6) Having the learners go through programmed materials concerned with the identification of examples of the concepts they had missed, as well as some frames concerned with the combining of the concepts to form the rules or statements. (7) Post-testing to determine the effect that teaching the missed concepts had on the ability to positively transfer the statement or rule. The pre-test data were analyzed to determine the extent to which the percentage of students who were able to positively transfer the statements increased with the number of concepts attained within the statement. The post-test data were analyzed to determine the extent to which the number of students who moved from not being able to transfer the statements on the pre-test to being able to make the transfer on the post-test increased with the number of concepts attained.


A model for personalized educational testing by ROBERT J. STARR University of Missouri, St Louis The failure of teachers to systematically consider the variables found in testing has impeded movement toward personalized teaching-learning. The author will present an analytical model for personalized educational measurement which identifies and quantifies previously ignored factors in measurement and evaluation. By conceptualizing these variables within a large-group format, the model permits an instructor to structure personalized testing for each student in his/her classroom. Basically, two information sources are identified, the teacher and the pupil. Each segment is then tapped on a regular schedule to allow recycling of input. Each component may then quantify any unanticipated outcomes of instruction (including student self-selected goals) as well as the anticipated outcomes of instruction. These expected behaviors are indicated by the production of a table of specifications (Bloom, Hastings & Madaus, 1971). With aptitude (Carroll, 1963) of utmost importance in pacing, only the student is able to reliably identify the attention that he/she has given to learning; hence, information regarding students' perceptions of goal achievement must be collected. Summing the numerical quantities results in a series of ratios for each student. Each set forms an individualized and personalized profile of learning which is used to direct each pupil to specific items on a testing instrument. With teacher ratings a part of the ratios, any questions regarding academic quality should be minimized. Thus, implementation of the model indicates that personalized educational testing is possible regardless of the mode or media of learning.

Large-scale essay testing: Implications for test construction and evaluation by DORIS J. THOMPSON and R. ROBERT RENTZ University of Georgia, Athens One of the problems which arises during the process of test construction involves the selection of test items which will most accurately measure specific learning outcomes. Often compromises must be made because of constraints which make the selection of a specific item type impractical for the resources available. These constraints may involve time, cost, or difficulties in objective measurement of certain types of outcomes. The essay item is one which has been limited in use because of these constraints. The state of Georgia, USA, is currently involved in a statewide program of assessment which is designed to measure competency of students at the junior year of their college program in the areas of reading and writing. The decision was made to include an actual writing sample of each student tested in order to evaluate accurately his ability to write. Students are requested to write a 30-minute essay on a specified topic. Since this examination is a requirement for all students within the university system of Georgia, it became necessary to develop methodology which would be effective in the assessment of essays administered on a large scale. The method, which has been used during the past year and a half on more than 35,000 essays, involves a holistic-impressionistic approach to grading. Each essay is graded by three experts in English composition, who score the writing samples by comparing each essay with a 'model' essay on the same topic. All raters are given a standard set of directions for scoring prior to the actual scoring period. Each grader rates an essay on the basis of his overall impression of the paper and its comparison with model essays which have previously been assigned ratings based on a specified performance criterion. The ratings range from a rated failure of 1 to an excellent rating of 4. The reliability of this scoring procedure for large-scale essay grading has been exceptional. Presented are the specific procedures used which have made this statewide program not only feasible but also efficient in terms of time and cost.
Evidence will also be presented which supports the reliability of the scoring procedures involved.
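One common way to quantify the reliability of such multi-rater holistic scoring is to step up the mean pairwise inter-rater correlation with the Spearman-Brown formula. The sketch below is illustrative only; the function names and the sample data are invented for the example and are not the procedure actually used in the Georgia program.

```python
from itertools import combinations

def pearson(x, y):
    """Plain Pearson product-moment correlation, written out to keep
    the sketch dependency-free."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

def composite_reliability(ratings):
    """Reliability of the composite score of k raters: the mean pairwise
    inter-rater correlation stepped up by the Spearman-Brown formula.

    `ratings` is a list of k parallel lists, one per rater, each holding
    that rater's 1-4 holistic score for every essay."""
    pairs = list(combinations(ratings, 2))
    r_bar = sum(pearson(a, b) for a, b in pairs) / len(pairs)
    k = len(ratings)
    return k * r_bar / (1 + (k - 1) * r_bar)
```

With three raters, even a moderate average inter-rater correlation yields a substantially higher reliability for the summed score, which is why multiple independent readings are worth their cost in a large-scale program.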

Time-limit tests by AD. H. G. VAN DER VEN University of Nijmegen In this paper a report is given of a research project on one of the two major types of mental tests, those with a time-limit. Time-limit tests, as opposed to task-limit tests, allow the examinee only a limited time to complete the test. The total number of items employed is usually so large that not even the most accomplished examinee can finish in the allowed time. Any psychological test may be looked upon as an experimental situation. The examinee's performance depends on the conditions imposed by the design of the test. Time-limit tests are defined as follows: (1) The test is administered with a fixed time-limit. (2) The test instruction induces time-pressure, e.g., by the expression: 'Work as fast and as accurately as possible.' (3) To each item an incorrect answer is possible. This condition implies that time-limit tests are not equivalent to pure speed tests (Gulliksen, 1965, p. 320). (4) Each subject can answer each item correctly if: (a) no time-limit is imposed; and (b) no answer is given unless the subject is perfectly sure. The conditions (a) and (b) are called liberal time conditions. (5) All items are multiple choice items. (6) The number of items is such that not even the most rapid examinee can possibly complete all items within the time allowed. This condition guarantees maximal differentiation. Typical examples of time-limit tests are the General Aptitude Test Battery and Thurstone's Primary Mental Abilities. Customary practice in both time-limit and task-limit tests is to use the total number of right items as the test score. In addition to this score one may consider both the number of items attempted and the number of wrong items. In this paper an error score model for time-limit tests is presented. Each person i is assumed to have a constant probability π_i of answering each attempted item correctly. Therefore, for a particular person, given that he has attempted a certain number of items, the number of items correct follows a binomial distribution. The parameter π is called precision. Another parameter α, called speed and related to the number of items attempted, was introduced. A second assumption is that within a specified person i the variance of the number of items attempted is solely dependent on α_i. A third and final assumption is that over persons speed and precision are uncorrelated. The model is tested by comparing the observed and expected correlations between the number of items right and attempted, right and wrong, and wrong and attempted.
Two time-limit tests are used: an American test, the General Aptitude Test Battery, and a Dutch test (the interest, school-achievement and intelligence test). It is possible to find observed score equivalents for the parameters α and π: for α the number of items attempted and for π the proportion of items right can be used. An extensive rationale is given for the third, rather counter-intuitive, assumption, which states that speed and precision are uncorrelated over persons, whereas they probably are completely dependent within persons. The rationale is given by assuming that the true speed α_i and the true precision π_i are themselves stochastic variables. Further, it is assumed that any increase in speed leads towards a decrease in precision and vice versa. More precisely, there exists a linear relationship π_i = a_i + b_i·α_i between speed and precision. For all practical purposes the decision was made not to start from a model based on a nonlinear relationship. Under the assumption that the slope b_i is constant over individuals i, it can be shown that

ρ(α, π) = E_i{σ²(α_i)} / (E_i{σ²(α_i)} + σ²{E_i(α_i)})

so that a relative increase in the between-persons variance of speed leads to a decrease in the correlation. It is also shown that this relation holds a fortiori for the correlation between the number of items attempted and the proportion of items right, ρ(a, π̂).
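Van der Ven's error-score model can be illustrated with a small simulation: draw speed and precision independently over persons, generate binomial right-scores, and inspect the observed correlations among right, wrong, and attempted on which the model is tested. The parameter ranges, function names, and sample sizes below are illustrative assumptions, not values from the study.

```python
import random

def pearson(x, y):
    """Pearson correlation, written out to keep the sketch dependency-free."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / (sum((a - mx) ** 2 for a in x)
                  * sum((b - my) ** 2 for b in y)) ** 0.5

def simulate_time_limit_test(n_persons=2000, seed=7):
    """Each simulated person i attempts a_i items (speed) and answers each
    attempted item correctly with constant probability p_i (precision), so
    the number right is binomial given a_i and p_i.  Speed and precision
    are drawn independently over persons, matching the model's third
    assumption.  Returns the observed correlations (right, attempted),
    (right, wrong) and (wrong, attempted)."""
    rng = random.Random(seed)
    attempted, right = [], []
    for _ in range(n_persons):
        a = rng.randint(20, 60)        # speed: number of items attempted
        p = rng.uniform(0.5, 0.95)     # precision: P(correct | attempted)
        attempted.append(a)
        right.append(sum(rng.random() < p for _ in range(a)))
    wrong = [a - r for a, r in zip(attempted, right)]
    return (pearson(right, attempted),
            pearson(right, wrong),
            pearson(wrong, attempted))
```

Under these assumed parameter ranges, both the right and the wrong counts rise with the number attempted, while right and wrong stay close to uncorrelated, which is the pattern the model's test compares against its expected correlations.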

Information feedback system in three school systems by WILLIAM F. WHITE University of Georgia, Athens As an example of 'formative evaluation and diagnostic', the Information Feedback System (IFS) implemented in three school systems has reached a high level of success. Behavioral objectives are stated in ten basic components. Instruments and data-gathering devices are applied to the total system. Data processing and a timely return of analysis provide the material for decision making. Information is monitored and updated for each child in the program. The IFS has been a working model for four years. Two of its unique characteristics that mark its difference from other formative models are: (1) The whole child is considered in the educational process. Medical-dental and nutritional components are supplemented with social work and psychological services. Parent involvement and staff development are active components in the model. (2) Information feedback on all instructional processes, e.g., criterion-referenced tests, item analyses of standardized tests, and affective-type variables, becomes a part of information exchange in the social settings in which decisions are made. Staff are not merely presented with computer printouts or written reports: all data are examined by personnel and decisions are made in a public setting. The data support the hypothesis that children learn more in quantity and quality in the information system, and more effective teaching is demonstrated by the behavioral analysis of the classroom operation.


Kindergarten evaluation of learning potential (KELP). A new approach to testing by JOHN A. R. WILSON University of California, Santa Barbara, MILDRED C. ROBECK University of Oregon, Eugene KELP was developed to enable teachers to evaluate the probability of school success by actual performance in the classroom. Instruction could be accelerated and intensified, or decelerated and given in smaller steps, as a result of feedback from the use of specified materials with clearly stated behavioral objectives. KELP demonstrates that it is possible to teach young children to learn at complex levels of functioning and that regular classroom teachers can succeed at both the instruction and the evaluation of potential. Items are structured to require learning at the association level and conceptualization of relationships within the associations, and to lead children to become creatively self-directing after they have mastered these materials. KELP emphasizes the importance of recording, on a daily basis, the success of the child. The on-going record provides formative evaluation information to the teacher and helps him continually readjust his teaching in the light of the information available to him. In an open classroom KELP can be used to determine when reading or formal arithmetic can profitably begin. Use of the materials provides developmental experience as well as evaluative information. The principles on which KELP was constructed can be used to build materials for any grade level and in any subject content field. Justification for basing evaluation of future success on the results of teaching by instructors with different qualifications and expertise can be found in American school experience, where college grades are better predicted by high school grades than by any other index. The best justification for KELP is that it works.

Measuring language comprehension by FRANK WOLFE Northern Michigan University, Marquette Each author constructs out of the available particles of his language an accounting of his impressions. Language forms become construction become description; by this process the form manipulator attempts to make evident his perceptions. Any text, whatever the medium, is thus not simply a mimesis, a re-production of a segment of the Reality which 'really' exists as an orderly, already completed Separate, but is itself a postulated reality. An author reports, in language report frames. Each reader brings to the text memory traces of his own previous perceptions. Whether through intuition, sensory apparatus, or (more likely) vicarious experience, each of us identifies Reality as a concatenation of perceptual fixes. Reader-to-text is a human-to-human context; reality is whatever each human has perceived. Each reaches the other by way of a construction. 'What does this author say?' is the reader's first question. 'Why should I accept this accounting as a viable one?' (John Dewey would have him question 'warrantability') is his second. The minimal impression-carrying structure of that description is at once both a report frame and an evidence frame. But before a reader can question warrantability, he must first identify what has been reported. Yet further, before he can judge what has been perceived, he must know how that perceiving is accomplished in language. This, I submit, is syntax. These are the rudimentary report structures:
Doing Frames: (1) Something does. 'Bears hibernate.' (2) Something does to another. 'Bears eat fish.'
Being Frames: (1) Something is described. 'Bears are fat.' (2) Something is something else. 'Bears are clowns.'
These are the evidence frames. (Sufficiency as 'proof' is decided by the reader.)
The Assertion Frame: X is (i.e., 'X exists, so.')
The Observation Frame: X does (i.e., 'X reveals itself in this behaviour.')
The Evaluation Frame: X exists as expected (X is consistent with a certain criterion).
The Directive Frame: X is expected to do (X is directed to behave as would be consistent with the fulfillment of a certain task).
By these terms, a report frame consists of a verb, its subject, and its complement (if any). A 'sentence' is a formalism made up of one or more expression frames, as punctuated. There cannot indeed be any single method of proof, because what is said is not everywhere the same. ('Subject' and 'predicate' is not a satisfactory reduction.) An evidence frame is keyed not by its structure, but by what it might prove if it were identified (tentatively) as one of four proofing devices.

APPENDIX 2

List of participants

Robert D. Abbott J. Stanley Ahmann Michael W. Allen Marshall Arlin Patricia Arlin

S. Amstrong C. Arnold Dan Ashler W.L. Bashaw N. Baumgart P.H. Been M. Elise Blankenship O.O. Bo Jane A. Bonnell C. Boonman

California State University, Fullerton, California 92634, U.S.A. National Assessment, 1860 Lincoln, Suite 300, Denver, Colorado 80203, U.S.A. Ohio State University, Office of Academic Affairs, 1080 Carmack Road, Columbus, Ohio 43210, U.S.A. North Carolina Advancement School, Winston-Salem, North Carolina 27101, U.S.A. University of North Carolina, Educational Research/ Educational Psychology Area, Greensboro, North Carolina 27412, U.S.A. The Polytechnic, Queensgate, Huddersfield, Yorkshire, England Spui 21, Amsterdam, The Netherlands 9 Asbury Avenue, Melrose Park, Penna. 19126, U.S.A. University of Georgia, Educational Research Laboratory, 115 Fain Hall, Athens, Georgia 30602, U.S.A. Macquarie University, North Ryde N.S.W. 2113, Australia Strausslaan 62, Groningen, The Netherlands Northern Illinois University, Department of Special Education, DeKalb, Illinois 60115, U.S.A. Pedagogisk Forskningsinst. Box 1092, Universiteteti Oslo, Blindern, Oslo, Norway Grand Rapids Public Schools, 244 Woodside, N.E. Grand Rapids, Michigan 49503, U.S.A. Instituut voor Pedagogische en Andragogische Wetenschappen, Utrecht University, Trans 15, Utrecht,


T. G. Borgesius J. M. Bosland Frank E. Boxwill J.R. Brassard Joe H. Brown

H.C.D. de Bruyne W.J. Coetzee Roy Cox

B. Creemers

Hans F.M. Crombag J.H. Daniels Larry W. DeBord Annie R. Diaz R.F. Doe C. van Dorp A.L. Egan O. M. Ewert H.K. Fisher A.V.C. Fleischer Joseph Froomkin W. George Gaines

J.C. Garbers R. Gobitz

The Netherlands Utrechtsestraat 33-3, Arnhem, The Netherlands Jan Smitstraat 25-IV, Amsterdam, The Netherlands 496 Jefferson Street, Westbury, New York 11590, U.S.A. Faculty ofEducation, Ottawa University, 1245 Kilborn, Ottawa, Ont. KIN 6N5, Canada University of Kentucky, College of Education, 147 Washington Avenue, Lexington, Kentucky 40506, U.S.A. Varkenmarkt 2, Utrecht, The Netherlands Education Department, Orange Free State, South Africa University of London, University Teaching Methods Unit, 55 Gordon Square, London WC 1H Ont., England Instituut voor Pedagogische en Andragogische Wetenschappen, Utrecht University, Maliebaan 103, Utrecht, The Netherlands Educational Research Center, University of Leyden, Stationsweg 46, Leyden, Netherlands Griegplein 237, Schiedam, The Netherlands Institute of Urban Research, University of Mississippi, University, Mississippi38677, U.S.A. Federal City College, 1420 New York Avenue, N.W. Room 909, Washington, D.C. 20036, U.S.A. I.E.T., Univ. of Surrey, Guildford, Surrey, England Jan Willem Frisodreef 59, Katwijk, The Netherlands State University College, 1300 Elmwood Avenue, Buffalo, New York 14222, U.S.A. Hustadtring 77, Bochum 4630, West Germany Room 1643, Mowat Block, Queen's Park, Toronto, Ontario M7A 1M5, Canada Stengàrdsvaenge 72,2800 Lyngby, Denmark 1015 Eighteenth Street N.W., Washington, D.C. 20036, U.S.A. Louisiana State University, Department of Elementary and Secondary Education, Lake Front, New Orleans, Louisiana 70122, U.S.A. Mauritsstraat 76, Utrecht, The Netherlands Eindhoven University of Technology, P.O. Box 513, Eindhoven, The Netherlands


Measurement and Evaluation Service, Department of Education, Quebec, Quebec, Canada University of Georgia, College of Education, Athens, Thomas M. Goolsby Georgia 30601, U.S.A. Psychological Laboratory, University of Amsterdam, Adriaan D. de Groot Weesperplein 8, Amsterdam, The Netherlands Dato N.M. de Gruijter Educational Research Center, University of Leyden, Stationsweg 46, Leyden, The Netherlands Derde Hambaken 7, Den Bosch, The Netherlands J. Gulmans Institute for Educational Research, Mölrolalsvägen 36, J.E. Gustafsson 412 63 Göteborg, Sweden Institute for Educational Research, University of Oslo, K.A. Hagtvet Box 1092, Oslo 3, Norway Vondelstraat 66, Amsterdam, The Netherlands A. Haitsma University of Amsterdam, Weesperplein 8, Amsterdam, C. Hamaker The Netherlands Public Systems Research Inc. 149 Main Street, East Shelley A. Harrison Setauket, L.I., New York 11783, U.S.A. M. van Hemert-Elstrodt Prinsengracht 903, Amsterdam, The Netherlands Pedagogiska Institutionen, Umea University, Umea, Sten Henrysson Sweden Trinity College, Dublin, Ireland J. Heywood Tallbacksvägen 14, Uppsala, Sweden L.G. Holmström C.I.T.O., Post Office Box 1034, Arnhem, The NetherG. van der Hooft lands George M. Huntley School of Education, Sir George Williams University, Montreal, Canada L. Hiirsch Universität Bern, Forschungsabteilung, Sennweg 2, 3012 Bern, Switzerland K. Ingenkamp Hauptstrasse 53, Leinsweiler, Pfalz 6741, West Germany A.M. St. James 51 Hallowell Street, Montreal 215, Quebec, Canada G . G . H . Jansen C.I.T.O., P.O. Box 1034, Arnhem, The Netherlands Eugene Jongsma 1504 Melody Drive, Metairie, La. 70002, U.S.A. J.H. Jooste Transvaal Education Dept., P.O. Box X76, Pretoria, Transvaal, South Africa Max van der Kamp Keizersgracht 209IX1, Amsterdam-C, The Netherlands Richard Keene Utah State Board of Education, 1400 University Club Building, 136 East South Temple Street, Salt Lake City, Utah 84111, U.S.A. Pijlstaartlaan 4, Vinkeveen, The Netherlands I. 
Labordus Ashorne Hill College, Ashorne Hill, nr. Leamington A. G. Lees J. G. Godbout


Spa, Warwickshire, England IPN Universität, 23 Kiel, West Germany Putman City Schools, 5417 N.W. 40th, Oklahoma City, Oklahoma 73122, U.S.A. M. van der Linden-Mulder Oude Kraan 68, Arnhem, The Netherlands Marlaine Lockheed Katz Educational Testing Service, Princeton, New Jersey 08540, U.S.A. Northern Illinois University, Department of Special Jean E. Lokerson Education, DeKalb, Illinois 60115, U.S.A. University of Londen, Goldsmiths' College, New Cross Nita Lougher S.E. 14, London, England Northeastern University, College of Education, Mervin D. Lynch Boston, Mass. 02115, U.S.A. Southern Regional Examination Board, 53, London Henry G. Macintosh Road, Southampton, S094YL, England 1638 N. Sycamore, Tucson, Ariz. 85712, U.S.A. Elinor R. Markert University of Maryland, Zengerstrasse 1, Heidelberg Ben Massey 6900, West Germany Rue de Berne 36, Geneva 1211, Switzerland P.E.J. Mengal Educational Testing Service, Princeton, New Jersey, Samuel Messick U.S.A. R.K. Instituut Henricus, Nijmeegse Baan 61, Nijmegen, M.J.G. Mommers The Netherlands Gérard J. A. G. Monfils CLAD - Faculté des Lettres, University of Dakar, Dakar, Senegal, West Africa United States Commission on Civil Rights, Suite 1015, Philip Montez United States Federal Courthouse, 312 North Spring Street, Los Angeles, California 90012, U.S.A. Merelstraat 105, Leiderdorp, The Netherlands J. M. Moonen State University of New York, Faculty of Social Robert L. Morasky Sciences, Plattsburgh, New York 12901, U.S.A. 208 Third St. S. W., Waverly, Iowa 50677, U.S.A. M. Moy University of South Africa, Department of Empirical J.C. Mulder Education, P.O. Box 392, Pretoria, South Africa College of Education, University of Georgia, Athens, David J. Mullen Georgia 30601, U.S.A. M. Lehrke Leslie Lewis

Selma Mushkin K.R. Myers S.C.M. Naude P.R.T. Nel D.L. Nuttall

3620 Prospect St., Washington D.C. 20007, U.S.A. Box 97, Madison, Pa. 15663, U.S.A. Private Bag X122, Pretoria, South Africa National Education Dept., Pietermaritzburg, South Africa 25 The Oaks, London Road, Bracknell, Berkshire


RG12 ZXG, England K.P.C., Oranje Nassaulaan 6-8, 's Hertogenbosch, The Netherlands Dept. of Educational Administration, The University F.D. Oliva of Calgary, 2920 - 24 Avenue N. W., Calgary, Alberta, Canada B. Olivier Waalstraat, Kaapstad, South Africa David O. Ongiri Pennsylvania State University, The Capitol Campus, Middletown, Pennsylvania 17057, U.S.A. M.W.H. Peters-Sips C.I. T.O., P.O. Box 1034, Arnhem, The Netherlands R. Pike 16 Elderwood Drive, Toronto, Ontario M5P 1W5, Canada G.J. Pollock 4 Clifford Road, Stirling, Scotland C. A. Poortvliet Vasco da Gamastraat 4III, Amsterdam, The Netherlands T. Neville Postlethwaite International Institute for Educational Planning, 9, Rue Eugène - Delacroix, Paris 16e, France Dr. Wumkesstraat 29, Buitenpost, The Netherlands L. Postma Hermodsgade 28, Copenhagen, Denmark B. Prien Room 435, County Hall, London S.E. 1, England H. Quigley Human Sciences Research Council, Private Bag X41, J.H. Robbertse Pretoria, South Africa A.E.N. Rommes Technische Hogeschool Twente, P.O. Box 217, Enschede, The Netherlands Abby G. Rosenfield Harvard University, 1450 William James Hall, 33 Kirkland Street, Cambridge, Mass. 02138 Annette Rosenstiel 4 Old Mill Road, Manhasset, New York 11030, U.S.A. Dept. Educational Psychology, Educational Faculty. I. Roth University of Stellenbosch, Stellenbosch, Cape, South Africa Gavriel Salomon School of Education, Hebrew University of Jerusalem, Jerusalem, Israël J. P. Schnitzen 4372 Fiesta Lane, Houston, Texas 77004, U.S.A. E. M. Schoo Assumburg 52, Amsterdam, The Netherlands H.H. Schoonenberg Oranje Nassaulaan 6-8, 's Hertogenbosch, The Netherlands J. J. F. Schroots Nederlands Instituut voor Praeventieve Geneeskunde, Wassenaarseweg 56, Leyden, The Netherlands Parkstrasse 14, 7900 Ulm, Donau, West Germany Erich Schott 155 Lamartine, Chateau Guay Centre, Quebec, J. M. Sincennes Canada O. Sletta University of Trondheim, Institute of Education, M.J.G. Nuy


B. A. Smith Charles W. Smith

Richard Smith B.W.G.M. Smits Richard E. Snow W. van Soest J.W. Solberg K.A. Spelling Robert J. Starr Q. van Staveren A.L. Stramnes D. Thio Doris Thompson Marion L. Thornhill D. Tromp A. Vasquez A. H. G. van der Ven A.C. Verhoeven A.D. Verhoeven Virginia H. Vint M.J. M. Voeten Ingemar Wedman P. Weeda

William F. White Wynand H. Wijnen

7000 Trondheim, Norway Florida International University, Tamiami Trail, Miami, Florida 33144, U.S.A. Northern Illinois University, Department of Elementary Education, Williston Hall no. 323, DeKalb, Illinois 60115, U.S.A. Northern Illinois University, DeKalb, Illinois 60115, U.S.A. Malvert 62-29, Nijmegen, The Netherlands 937 Lathrop Place, Stanford, Calif. 94305, U.S.A R.I.T.P., Herengracht 510, Amsterdam, The Netherlands C.I.T.O., P.O. Box 1034, Arnhem, The Netherlands DLH - Emdrupvei 101, 2400 Copenhagen NV, Denmark University of Missouri, School of Education, 8001 Natural Bridge, St. Louis, Missouri 63121, U.S.A. Patmosdreef 64, Utrecht, The Netherlands Sorlistien 21,1473 Skarer, Norway C.I.T.O., P.O. Box 1034, Arnhem, The Netherlands 160 - Apt. 3, Scandia Circle, Athens, Ga. 30601, U.S.A. 5417 N. W. 40th, Putnam City, Oklahoma, U.S.A. Spui 21, Amsterdam, The Netherlands 3904 N. Prospect, Milwaukee, Wisconsin 53211, U.S.A. Okapistraat 35, Nijmegen, The Netherlands Nassaulaan 22, Oegstgeest, The Netherlands De Haspel 10, Maiden 6844, Gld., The Netherlands 1550 Heights Boulevard, Winona, Minnesota 55987, U.S.A. Weezenhof65-20, Nijmegen, The Netherlands Pedagogiska Institutionen, Umea University, Umea, Sweden Instituut voor Pedagogische en Andragogische Wetenschappen, Utrecht University, Maliebaan 103, Utrecht, The Netherlands University of Georgia, Aderhold Hall, Athens, Georgia 30602, U.S.A. Center for Research in Higher Education, University of Groningen, Oude Kijk in 't Jatstraat 28, Groningen, The Netherlands

B. Wilbrink John A.R. Wilson Richard M. Wolf Frank Wolfe C. Wright Florence S. Young Anton Zrzavy


C.O.W.O. Spui 21, Amsterdam, The Netherlands Department of Education, University of California, Santa Barbara, California 93106, U.S.A. Teachers College, Columbia University, Box 165, New York, N.Y. 10027, U.S.A. Northern Michigan University, 303 North Fourth Street, Marquette, Michigan 49855, U.S.A. Sorisdale, Lanark, Scotland Shenandoah County School Board, Madison District, Edinburg, Va. U.S.A. Forsthausgasse 1513116,1200 Vienna, Austria

Name index

Abbott, R.D., 297 Ahmann, S., 75 Airasian, P.W., 112, 116, 120, 135, 203, 216 Alkin, M.C., 106 Allen, M.W., 293, 296 Anderson, J.R., 259, 260, 270 Anderson, R.C., 220, 224, 225 Arlin, P., 298 Atkinson, R.C., 204, 216, 258, 270 Baker, E.L., 120, 135 Baker, R.L., 127, 135 Barrows, T., 173, 177 Bartlett, F.C., 264, 270 Barten, K., 217 Bashaw, W.L., 298 Becker, H.S., 198, 202 Berliner, D.C., 234, 246, 248, 251, 253, 254 Blaine, G., 194, 202 Blankenship, 311 Block, J.H., 118, 142, 146, 147, 179, 191, 203, 204, 216 Bloom, B.S., 14, 27, 104, 106, 178, 179, 181, 183, 191, 203, 204, 209, 216, 217, 219, 224, 257, 270, 293, 296 Bonnell, J.A., 299 Bormuth, J.P., 125, 135 Born, D.G., 220, 222, 224 Boxwill, F.E., 300 Bracht, G.H., 234, 246, 284, 285, 289 Bradley, J.P., 174, 177 Brigman, S.L., 298 Brim, O.G., 223, 224

Brown, J.H., 301 Brown, N.W., 302 Bunderson, C.V., 236, 246, 275, 276, 279 Cahen, L.S., 234, 246, 248, 251, 253, 254 Campbell, D.T., 78, 188, 191 Carroll, J.B., 145, 147, 179, 204, 217, 282, 283, 284, 286, 287, 288, 289 Cartier, F.A., 140, 147 Cattell, R.B., 238, 245, 246, 273, 279 Cleary, T., 131, 135 Coffman, W.E., 106 Cohen, M.J., 261, 270 Coleman, J.S., 49, 72, 75, 76 Corey, J.R., 220, 225 Cox, R.C., 110, 113, 116, 129, 135, 143, 147 Cox, R.J., 194, 202 Crombag, H.F., 264, 266, 270 Cronbach, L.J., 110, 117, 128, 135, 189, 191, 230, 234, 235, 246, 248, 251, 252, 254, 266, 270, 272, 273, 274, 275, 277, 279, 281, 286, 287, 289 Crutchfield, R.S., 221, 225 Dahl, T., 109, 117, 130, 136 Daniels, L.F., 174, 177 Davis, F.B., 120, 132, 136 DeBord, L.W., 303 Dewey, J., 175, 176, 177 Diaz, A.R., 304 Donlon, T.F., 185, 191 Dowalby, F.J., 251, 254 Dunham, J.L., 236, 246, 276, 279 Dymond, R., 12, 27


Ebel, R.L., 110, 117, 132, 136, 137 Edel, A., 176, 177 Egan, A.L., 305 Egan, D., 266 Eisner, E.W., 110, 117 Elstein, A.S., 264, 270 Emrick, J.A., 112, 117 Etzioni, A., 174, 177 Evans, G., 274, 279 Ewell, K.W., 217 Falstrom, P.M., 297 Feather, N.T., 223, 224 Feldt, L.S., 106 Ferguson, R.L., 112, 117 Ferster, C.B., 220, 225 Fhanér, S., 110, 112, 117 Firestone, I.J., 224 Fischer, F.E., 185, 191 Fiske, D.W., 188, 191 Fitts, P.M., 277, 279 Foshay, A.W., 30, 51 Fox, W.L., 273, 280 Frase, L.T., 220, 225 Fricke, R., 186, 191 Gagné, R.M., 181, 183, 219, 220, 225, 263, 270 Geer, B., 198, 202 Getzels, J., 174, 177 Glaser, R., 75, 102, 107, 109, 112, 117, 139, 147, 178, 179, 191, 203, 217, 245, 246, 273, 277, 279, 281, 287, 289 Glass, D.C., 224 Glass, G.V., 285, 289 Gleser, G.C., 117 Gobits, R., 305 Goolsby, T., 306, 317 Gorth, W.P., 112, 117 Gouldner, A.W., 176, 177 Graham, G.T., 110, 116 Gray, W.S., 217 Green, B.A., 220, 225 Green, D.R., 272, 279 Green, C., 298 Greeno, J.G., 266, 268, 270 Gronlund, N.E., 137, 142, 147 de Groot, A.D., 27, 88, 264, 270 de Gruijter, D.N., 261, 270 Guilford, J.P., 235, 246 Gulliksen, H., 102, 107, 108, 117 Guthrie, J.T., 223, 225

Guttman, L., 110, 117 Hambleton, R.K., 110, 111, 112, 117 Hamilton, N.R., 278, 279 Hansen, D.N., 276, 280 Hardt, R.H., 241, 246 Harris, C., 120, 136 Harrison, S.A., 306 Harvey, O.J., 253, 255 Hastings, J.T., 178, 179, 192, 203, 217, 224, 257, 270 Hauck, W.E., 220, 225 Herbert, W.A., 220, 222, 224 Heulinger, J., 224 Hieronymus, A.N., 206 Hilgard, E.R., 253, 255 Hilton, T., 131, 135 Hirsch, E., 128, 136 Hively, W., 110, 117, 125, 128, 130, 136 Hofstee, W.K.B., 186, 191 Horn, M., 203, 217 Hsu, Tse-Chi, 114, 117 Hughes, E., 198, 202 Hunt, D.E., 241, 246, 253, 254, 255 Huntley, G.M., 307 Husek, T.R., 109, 118, 129, 132, 136, 143, 147, 186, 191 Husén, T., 27, 34, 51 Inhelder, B., 275 Ivens, S.H., 110, 111, 117 Jackson, D.N., 175, 177 Jackson, R., 110, 111, 117 Jason, H., 270 Jenkins, J.J., 259, 270 Jensen, A.R., 273, 279 Johnston, J.M., 220, 225 Jones, T.C., 174, 176 Kagan, I., 223, 225, 276, 279 Kagan, N., 270 Karlins, M., 253, 255 Katona, G., 263, 270 Keene, R., 309 Keller, C.M., 120, 136 Keller, F.S., 220, 225 Kent, G.W., 221, 225 Kim, H., 203, 217 Kimbrell, G.McA., 220, 225 Kirkland, M.C., 221, 225 Klaus, D.J., 102, 107

Klausmeier, H.J., 203, 217 Klein, S.P., 106, 119, 130, 132, 136 Knipe, W., 203, 217 Knutzen, J.J., 220, 225 Koen, B.O., 220, 225 Kogan, N., 276, 279 Koran, M.L., 251, 255, 273, 274, 276, 277, 279 Kosecoff, J.B., 119, 130, 136 Krahmer, E.F., 217 Krech, D., 221, 225 Kriewall, T.E., 110, 112, 117, 128, 136 Langerak, W.F., 260, 270 Lauwerys, J.A., 148, 162 Lesniak, R.J., 309 Levin, J.R., 286, 289 Levin, L., 113, 114, 117 Lewis, L., 310 Lindquist, E.F., 209, 217 Linn, R.L., 131, 136 Livingston, S.A., 106, 110, 111, 117, 143, 147, 186, 191 Livson, N.L., 221, 225 Lloyd, K.E., 220, 225 Lockheed Katz, 308 Lokerson, J.E., 311 Lord, F.M., 108, 118, 160, 162 Loue, W.E., 203, 217 Lougher, N., 312 Loupe, M.J., 270, 271 Lubin, A., 285, 289 Lundin, S., 136 Lynch, M.D., 313 Madaus, G.F., 112, 116, 120, 135, 178, 179, 191, 203, 217, 224, 257, 270 Mager, R.F., 124, 136, 181, 183 Mandler, G., 259, 270 Marascuilo, L.A., 286, 289 Maritain, J., 174 Marks, S., 249, 255 Marton, F., 110, 113, 114, 116, 117, 118 Mathews, W., 314 Maxwell, G., 136 Mayo, S.T., 145, 147 McArthur, C., 194, 202 McDonald, F.J., 251, 255, 276, 279 McLelland, D., 253, 255 McMichael, J.S., 220, 225 Mechanic, D., 198, 202 Meleca, C.B., 293, 296


Melton, A.W., 249, 255, 274, 279 Merrill, M.D., 205, 217 Messick, S., 173, 175, 177, 277, 279 Miller, C., 197, 202 Millman, I., 125, 133, 136 Mischel, W., 277, 279 Mitchell, J.V., 286, 289 Montez, P., 315 Moore, J.W., 220, 225 Morasky, R.L., 316 Morris, C.J., 220, 225 Mulder, J.C., 316 Mullen, D.J., 317 Murray, J.E., 277, 279 Myers, J.A., 293, 296 Myrow, D.L., 220, 224 Nanda, H., 110, 117 Niedermeyer, F.C., 252, 255 Nitko, A.J., 75, 109, 112, 113, 117, 118, 178, 179, 191, 287, 289 Novick, M.R., 106, 108, 118 Nunnally, J.C., 188, 191 Oetinger, A., 249, 255 Okada, M., 252, 255 O'Neil, H.F., 276, 280 Ongiri, D.O., 318 Osburn, D.G., 113, 115, 118 Ozenne, D.G., 109, 113, 118, 130, 136 Pace, C.R., 202 Page, S.H., 117 Parlett, M., 197, 202 Patterson, H.L., 117 Peaker, C., 77 Pearlstone, Z., 259, 270 Pennypacker, H.S., 220, 225 Philabaum, C.M., 293 Piaget, J., 275 Piper, R.M., 271 Popham, W.J., 106, 109, 113, 114, 115, 118, 124, 129, 130, 132, 136, 143, 147, 186, 191 Popper, K.R., 264, 271 Postman, L., 259, 271 Quie, A.H., 120, 136 Rabehl, G., 136 Rajaratnam, N., 117 Rasch, G., 79


Ravitch, M., 118 Ray, H.W., 240, 246 Reid, C., 273, 280 Rentz, R.R., 298, 323 Robeck, M.C., 273, 280, 327 Roderick, M.C., 220, 225 Rogers, C.R., 12, 27 Rosenstiel, A., 319 Roser, N., 320 Rothkopf, E.Z., 220, 225 Roudabush, G.E., 129, 130, 136 Russell, W.A., 259, 270 Ryle, A., 202 Salomon, G., 249, 251, 252, 255, 285, 289 Sarason, S.B., 166, 168, 177 Scanlon, D.G., 148, 162 Schott, E., 320 Schroder, H.M., 253, 255 Scotti, R.L., 313 Scriven, M., 165, 178, 179, 219, 225 Seiber, J.E., 275, 279 Seibert, W., 273, 280 Selz, O., 264, 271 Sension, D., 136 Shavelson, R., 110, 118 Shearer, J.W., 250, 252, 255 Shiffrin, R.M., 258, 262, 270, 271 Shulman, L.S., 186, 191, 250, 255, 269, 270, 271, 289 Sie, M., 321 Sieber-Suppes, J.E., 251, 252, 255 Skager, R., 121, 127, 128, 134, 136 Skinner, B.F., 173, 174, 177, 220, 225 Smith, R., 322 Snow, R.E., 230, 234, 235, 246, 249, 250, 251, 252, 254, 255, 272, 273, 275, 276, 277, 279, 280, 286, 287, 289 Snyder, B., 198, 202 Spielberger, C.D., 276, 280 Stake, R.E., 175, 177, 202 Stallings, J.A., 242, 244, 247 Stanley, J.C., 106, 285, 289 Starr, R.J., 323

Stoddard, G., 275, 280 Stolte, J.J.B., 217 Sullivan, H.J., 252, 255 Suppes, P., 204, 217 Tallmage, G.K., 250, 252, 255 Taylor, J.E., 273, 280 Thompson, D.J., 323 Thorndike, E.L., 139, 147 Thorndike, R.L., 51, 178, 191 Tiedeman, H.R., 220, 225 Tobias, S., 220, 225 Tomkins, S.S., 175, 177 Tucker, L., 175, 177 Tulving, E., 259, 271 van Tuyl van Serooskerken, E.H., 270 Uhl, N.P., 175, 177 Underwood, B.J., 259, 260, 271 Vargas, J.S., 113, 116, 129, 135, 143, 147 van der Ven, A., 324 Vogel, M., 217 Washburne, C.W., 204, 217 Weber, M., 176 Wedman, J., 113, 115, 116, 118 Welford, A.T., 256, 271 White, W.F., 326 van Wieringen, A.M.J., 183 de Wijkerslooth, J.L., 270 Wilson, H.A., 128, 137 Wilson, J.A.R., 273, 280, 327 Wirters, D.R., 221, 225 Witkin, H.A., 237, 247 Wittrock, M.C., 268, 271 Wolfe, F., 327 Wood, D.A., 137 Wood, G., 259, 260, 271 Wood, L.E., 217 Yost, M., 109, 118 Zweig, R., 128, 137

Subject index

ability to learn, 235 acceptability of theory, 16, 22, 23 accountability, 321 accreditation model, 66, 67 achievement measures, — multivariate nature of, 276, 277 — general and specific, 277 — design of, 277, 278 achievement motivation, 303 achievement test, 179-182, 188-190 achievement variance, 203-216 adaptive education, 178 advance organizers, 236 age groups, 81 α-error, 112 analysis of variance, 109 ANPA Newspaper Test, 153 anthropology, 319 anxiety, 238, 251, 252 aptitudes, 229, 281-285, 287, 288 aptitude measures — general, 272, 273 — fluid and crystallized, 273 — Level I and Level II, 273 aptitude treatment interaction, 229-246, 248-254, 266-268, 281, 282, 284-286, 288, 289, 297 attitudes, cognitive infrastructure of, 20 Australia, 31, 34, 47 behaviorism, 13, 19 Belgium, 30, 31, 34, 49 Beloe Report, 148, 162 β-error, 112

California Achievement Test, 242, 243 Certificate, — General Certificate of Education (G.C.E.), 148 — Certificate of Secondary Education (C.S.E.), 148 Chile, 34 Classroom Behavior Task, 309, 310 cluster analysis, 182 coefficient alpha, 189, 190 coefficient alpha stratified, 189, 190 cognitive development, 305 cognitive style, 237, 238 cognitive variables, 274, 275 Coleman Report, 72, 75, 76 comparative evaluation, 10 compulsivity, 239 computer managed instruction (CMI), 293 conceptual complexity, 237, 241, 253 conditioning, 19 content validity, 130 contingency coefficient, 111 contract, educational, 24, 25 correlation, 157 — product-moment, 111 — biserial, 113 — point biserial, 113, 156 course evaluation, 192-202 course-structure, 199-202 creative aptitude, 313 criterion level, see pass-level criterion performance, 301 criterion-referenced evaluation, 178, 186 criterion referenced tests, 74, 75, 101,


108-112, 115, 116, 120-123, 139-146, 203 criterion score, 155 critical limit, see pass-level curriculum change, 167-169 curriculum outcomes, — intended, 166, 167, 169, 170 — unintended, 166, 167, 170 cut-off limit, see pass-level design, interrupted time-series design, 78 regression discontinuity design, 78 diagnostic evaluation, 179, 186 diagnostic testing, 320 discrimination, — index, 112-115 — power, 113 educational change, 165-176, 300 educational improvement, 165-176 educational organizations, 306, 317, 318 educational outcomes, 304 England, 30, 31, 34 Equality of Opportunity Survey, see Coleman Report essay, 323, 324 evaluation, 108, 234, 240-246, 302, 310, 316, 321 — criterion referenced, 178, 186 — diagnostic, 179, 186, 305, 306 — of educational programs, 299, 300, 309 — formative, 165, 178-186, 188, 192, 193, 203-216, 222, 269, 298, 311 — inverted diagnostic, 186 — of measurement model, 322 — methodology, 165-176 — norm referenced, 186 — role of ideological perspective, 173-176 — summative, 165, 172, 222 — time perspective, 172-173 evaluative research, 165-176 examinations, 194-196 exercise development, 82, 83 extraversion v. introversion, 239 factor analysis, 182 field independence, 138 Finland, 30, 31, 34 fluid v. crystallized ability, 238, 245 France, 30, 31, 34, 40 free recall, 259 generalizability, 110, 111

Geography, 30 Germany, 30, 31, 34, 47 goal purity, 25 grades, 220, 222 group comparisons, 84-85 group differences, 85-86 Guttman-analysis, 182 homogeneity, 105 Hungary, 34 Illinois Test of Psycholinguistic Abilities, 242, 243 independence v. conformity, 239 India, 34, 40, 41, 51 individual differences, 215, 216, 258, 262, 266-269 individualization, 229, 232, 245 individualized instruction, 248, 292 informal education, 318, 319 information feedback, 326 instruction, — inductive, 252 — deductive, 252 — didactic v. inquiry oriented, 234, 237, 239 — discovery method, 253 — matching environments to, 253 — psychological functions of, 249, 250 — structured v. unstructured, 238, 239, 241 — visual-spatial v. verbal, 236 Instructional Objectives Exchange, 144, 147 instructional treatment, 230 integrated studies, 150 intelligence, 253 interaction, — disordinal, 284-286 — ordinal, 285, 286, 288 — second-order, 286 internal consistency, 181, 188-190 International Association for the Evaluation of Educational Achievement (IEA), 28, 30, 38, 41, 50, 51, 70, 73, 74, 75, 76, 78 International Educational Achievement Project, see IEA Iran, 34, 40, 41, 49 Ireland, 34 Israel, 30, 31, 34 Italy, 34 item-analysis, 113, 115

item construction, 124-129 item discrimination, 143, 144 item sampling, 105 item, sensitivity of, 130 item selection, 124-129 Japan, 31, 34 Keller Method, 220, 222 Kindergarten evaluation of learning potential (KELP), 327 knowledge, acquisition of, 256-262 Kuder-Richardson formula no. 20, 189, 190 Language comprehension, 327 learner report, 20-22 learning, 310 — activities, 60-69 — areas, 81, 82 — avoidance, 243 — by discovery, 268 — counselor, 295 — effects, 17 — environment, 290 — hierarchies, 180, 181 — rate, 203-216 — theories of, 234 — time, 203-216 — time needed for, 186 lecture v. small groups, 239 limit, see pass-level Listening Comprehension Test, 160 Literature, 39 management strategy, 294 mastery learning, 125-128, 131-133, 145, 146, 178, 180, 184, 186, 203-216, 219-223, 269 mathematics, 30, 31, 32 memory, 258 — organization in, 259-262 — retrieval, 259 memory and reasoning abilities, 236 sequence memory ability, auditory and visual, 242-244 models of item responses, 184, 185, 190 moderation of grades, 149 modularized instruction, 293 monitoring test, 152 motivation, 223 multitrait-multimethod, 188


multivariate techniques, 182 National Assessment of Educational Progress (NAEP), 28, 74, 81-86 national educational system (NES), 9-27 the Netherlands, 31, 34, 49 New Zealand, 34 non-verbal ability, 30 norm referenced tests, 101, 108-110, 138-146 norms, 158, 159, 299, 317 objectives, 16, 18, 70-72, 104, 109, 110, 113, 123, 124, 180-183, 189, 256, 263 — attitudinal, 20 — development of, 82, 124 — non-cognitive, 18, 23 — not-easily measurable, 18, 21, 26 — sufficiency of, 16 — types of, 18 operationalism, 13, 104 pass-level, 105, 111, 112, 114, 141, 142 pass-limit, see pass-level pass-mark, see pass-level performance contracting, 240 personality variables, 275, 276 personalized system of instruction (PSI), 220 personalized testing, 323 φ-coefficient, 114 phonics v. whole word in reading instruction, 237, 242 Piaget-tasks, 298 Poland, 30, 34 policy decision-making, 172-176 policy goals, 16, 26 post-test, 109, 113, 115 pre-test, 109, 113, 115 problem-solving, 263, 268, 269 — style, 269 proctors, 220, 221 programmed instruction, 235, 240, 245, 262, 269 programming systems, 316 process analysis, 274, 275 p-value, 185 Rasch model, 298, 299 rate variance, 203-216 rational reconstruction, 264, 266 reading, 312


reading comprehension, 30, 39, 46 regression analysis, 42-46, 75-76 reliability, 110, 111, 143, 181, 185, 186, 188, 189 — r perbis, 185 retentivity, 24 rule learning, 268 Rumania, 34 sampling, 32, 39-41, 72, 73, 83, 84 scalogram analysis, 110, 111 schematische Antizipation, 264 Science, 30, 39 scores, interpretation of, 132 Scotland, 30, 31, 34 selective education, 180, 186, 187 sensitivity index, 113 serial learning, 259 sex bias, 308 simulation, 309, 310, 311 skill, acquisition of, 256, 257, 263-269 skill hierarchies, 307 standard error of measurement, 110 student — as adolescent, 194 — as consumer, 193, 200 — expectations, 197 — identity, 198 — projects, 201 supplantation, 252 Sweden, 30, 31, 34, 40, 47, 48 Switzerland, 30 systems approach, 13

target population, 32, 35, 47 teaching, improvement of, 185, 190 test bias, 131 test directions, 320 testing, bicultural-bilingual children, 315 — continuous, 306 — game, 291 — organizations, 87-97 — reports, 314 Test on the Understanding of Science and Scientific Principles (TOUS), 38 tests, achievement tests, 179-182, 188-191 — flexilevel, 160 — instructional dependency of, 127 — time-limit, 324 test theory, classical, 180, 186, 187 test question banks, 293 Thailand, 34 threshold-value, see pass-level trait-treatment interaction, 284 transfer, 263, 266 Upward Bound program, 241 USA, 30, 31, 34, 40, 47 validity, 109, 110, 143 — construct, 234 — internal, 249 variability of scores, 142, 243 variance, 285 — accounted for, 252 Yugoslavia, 30