The Ethics of Language Assessment: A Special Double Issue of Language Assessment Quarterly
ISBN 0805895256, 9780805895254

First Published in 2004. Routledge is an imprint of Taylor & Francis, an informa company.


Table of contents:
Cover
Copyright
Contents
EDITORIAL
ARTICLES
Introduction: Language Testing and the Golden Rule
Thinking About a Professional Ethics
Stakeholder Involvement in Language Assessment: Does it Improve Ethicality?
A Code of Practice and Quality Management System for International Language Examinations
The Role of a Language Testing Code of Ethics in the Establishment of a Code of Practice
Using the Modern Language Aptitude Test to Identify a Foreign Language Learning Disability: Is it Ethical?
Ethical Considerations in the Assessment of the Language and Content Knowledge of U.S. School-Age English Learners
Filmic Portrayals of Cheating or Fraud in Examinations and Competitions

LANGUAGE ASSESSMENT QUARTERLY
An International Journal

Editor
Antony John Kunnan, California State University, Los Angeles

Associate Editors
Fred Davidson, University of Illinois, Urbana-Champaign
Nick Saville, University of Cambridge
Carolyn Turner, McGill University

Editorial Advisory Board
J. Charles Alderson, Lancaster University
Desmond Allison, National University of Singapore
Lyle Bachman, University of California, Los Angeles
Micheline Chalhoub-Deville, University of Iowa
Liying Cheng, Queen's University, Kingston
Christine Coombe, Higher Colleges of Technology, Dubai
Alister Cumming, University of Toronto
Alan Davies, University of Edinburgh
Ahmed Dewidar, The American University in Cairo
Hossein Farhady, Iran University of Science and Technology
Janna Fox, Carleton University, Ottawa
Sally Jacoby, University of New Hampshire
Dorry Kenyon, Center for Applied Linguistics, Washington, D.C.
Anne Lazaraton, University of Minnesota
Judith Liskin-Gasparro, University of Iowa
Brian Lynch, Portland State University
Constant Leung, King's College, London University
Tom Lumley, University of Melbourne

Reynaldo Macias, University of California, Los Angeles
Rama Mathew, Delhi University
Penny McKay, Queensland University of Technology, Brisbane
Tim McNamara, University of Melbourne
Mats Oscarson, Goteborg University
James Purpura, Teachers College, Columbia University
David Qian, Hong Kong Polytechnic University
Miyuki Sasaki, Nagoya Gakuin University
Mary Schedl, Educational Testing Service, Princeton
Elana Shohamy, Tel-Aviv University
Mary Spaan, University of Michigan, Ann Arbor
Bernard Spolsky, Bar-Ilan University
Lynda Taylor, University of Cambridge
Jacob Tharu, Central Institute of English & Foreign Languages, Hyderabad
Piet van Avermaet, Katholieke Universiteit, Leuven
Yoshinori Watanabe, Akita University
Cyril Weir, University of Surrey, Roehampton

Production Editor: Cindy Capitani, Lawrence Erlbaum Associates, Inc.

First published 2004 by Lawrence Erlbaum Associates, Inc.
Published 2017 by Routledge
2 Park Square, Milton Park, Abingdon, Oxon OX14 4RN
711 Third Avenue, New York, NY 10017, USA

Routledge is an imprint of the Taylor & Francis Group, an informa business

Copyright © 2004, Lawrence Erlbaum Associates, Inc. Special requests for permission should be addressed to the Permissions Department, Lawrence Erlbaum Associates, Inc., 10 Industrial Avenue, Mahwah, NJ 07430-2262.

All rights reserved. No part of this book may be reprinted or reproduced or utilised in any form or by any electronic, mechanical, or other means, now known or hereafter invented, including photocopying and recording, or in any information storage or retrieval system, without permission in writing from the publishers.

Notice: Product or corporate names may be trademarks or registered trademarks, and are used only for identification and explanation without intent to infringe.

ISSN 1050-8406
ISBN-13: 978-0-8058-9525-4 (pbk)

LANGUAGE ASSESSMENT QUARTERLY An International Journal

Volume 1, Numbers 2&3, 2004

SPECIAL ISSUE: The Ethics of Language Assessment
Alan Davies, Guest Editor

EDITORIAL ............ 95
Carolyn E. Turner

ARTICLES

Introduction: Language Testing and the Golden Rule ............ 97
Alan Davies

Thinking About a Professional Ethics ............ 109
Sharon Bishop

Stakeholder Involvement in Language Assessment: Does it Improve Ethicality? ............ 123
Rama Mathew

A Code of Practice and Quality Management System for International Language Examinations ............ 137
Piet Van Avermaet, Henk Kuijper and Nick Saville

The Role of a Language Testing Code of Ethics in the Establishment of a Code of Practice ............ 151
Randy Thrasher

Using the Modern Language Aptitude Test to Identify a Foreign Language Learning Disability: Is it Ethical? ............ 161
Daniel J. Reed and Charles W. Stansfield

Ethical Considerations in the Assessment of the Language and Content Knowledge of U.S. School-Age English Learners ............ 177
Alison L. Bailey and Frances A. Butler

Filmic Portrayals of Cheating or Fraud in Examinations and Competitions
Victoria Byczkiewicz

LANGUAGE ASSESSMENT QUARTERLY, 1(2&3), 95
Copyright © 2004, Lawrence Erlbaum Associates, Inc.

INTRODUCING OUR GUEST EDITOR

Ethics has taken on an increasingly visible role in the social sciences. With it have come challenges to professional organizations as they seek to discuss, engage in, and finally come to consensus concerning guidelines for their respective fields. This special issue is devoted to ethics in the field of language assessment.

We are privileged to have Alan Davies as the guest editor. Alan has dedicated a good part of his career to helping us identify critical ethical issues and to initiating forums where these issues could be discussed and debated. One example is the symposium on the ethics of language testing that he organized at the 1996 meeting of the Association Internationale de Linguistique Appliquée in Finland. Soon thereafter, Alan served as guest editor of a special issue of Language Testing in 1997, which presented articles based on that symposium. In addition, Alan was a principal figure in writing the Code of Ethics of the International Language Testing Association. A final example is that he was the keynote speaker at the Language Assessment Ethics Conference held in Pasadena, California, in May 2002, on which this special issue is based.

On behalf of the LAQ editors and all of those reading this issue, I would like to thank Alan Davies for sharing his insight and serving as the guest editor of this issue. His ongoing concern is a reminder to our profession of the importance of identifying common ground concerning ethics in language assessment. In the following "Introduction: Language Testing and the Golden Rule," Alan introduces his perspective on ethics in language assessment and previews the articles in this special issue.

Please note that due to the length of this special issue, we have combined issues 2 and 3. Our next issue will be number 4.

Carolyn E. Turner
Associate Editor

LANGUAGE ASSESSMENT QUARTERLY, 1(2&3), 97-107
Copyright © 2004, Lawrence Erlbaum Associates, Inc.

Introduction: Language Testing and the Golden Rule Alan Davies

University of Edinburgh

Institutional concern for the ethics of language assessment is evident in the attention that the International Language Testing Association (ILTA) has given to its own ethical foundation and to how it presents that foundation to its stakeholders. ILTA has published the ILTA Code of Ethics (International Language Testing Association [ILTA], 2000) and is currently working on a code of practice. In this introduction, consideration is given to the need for ethics to be looked at in terms of the establishment of ILTA and the wider professionalism of the activity. Examples of the kinds of dilemmas that a code of ethics can help clarify are given and an introduction provided for each of the articles in this special issue.

This special issue of Language Assessment Quarterly is dedicated to the ethics of language assessment. Ethics has, of course, always exercised both practice and principles in assessment generally, given the role of assessment in selection and in determining high-stakes choices in people's lives. And because language is both knowledge and skill, language assessment is particularly vulnerable to ethical challenges. Recent engagement with the ethics of language assessment is evident in the attention that the International Language Testing Association (ILTA) has given to its own ethical foundation and to how it presents that foundation to its stakeholders. That engagement has led to the publication of the ILTA Code of Ethics (International Language Testing Association [ILTA], 2000) and to its current attempts to agree on a Code of Practice. No doubt ILTA, with this upsurge of interest in making its ethics explicit, has been influenced, as have other professional organizations, by the so-called ethical turn in postmodernism and in critical approaches to academic studies; in language assessment this has manifested itself as critical language testing (Shohamy, 2001). In what follows in this introduction, I consider why ethics needs to be looked at in social terms, which explains why ethics and professionalism are so closely connected. I then give examples of the kinds of dilemmas that a code of ethics can help to clarify. Finally, I introduce each of the articles in this special issue.

Requests for reprints should be sent to Alan Davies, University of Edinburgh, Department of Theoretical and Applied Linguistics, Adam Ferguson Building, 40 George Square, Edinburgh EH8 9LL, Scotland, UK. E-mail: [email protected]

NEED FOR A SOCIAL ETHICS

"What is it," asked Peter Singer (2002, p. 13), "to make moral judgements or to argue about an ethical issue or to live according to ethical standards? ... Why do we regard a woman's decision to have an abortion as raising an ethical issue but not her decision to change her job?"

The answer, Singer suggested, is indeed the golden rule; an ethical belief or action or decision is one that is based on a belief that it is right to do what is being done. The justification must be of a certain kind; self-interest alone will not do (e.g., I am writing a language test for migrant screening that I do not approve of, but I need the money). "Self-interested acts must be shown to be compatible with more broadly based ethical principles if they are to be ethically defensible" (2002, p. 14).

Incidentally, Singer dismissed the relativist argument on the grounds that although some societies or cultures may differ in their beliefs and practices, either these differences are trivial (e.g., a body ornament) or they are incommensurable. If Society A has slavery and Society B does not, on a relativist analysis there is nothing to say except that they are different. Ethics, in other words, must be universal: What is in dispute therefore is what counts as being ethical. What counts is what underlies social practices, the social content of ethics.

Ethics is not just a matter of individuals ventilating their feelings or prescribing what they would like to see happen. It is rather a set of social practices that has a purpose, namely, the promotion of the common welfare. Moral reasoning, therefore, is simply a matter of trying to find out what is best for everyone, achieving the good of everyone alike: the golden mean. Does it help or harm language testers to wear saucepans on their heads? If not (and if it has no effect either way on their professional work as language testers) then it has nothing to do with morality.

The argument for the social role of morality has been enriched by the increasing professionalizing of work domains in the late 20th century. Everyone wants, it seems, to be a professional. And one of the public demonstrations of a work domain as a profession is that it has an ethic that sets out (often in a published code) what its social practices are.
To an extent this professionalizing of everything and its concomitant explosion of ethical advertising through codes explains the advent and expansion of the applied ethics movement. Ethics was in demand everywhere, it seemed: in business, on the environment, on biomedical issues.

A major landmark in the development of a social ethics was the publication of John Rawls's (1972) A Theory of Justice. Rawls provided a variant on the familiar idea of the social contract. He argued on the basis of an "idealized original position" that (a) everyone should have the most extensive liberty compatible with a similar liberty for others, and (b) social or economic inequalities should not be permitted unless they work to everyone's advantage and are attached to positions open to everyone. From the point of view of justice, social institutions are acceptable only if they satisfy these principles. What Rawls (and more recently David Gauthier's "contractarianism") does is to finesse the traditional question about the objectivity of ethics (Gauthier, 1986). Morality is a rational enterprise. Certain goods, it is true, cannot be obtained without social cooperation, and therefore rational self-interested people will be motivated to obtain these goods. This cooperation will involve accepting rules that constrain behavior. If this is what moral rules are like, then it is easy to explain their rationality and objectivity without resorting to any strange or mystifying conception of objective rules.

CODES OF ETHICS AND OF PRACTICE

The ILTA Code of Ethics consists of a preamble and nine fundamental principles, which

generally clarify through a series of annotations the nature of the principles; they prescribe what ILTA members ought to do or not do, or more generally how they ought to comport themselves or what they, or the profession, ought to aspire to; and they identify the difficulties and exceptions inherent in the application of the principles. (ILTA, 2000)

The annotations further elaborate the code sanctions, making clear that failure to uphold the code may have serious penalties, such as withdrawal of ILTA membership on the advice of the ILTA Ethics Committee. And yet many feel that a code of practice is needed in addition to the Code of Ethics. Why is this so? Although the principles in the ILTA Code of Ethics are not concerned with detail, they are supported by the explanatory annotations. Nevertheless, the requirements of local practice and the demands of particular cultures lie behind the search for a code (or codes) of practice. How much a code of practice will help is unclear.

ETHICAL DILEMMAS

Let us take some hypothetical examples from language testing that might face ILTA members. Principle 3 of the Code of Ethics states: "Language testers shall adhere to all relevant ethical principles embodied in national and international guidelines when undertaking any trial, experiment, treatment or other research activity" (ILTA, 2000). The 11th annotation reads: "Publication of research reports shall be truthful and accurate" (ILTA, 2000). It is not difficult to think of circumstances in which researchers might be required by their government funding bodies to be economical with the truth, for example, when it is government policy to encourage a marginalized group (such as aboriginal children) to believe they are achieving when they are not by withholding poor test results. If a code of ethics does not provide guidance on how to act in such a situation, to what extent could there be common agreement about a code of practice?

Principle 4 of the Code of Ethics reads: "Language testers shall not allow the misuse of their professional knowledge and skills, in so far as they are able" (ILTA, 2000). The second annotation reads: "Non-conformity with a society's prevailing moral, religious etc values or status as an unwelcome migrant, shall not be the determining factor in assessing language ability" (ILTA, 2000). This appears straightforward enough. And yet, colleagues who live and work in totalitarian regimes may find their professional activity so constrained that they are compelled to exercise their knowledge and skill in the pursuit of what seems to them immoral ends, for example, excluding political refugees on the basis of their inadequate language proficiency. (The long experience in Australia of the use of the infamous Dictation Test [Davies, 1997a], which was used precisely to exclude undesirable aliens, reminds us how easy it is for professionals to acquiesce in dubious ethical practices.) What is at issue here, we should note, is not so much a cross-cultural dispute, because the member at risk who is under compulsion does not disagree with the ILTA code but is being compelled locally to disagree.
What is more difficult to resolve is when there is a genuine cross-cultural issue, that is, when there is principled disagreement between the ILTA Ethics Committee and the local member who may, in this instance, maintain that to use his or her knowledge and skills to develop a language test that excludes refugees is not inappropriate, not unethical.

I want now to take two real-life examples of what we may regard as ethical dilemmas in language testing (Davies, 1997a). They concern the use of a language test in immigration procedures. My first example is the International English Language Testing System (General Training) [IELTS] that is currently used as part of the immigration procedures in Australia and New Zealand. Now it is plausible that some language testers, perhaps some who have been involved in IELTS developments, would not approve of the use of an English language requirement for immigration vetting and selection. Here we can appeal to the profession. We can point out that the profession in general would probably not disapprove of such a test use. The operation has legitimacy: IELTS is an admired instrument, and the activity is conducted seriously and professionally, it might be argued. Individual language testers who object to such immigration uses are free, on the grounds of conscience, not to take part. What matters, I suggest, is that there is a social decision to be made: Individuals do, of course, have the right to influence colleagues and change their views, but what is ethical is what the profession decides, and is prepared to present a reasoned case for.

My second example is also about Australian immigration, the notorious Dictation Test. When the States of Australia (New South Wales, Victoria and so on) federated as the Commonwealth of Australia in 1901, one of the first acts passed by the new federal parliament, in pursuance of the "White Australia" policy, was the Immigration Restriction Act (1901). Under the terms of this act, immigrants to Australia were required to pass a Dictation Test.

Masquerading as an examination in literacy, the aim of this test was in fact explicit racial exclusion. Applied at the discretion of federal immigration officers, it originally required the correct transcription of 50 words in any European language, later extended to "any prescribed" language. Within the period 1902 to 1946 rigorous application of this test ensured only around 125,000 members of the so-called "alien" races were admitted to Australia. (York, 1993, p. 6)

Archived documents of the Commonwealth Migration Department make it clear that the sole purpose of the Dictation Test was exclusion:

Sections 3(1)(a) and 5 of the Immigration Act 1901-1949 authorise the administration of the Dictation Test. Section 3(1)(a) is designed for use on arrival to prevent the entry of persons who are not specifically restricted under the remaining sub-sections of Section 3 and whom it is desired to exclude. (Australian Archives, 1956)

That there was no irony in the interpretation of the intention of the act ("to prevent the entry of persons ... whom it is desired to exclude") is made clear by the later Clause 6 of the instructions:

Before the test is applied, endeavours should be made to ascertain in what language the person concerned is literate and a passage of not less than 50 words in a European language with which he is not [italics added] conversant should be selected. (Australian Archives, 1956)

This is reminiscent of those fairy tales in which the prince is given an impossible task to perform, but the difference of course is that in fairy tales, the prince always succeeds. This was no fairy tale: The "person concerned" always failed. Even if by some fluke he or she was conversant with the language of the Dictation Test administered, or if there was an appeal on some technicality (such as lack of fluency of the dictation officer) "the Department is quite at liberty to give another test and eventually secure conviction" (Australian Archives, 1956). At the heart of the regulations was a profound cynicism (shared, it appears, by the language consultant, the "Professor of Languages at the National University"). For legal reasons, dictation tests using English were for a period not allowed, on the grounds that there was some doubt as to whether English should be regarded as a "European language." However, as Minute 4 of the Commonwealth Immigration and Multicultural Affairs Committee (CIMAC) meeting of July 26-27, 1956 states:

It would be much more convenient and less productive of litigation, if English could be used with certainty. The test has very rarely to be used for the deportation of English speaking people, so that the use of English would effectively deal with the great majority of cases. (Australian Archives, 1956)

The Scotsman newspaper for 21 April 1999 carried a report in Gaelic (in its weekly Gaelic column) on the Dictation Test. An English insertion reads thus:

In 1934 the Australian authorities tried to use their Immigration Act, which stated that visitors could be given a Dictation Test in any European language, to keep Czech communist Egon Kisch out of the country. Kisch was tested in Scottish Gaelic. He duly failed, but to the chagrin of Australia's Gaelic speakers, his appeal was upheld on the grounds that Gaelic was not a language. (p. 21)

The professor of languages at the National University was the consultant who set (or approved) the texts for dictation. What seems very clear is that he must have known that this test, in its various versions, was intended to exclude. It was not a proficiency test; it was a means of exclusion, hypocritically masquerading as a language test. Was the professor of languages behaving ethically? Surely not. If ILTA had existed and he had been a member, then he would have been a candidate himself for exclusion.

The two cases are superficially alike, but they are in fact quite different. The reason has already been suggested. The use of IELTS is acceptable because it is professionally developed and administered. The use of the Dictation Test is not acceptable because it was a trick, and the professor of languages at the National University was himself giving face validity to shonky behavior that cannot be condoned. The question I suppose hangs in the air: did he know? Surely in his case, as a professional linguist, if he did not know, he should have. The ILTA Code of Ethics Principles 1 and 4 clearly apply:

1. Language testers shall have respect for the humanity and dignity of each of their test-takers. They shall provide them with the best possible professional consideration and shall respect all persons' needs, values and cultures in the provision of their language testing service.

4. Language testers shall not allow the misuse of their professional knowledge or skills, in so far as they are able. (ILTA, 2000)


And Principles 8 and 9 are also somewhat relevant:

8. Language testers shall be mindful of their obligations to the society within which they work, while recognising that those obligations may on occasion conflict with their responsibilities to their test takers and to other stakeholders.

9. Language testers shall regularly consider the potential effects, both short and long-term, on all stakeholders of their projects, reserving the right to withhold their professional services on the grounds of conscience. (ILTA, 2000)

It is precisely to encourage ethicality among language testers that ILTA developed its code of ethics. No doubt it is because codes do not provide clear-cut guidance in such matters that Beyerstein comments: "Many people become cynical about Codes of Ethics, because they suspect that such codes are of no help in resolving moral dilemmas. ... Codes of ethics exist primarily to make professionals look moral" (1993, p. 417). It may be that our concern with codes represents a kind of false consciousness of professional activity, a kind of flag of convenience to justify the public and professional claims of an activity in relation to society at large. Even so, a code of ethics can provide guidance to individual professionals and to ethics committees, announce to the public the profession's intentions, and inform other professionals as to what they may expect.

Perhaps the basic problem with drafting and reaching agreement on codes is that by their very nature they subtly move what is intended as an ethical framework to a legalistic one. But an activity that lacks Banks's three criteria (single form of activity and one basic qualification, one type of work, strong organization) has no mechanism for reaching a legalistic accord, especially when the attempt is made to impose a code of practice internationally (Banks, 1995).

It seems clear that no single code of practice can ever lay down rules to accommodate all possible circumstances, all contingencies, in all situations worldwide. Decisions as to what to do in particular circumstances have to be made by the individual professional in the light of local conditions. In other words, what a universal code of practice does is exactly what a code of ethics does: it offers general advice to the profession. The next stage therefore would seem to be to develop a variety of codes of practice. Although the code of ethics provides for universality, the codes of practice offer cross-cultural interpretations. This accords with the argument of Bauman (1993) that whereas modernism provided a space for the ethics of distance, postmodernism offers the opportunity to develop a morality of proximity. Thus for ILTA the postmodern turn moves us on from the modernist code of ethics to the postmodern codes of practice. But we need to proceed with caution.


Reporting on work toward the formulation of a code of practice for Japan, Ross indicates some of the difficulties of a relativist approach:

An embrace of postmodernism as the philosophical basis of Japanese language testing may well bring with it a rationale for no modernization of language testing practice in Japan. Modernisation could simply be seen as yet another instance of Western cultural hegemony. Herein lies a great irony. ... If postmodernism justifies cultural relativism, then it rationalizes the pre-modern practices so commonly found in Japan. Calls for greater ethical use of tests and less compliance with the interests of testing as "big business", reflecting the Western view that the individual is the potential victim of institutionalized power of international testing agencies (Lemann, 1999), seem not to be so relevant to the Japanese ethos. (Ross, 2001, p. 7)

In other words, the relativist (and postmodern) legitimizing of local testing practice, which we are advocating as in principle the ethical way forward, may be in conflict with the equally local desire to modernize testing procedures. What the profession must do is to make a clear separation between what is of ethical importance and what is culturally distinct. Discussing the cultural differences argument, Rachels (1998) offered the following premise and conclusion: (1) different cultures have different moral codes, and (2) therefore there is no objective "truth" in morality; right and wrong are only matters of opinion, and opinions vary from culture to culture. Rachels is dismissive: "the argument is not sound. The premise concerns what people believe ... the conclusion ... concerns what really is the case" (p. 551).

A profession becomes strong and ethical by being professional (Davies, 1997b). What a code of ethics does is to remind us of what we already know: that we are a serious organization, committed to a social purpose and to working with colleagues to determine how best to achieve that purpose. We do well to spell out in the code of ethics what this means, but there is something to be said for the conclusion that Alderson, Clapham, and Wall (1995) came to, that being ethical in language testing could be guaranteed by the traditional precepts of reliability and validity.

THE SPECIAL ISSUE

The articles that follow in this special issue include six revised versions of articles given at the Language Assessment Ethics Conference held in Pasadena, California, in May 2002,¹ plus one written for this volume by Victoria Byczkiewicz.

¹The conference was organized by Antony Kunnan based on funding from a TOEFL grant he received for the purpose in 2001.


Sharon Bishop (“Thinking About a Professional Ethics”), herself a moral philosopher, considers “the way ethical and professional ends can come together” (this issue, p. 110). And she points out that “Principles ... are not enforceable as they stand, not because they are merely aspirations, but because they are too general and unspecified” (this issue, p. 117). This is exactly the dilemma facing ILTA in its attempt to specify just how to operationalize a code of ethics through one or more codes of practice. Rama Mathew (“Stakeholder Involvement in Language Assessment: Does it Improve Ethicality?”) examines the issue of language testing ethicality in terms of its “implementation in policies and activities in daily life” through her attempts to involve stakeholders in language assessment. She describes two projects in which she has been involved, the first involving the Central Board of Secondary Education, a national India-wide examination authority, the second the National English Language Testing Service test (also intended to be India-wide). In the course of these projects she became aware of the difficulty of determining precisely who the stakeholders were and of the need to examine “the role of stakeholders vis-a-vis their relationship with one another and ... [to prioritize] their concerns on the basis of a relative weight assigned to them.” Involving stakeholders seems such a forward-looking idea, but Mathew reminds us just how difficult it is: “There are,” she remarks, “unresolved issues.” Piet Van Avermaet, Henk Kuijper, and Nick Saville (“A Code of Practice and Quality Management System for International Language Examinations”), presenting as members of the Association of Language Testers of Europe (ALTE), examine in some detail the crucial role of one type of stakeholder, those involved in quality management of language assessment. 
They explain how ALTE has helped ensure this through agreement on “Principles of Good Practice for ALTE Examinations” (2001), which is based on the mechanism of the system itself and on “awareness raising” (Van Avermaet, Kuijper, & Saville, this issue, p. 144). To an extent this reminds us of the concept of “virtue ethics” (Black, 1999).

Randy Thrasher (“The Role of a Language Testing Code of Ethics in the Establishment of a Code of Practice”) describes his experience of operationalizing the ILTA Code of Ethics in the context of a specific culture, the attempt, in other words, to develop a particular code of practice that has relevance within its own society but not necessarily externally. Thrasher’s purpose is to develop a code of practice that both fits the needs of the testing world in Japan and has some hope of being accepted not only by the members of the language testing association but by the test writing and test using community in Japan. Thrasher wryly remarks that there is no “simple relation between a code of ethics and a code of practice” (this issue, p. 153). For ILTA, the efforts of the Japan Language Testing Association are of great interest because they could show the way, first, toward the development of in-country codes of practice and, second, toward finding out just how possible it would be to instantiate a universal code of practice.

106

DAVIES

Dan Reed and Charles Stansfield (“Using the Modern Language Aptitude Test to Identify a Foreign Language Learning Disability: Is it Ethical?”) take up the dilemma of “unintended consequences.” Writing as test publishers, they consider the issue of ethicality of the unintended use of a language test, in their case, the Modern Language Aptitude Test (MLAT) used to assess foreign language learning disability. Pointing out that “there is a specific cognitive basis to the language aptitude construct,” they argue that “a measure of language aptitude is crucial in the diagnosis of a foreign language learning disability” and that therefore extending the use of the MLAT beyond what its author, J. B. Carroll, intended is legitimate and ethical. This is an interesting and plausible argument and avoids the “torturer’s dilemma” whereby, say, a proficiency test is used for shibboleth-like purposes.

Alison Bailey and Frances Butler (“Ethical Considerations in the Assessment of the Language and Content Knowledge of U.S. School-Age English Learners”) also take up the issue of testers’ responsibility for consequences, arguing that because “the constructs being measured in the case of EL [English language] students are ... likely different from the constructs being measured for English proficient students ... [it is] therefore invalid to conclude that their performance on these assessments is in any way meaningful” (this issue, pp. 180-181). And they say that because testers must indeed be aware of such test misuse and misinterpretation, they can be accused of a breach of ethicality. Once again we are brought up against the lack of fit between a code of ethics and a code (or codes) of practice. As Bishop comments, “Principles ... are too general; they lack the specificity necessary to act on them”; what Bailey and Butler appeal for is a focus on the specific needs of EL students.
Considering three filmic portrayals of ethical dilemmas in education, Victoria Byczkiewicz (“Filmic Portrayals of Cheating or Fraud in Examinations and Competitions”) contends that differences between the affluent and the economically disadvantaged are incommensurable and that it may therefore be “reasonable to suggest that cheating on a high-stakes examination in America comes about as a result of a fiercely competitive culture in which opportunities are unevenly distributed.” Again, there is a plea for specificity, an extreme plea because it proposes that cultural differences even within one society are so great that they cannot be reconciled. Byczkiewicz does not leave it there, however, concluding that there are, nevertheless, “fundamental values that we dearly uphold.”

Arriving at and agreeing on those fundamental values is central to the professional path that ILTA has set out on in its work on codes of ethics and of practice and in its efforts to determine just how its code of ethics can be matched by a code (or codes) of practice that will provide the professional guidance that ILTA members desire. We turn now to the articles on the ethics of language assessment, versions of which were presented at the Pasadena conference in May 2002.


REFERENCES

Alderson, J. C., Clapham, C., & Wall, D. (1995). Language test construction and evaluation. Cambridge, England: Cambridge University Press.
Australian Archives. (1956). A446/85 Box: ACT1 52 A 2483 1-2483 6 66/45500-66/52200.
Banks, S. (1995). Ethics and values in social work. London: Macmillan.
Bauman, Z. (1993). Postmodern ethics. Oxford, England: Blackwell.
Beyerstein, D. (1993). The functions and limitations of codes of ethics. In E. R. Winkler & J. R. Coombes (Eds.), Applied ethics: A reader (pp. 416-425). Cambridge, MA: Blackwell.
Black, C. (1999). In J. Rachels (Ed.), The right thing to do: Basic readings in moral philosophy. Boston: McGraw-Hill.
Davies, A. (1997a). Australian immigrant gatekeeping through English language tests: How important is proficiency? In A. Huhta, V. Kohonen, L. Kurki-Suonio, & S. Luoma (Eds.), Current developments and alternatives in language assessment (Proceedings of LTRC 96) (pp. 71-84). Jyväskylä, Finland: University of Jyväskylä Press.
Davies, A. (1997b). Demands of being professional in language testing. Language Testing, 14, 328-329.
Davies, A. (1997c). Introduction: The limits of ethics in language testing. Language Testing, 14, 235-241.
International Language Testing Association. (2000). Code of ethics for ILTA. Author. (Reprinted in Boyd, K., & Davies, A. (2001). Doctors' orders for language testers: The origin and purpose of ethical codes. Language Testing, 19, 296-322)
Lemann, N. (1999). The big test: The secret history of the American meritocracy. New York: Farrar, Straus & Giroux.
Rachels, J. (1998). The challenge of cultural relativism. In S. M. Cahn & P. Markie (Eds.), Ethics: History, theory and contemporary issues (pp. 548-557). Oxford, England: Oxford University Press.
Rawls, J. (1972). A theory of justice. Cambridge, MA: Harvard University Press.
Ross, S. (2001). Prolegomenon to a code of practice for language testing in Japan. Unpublished manuscript.
Shohamy, E. (2001). The power of tests. Essex, England: Longman Pearson.
Singer, P. (2002). Writings on an ethical life. New York: HarperCollins.
York, B. (1993). Admitted 1901-1946: Immigrants and others allowed into Australia between 1901 and 1946 (Studies in Australian Ethnic History No. 2). Canberra, Australia: Australian National University, Centre for Immigration and Multicultural Studies.

LANGUAGE ASSESSMENT QUARTERLY, 1(2&3), 109-122 Copyright © 2004, Lawrence Erlbaum Associates, Inc.

Thinking About a Professional Ethics

Sharon Bishop

California State University, Los Angeles

The article begins by considering some of the factors that make thinking about ethical issues difficult, especially as they come up in language assessment. It discusses the idea of a professional ethics as a guide for ethical decision making and rejects virtue and utilitarian accounts in favor of a Kantian perspective. A Kantian perspective is favored in part because it encourages us to think about constructing a professional ethic as an object that helps further complex ethical ends. The ethical code of the American Psychological Association is discussed critically as a means of seeing how to understand differences between professional aspirations and principles that set requirements. I conclude with a brief discussion of 2 issues of concern for language assessment, educational accountability and immigration policy.

Requests for reprints should be sent to Sharon Bishop, Department of Philosophy, California State University, Los Angeles, CA 90032-8143, USA. E-mail: [email protected]

I would like to begin my article by explaining its rather general title: “Thinking About a Professional Ethics.” In a sense the title, and the subject it encloses, were forced on me by my own limitations. I am by trade a moral philosopher; moreover, I am a moral philosopher who knows virtually nothing about the profession of language testing or about the consequences and side effects of testing. So, you might ask, what if anything have I to say about the ethical responsibilities that face or should face members of your profession? The answer, obviously, is that I can say very little about the concrete ethical issues that must confront you in the course of your professional duties. So what, in fact, can I talk about? And in particular, what can I talk about in a way that might establish a relation between my own discipline as a moral philosopher and your discipline, of which, I repeat, I know very little? Well, two things come to mind.

The first has to do with the difficulty one always encounters when trying to define the space in which ethical thought and ethical discourse can most productively take place. That is, how does one define the boundaries in which ethical thinking best occurs? Without establishing such boundaries, responsible thinking is nearly impossible. Unfortunately, laying out a relevant space for ethical thought is often difficult to accomplish, but I do have some ideas I want to share with you as to the nature of these difficulties and how they can be minimized.

The second area I would like to discuss has to do with the way ethical and professional ends can come together within this defined space. In other words, I will try to explain the manner in which philosophical ethics, my discipline, can enter into a fruitful dialogue with the ethical responsibilities faced by a specific profession, in this case your profession. I begin with the difficulties ethical thought encounters in the professional world.

SOME DIFFICULTIES FOR ETHICAL THOUGHT

Although there is a welcome resurgence of serious thinking about the ethical aspects of various professions, as, of course, your own conference illustrates, it is also clear that intellectual and political currents of the day complicate issues. For instance, say someone in authority or a position of power wants a test that will be used to sort students into academic programs. The test, you are told, may also be used to establish a minimum standard for university admission. You may also be told that the test might be employed to select among potential immigrants. And in actual fact, there may be other uses to which your test is applied that you know absolutely nothing about. Nevertheless, you are assigned the task of constructing a good test, and you proceed to do so. After all, you are not in charge of the admissions policy of your university or the immigration policy of your country. Still you worry about how your test will be used, how prospective students taking it will be treated, or how it affects who gets legal admission to your country.

Of course, you could just follow the line of thought that there is a division of labor between test constructors, admissions officers, and legislators and administrative officials. This would let you consign your worries about the tests and their uses to others so that you are off the hook. But the difficulty is that you are not just test constructors but also citizens of your country and members of a profession. In those roles you have a stake in the fair treatment of your fellow citizens and in how your work is used. But what makes for fair treatment and the legitimate uses of your work? And is it not naive or overly idealistic to think that fairness has a role in your activities? What makes for fairness anyway? And how far do your responsibilities extend?
I am imagining that these are questions that come up for you and for others who find themselves involved in activities that in some way or another they believe are worth devoting themselves to. What I want to talk about now is a way of thinking that enables those who care and are concerned to make room for their questions, and for answers that have an ethical or moral dimension.


SOME POSSIBILITIES FOR A PROFESSIONAL ETHICS

One way of thinking of a professional ethics is that it assembles guidelines for action. The ethical code of the American Psychological Association (APA), for instance, presents itself in just this way, saying that the code “is intended to provide both the general principles and the decision rules to cover most situations encountered by psychologists” (1992, p. 3). One question we may wish to discuss in time is whether or to what extent it makes sense to think of a professional ethics as providing guidelines that are nearly complete in the sense that they cover most situations encountered by the profession in question. But right now, I want to turn to some of the ways philosophers have thought about guides to acting ethically.

There are several philosophical traditions about ethical guidelines that are worth looking at. One stems from ancient Greek philosophy, especially Aristotle, and it urges two related modes of action: one, that you do the right thing, in the right way, at the right time, or, two, that you act the way a good or virtuous person would act in a given situation. The advantage of the first formulation is that it calls attention to how complex ethical decision making can be. It is not just a matter of doing it right, but also of timing and manner of action. The disadvantage is that the formulation gives no guidance because it gives no clues about how to figure out what is right or what counts as the right manner or time. The second formulation helps a bit with guidance because it invites us to pick out a good person, imagine how that person would do it, and match that behavior. What it does not tell us, however, is how to pick out a good person. It is always important to keep in mind that ethical judgment is complex, and although asking how a good person would decide an issue is sometimes a useful heuristic, I doubt that this way of proceeding helps as it stands.
What it is missing is some substantive characterization of what makes for the ethical adequacy of guidelines. And this is a task that early modern moral philosophers took on. From these early modern developments, two great traditions emerged, and these two are at the center of much philosophical and cultural debate down to and including our own time.

One tradition holds that the most general ethical guideline directs that we act in whatever way will produce the greatest amount of satisfaction or utility generalized over everyone affected by it. In this tradition, familiarly known as utilitarianism, ethics just discovers the most effective means to the end of achieving the greatest satisfaction, and satisfaction is taken as the only relevant ethical end. Theoretically, this line of thought gives us a way of remaining hard-nosed scientists and at the same time being “ethical” because satisfactions can be measured and quantified, at least in principle. In this system, ethical thinking is reduced to working out the means to the end of achieving maximum satisfaction. Practitioners in many disciplines find it much more comfortable to talk about the consequences of activities in terms of satisfactions or utilities than in terms of rights violated, freedoms infringed, or dignity compromised. And it certainly seems far easier to quantify consequences than it is to have confidence in conclusions concerning equal rights and human dignity.

To my mind, however, the utilitarian view suffers from numerous problems that make it a less than promising way to pursue the idea of a professional ethics. For one, it obscures the fact that there are ethically worthy ends other than satisfaction. It seems a hallmark of the professions that they are in one way or another devoted to something that is seen as worthy in its own right. Professional “satisfaction” comes from successes regarding these ends. But success is not measured by the satisfaction; rather it is measured by the achieving of worthy ends. Second, in traditional utilitarianism, all satisfactions are on a par. Satisfaction in dominating others is no different in principle from satisfaction in helping others. This aspect of the utilitarian tradition is nicely illustrated in a remark by Robert Bork, in his unsuccessful bid for a seat on the Supreme Court, in which he argued that there is no principled difference between a married couple’s preference for privacy and a gas company’s preference to pollute. Third, the utilitarian perspective has too narrow a view of equality. Because each person counts according to the satisfaction she has, the distribution of resources follows lines of satisfaction rather than the capacities of separate persons. It is more important to be concerned with making equal contributions to the capacities of individual persons than to their levels of satisfaction, or so I would argue.

A great strength of utilitarian thought is its theoretical simplicity, but that very strength is a fault in working out something such as a professional ethics, which has to deal with more narrowly drawn value commitments. It is hard to imagine, for example, that we would get very far in drawing up guidelines for a profession devoted to education by focusing on satisfaction. No doubt education is related in some way to satisfaction, but it would be odd to think that it is satisfaction simpliciter that is aimed at.

A second tradition from the early modern period stems from Immanuel Kant. He is familiar to us all as the philosopher who turned the golden rule into the categorical imperative. Supposedly that requires us to determine ethical conduct by asking whether we would be willing to accept the principle on which we propose to act as a universal law. Although the categorical imperative has been of enormous importance in modern ethical thinking, I suspect that as a guide to moral action it cannot really be instrumental in developing an adequate professional ethics. But there are some other aspects of Kant’s ethical thought that I believe can be very helpful in thinking about a professional ethics—particularly in helping to define a space for a professional ethics.

KANTIAN PERSPECTIVE FOR PROFESSIONAL ETHICS

Unlike the utilitarians, Kant drew a sharp distinction between practical and theoretical reason. For him, theoretical reason had the task of understanding objects as they are given to us, whereas practical reason had the task of constructing objects according to a conception of them. A view of this sort about practical reason is called constructivism. To illustrate what I think he had in mind: as language testers your interests are those of theoretical reason insofar as you want to measure language facility, that is, you want to measure an object that is given—it just happens to be degree of language facility. But as you think of a way to measure that facility and think about the appropriate uses of those measures, you are thinking about how to construct objects according to a certain conception. In the first case, the conception is that of a test that really does measure language facility—that is, a test that is reliable and valid. In the second, the conception is that of the responsible or ethical use of tests. For Kant, both inquiries were inquiries of practical reason, but it is only the second that would harbor a full conception of ethical thinking about language testing.

One thing to note here is that Kant was thoroughly contemporary in thinking that guides to ethical action are constructed. They are not something that we find or discover in the world. We form an idea of a certain kind of object—say a system of education that embodies a political ideal of equality—and we do the best we can to construct that system.

Another aspect of these objects constructed by practical reason is that they are conceived of as being able to affect the way institutions and people develop. Kant’s idea was not just that as practical reasoners we construct objects but also that in constructing them we think of them as objects that change and shape what happens. They make possible what was not possible before. Kant’s famous example here is that ownership or property is not possible without laws that create it. Without laws specifying the conditions of ownership, a person could be in possession of objects but could never own them.
(Think here that we speak of people possessing illegal drugs, but not of owning them.) But Kant’s clearest discussion of how constructed objects can shape what happens is evident in his treatise Toward Perpetual Peace (1996b), where he discussed how the widespread creation of republican forms of government can lead to an indefinitely long peace. He argued that the peoples in such regimes have no reason to initiate aggressive wars against one another. In a community of nations in which the only reason peoples have to go to war is self-defense, war becomes much less likely. And when peoples do go to war, Kant argued that their actions are to be constrained by the idea of reestablishing peace. The peace, in turn, is to be reestablished with an eye to making it last. That is, each time we are required to face a new situation, we are reminded that proceeding ethically requires us to constrain our actions by the thought of maintaining the conditions of ethical life. More generally, and more to our point, he saw the construction of an ethical way of life as a way of making possible a community that treats individuals with the respect and dignity due to persons, a community that he calls a kingdom of ends.


A third aspect of his thinking about morality is that it has a certain content or substance that limits what can count as an ethical value. Negatively, what has ethical value cannot be traded for anything of market value. For him, ethical thought began with the idea of an end that puts limits on market trades, satisfactions, and the way that we get them. To put it another way, ethical values constrain bottom-line thinking—unless, of course, the bottom line is an ethical line. As a result, there is a certain way in which ethical values are final ends; that is, they are thought of as valuable in themselves; they do not get their value because they are a means to some other end.

On the positive side, Kant believed that the fundamental values involved in constructing an ethical life are freedom and equality. These are the values required to create and maintain a kingdom of ends. Moreover, these ideas function as constraints on what kinds of institutions and treatment are legitimate, and they also serve as ideals to shape the way people live. Kant did not flesh them out in detail, perhaps because there is no way to do that independent of the historical circumstances and context in which practical reasoners live. But what he did suggest is that the ways in which those ideals get fleshed out have to be acceptable particularly to those who would be burdened by them because what makes constraints legitimate in the first place is that they would be accepted by reasonable people.

We can pull these aspects of Kant’s thought together by saying that ethical thought works out ways to respect the freedom and equality of persons. Further, it justifies burdens by showing that the person they fall on has grounds to be reconciled to them and no reasonable basis for objecting to them. This means that if I propose to do something that imposes a burden on someone, then I need to be thinking about whether the bearer of the burden has or would have reasonable grounds for objecting.
As a short aside, let me say that there is a lot of talk currently in philosophical circles and beyond that constructivism is incompatible with objective thought about values. But there seems little reason to be unduly worried about that in the way that Kant proceeded. In his view, as beings who have practical interests, we have or develop ideas about objects that we might have reason to construct. For example, we might have an idea of an object that keeps track of time. There are, as we know, many ways of doing this. It is also worth noting that there is no one way that is the best way. Given the object, there are obvious standards of accuracy and reliability that any adequate timepiece must meet. But beyond that there are better and worse ways of constructing it depending on the other interests we might have. Do we want something small enough to wear? Are we going diving or to a ball? Is it important to be able to see it at night? To keep track of seconds? Or do we want to be able to read it from across a large room? And if we do, then, of course, we cannot also have something that can be worn on the wrist. That reminds us that there is not a way to construct a timepiece that realizes all our interests; some will be lost.


So in setting out his idea of practical reason, Kant set us on the road to the idea that the values that guide practical reasoning are not to be found in nature, but are constructions that result from thinking about possible objects that we might work to build. It is also worth noting that so far there is no reason to see this view as giving skeptics or relativists hope. That is, there is no reason to think that objectivity is out of the question, which is to say that there is no reason to think that watches or their standards are not “objective.”

MORE ABOUT OBJECTS

It is perhaps not so hard to see the sense in which as language testers you construct objects. After all, you make up the test, and you can hold copies of it in your hand. But the idea of constructing an object may seem stretched if we think of morality as that object. If we recall, however, that these objects are constructed with the idea that they make sense as final ends that not only constrain our activities but also are capable of changing what happens, and that ethical objects are intended to respect the freedom and equality of persons, it may be easier to understand my point. In Kant’s view, the source of morality is the rational and free nature of human beings. But the objects of morality, that is, the objects at which the source aims, are the various principles and institutions that we construct to create a kingdom of ends in which our rational and free nature may be realized. And the ultimate aim of morality is to be able to live together in peace as free equals.

There are many examples we can think of in which standard moral principles serve as objects through which moral ends are realized. Take the principle that promises ought to be kept. It is an object through which we can realize the end of creating conditions of reciprocal trust in the promised outcome. Realizing our end in this way, the principle comes to have special status. Generally, it creates a way to do something, to bind ourselves to a future act so that someone else can count on us. If I have made a promise, I am limited in how or when I can pursue my own interests. More generally, the fact that I promised to address you today puts continuous limits on the use of instrumental or strategic reasoning, by which I mean that it is no longer open to me to reason that I would be happier if I played tennis instead. In addition, the course of human life is shaped through the use of the principle. When promises are kept, human life goes on as promised.
When they are broken, conditions are set for various new outcomes characteristic of the practice of promising: broken relations, apologies, legal action, and the like. Finally, the principle sets up the possibility of character traits associated with fidelity to the principle. There are those we can rely on as responsible to keep their word, those who are not reliable, and probably lots in between. Those we can rely on are likely to be people to whom keeping their promises matters.


Of course, the principle of promise keeping is not of special interest in thinking about ethical responsibilities for language testers, but it does illustrate how our institutions and practices can come to have special significance as moral objects. In keeping with this possibility, every system of professional ethics should be designed to function in two major ways. One, it gives the general ends of the profession, and, two, it sets out ethical principles for the profession. These do not specifically guide action, but they describe what the profession is devoted to and so are, in part, what constitutes the profession, just as the principle of promise keeping is a central part of what constitutes the practice of promising. In the next section, I look at the professional ethic of the APA to consider how close it comes to being an ethical object of the sort I have been talking about, and where and how it comes up short of that vision.

EXAMPLE OF A PROFESSIONAL ETHICS AND ITS POTENTIAL AS AN ETHICAL OBJECT

The APA’s professional ethic is the one I know best. I am struck both by its Kantian-like aims and by how fitfully they are realized. As I mentioned earlier, the APA document seeks to provide action guides for most situations encountered by psychologists. It explicitly distinguishes between the general principles that it presents as, one, goals to aspire to and, two, ethical standards that provide so-called enforceable rules (APA, 1992, p. 3). The most general goal is the “welfare and protection of the individuals and groups with whom psychologists work” (p. 3). Among the aspirational goals are principles of respect for the rights and dignity of persons. In this context, the APA document says, “Psychologists accord appropriate respect to the fundamental rights, dignity and worth of all people.” It goes on to say, “They respect the rights of individuals to privacy, confidentiality, self-determination, and autonomy” (p. 3).

Kantians would certainly endorse respect for the dignity of persons and their rights. However, treating these principles as aspirational goals misses the force of Kant’s claim that dignity is an end in itself. Such ends structure our reasoning and our attitudes in a way that is different from a goal that we aspire to. When we aspire to a goal, doors to compromise are left open. We adjust easily to failure to achieve a goal so long as we have tried to live up to it. Our goals are regarded as valuable to be sure, but to fail to reach one is not in general a grave fault. To fail to respect a person’s rights or dignity is, however, a grave fault. It marks a failure to be ethical; it misses the very bedrock of ethical life. In short, I must not aspire to be ethical, I must be ethical. The same problem arises with the APA's general principles regarding professional responsibility.
It is not merely a matter of aspiration that psychologists are to practice only in their areas of competence, or that they recognize the limits of their expertise, or that they not make false, misleading, or deceptive statements, and so forth (p. 3). Failures here are grave professional faults, which are flatly outside the constraints imposed by the ends of the profession. And professional ends, like ethical ends, constrain action because they are constitutive of professional practice. They create ends that I must realize, not aspire to. Like goals, they are intended to influence our lives on a continuous basis, but unlike goals they cannot be easily adjusted. If I fail to meet a weight loss goal, I may have missed achieving something that I very much wanted to do and be very disappointed at my failure. However, if I took my failure to show a lack of integrity or to be a source of guilt, I would at least show that I misplaced the role of weight loss in maintaining my integrity. I would not be received in the same way if I practiced outside the limits of my expertise or made false, misleading, or deceptive statements in practicing my profession. These are the kinds of failures that compromise my professional and personal integrity and are appropriate sources of guilt. In the same manner as with ethical imperatives, I must be professional, not aspire to be.

It is not just the APA that thinks of principles such as these as aspirations. Often human rights are talked about in the same way, particularly when institutions are criticized for failures to respect them. To quote a recent New York Times editorial, “In every peace agreement negotiated during the Oslo process, human rights were inserted as an afterthought, a politely worded aspiration [italics added] with no teeth and no enforcement” (Pacheco, 2002, p. 27). The editorial writer is, in effect, charging the negotiators with a bad faith adherence to human rights. And this points to a different problem about the general principles expressive of human rights that I wish to mention.
Up to now, I have been talking about a philosophical misunderstanding of the distinction between aspirations and principles. This new and different problem is endemic to general principles. Principles such as those of the APA document and the United Nations Declaration of Human Rights are not enforceable as they stand, not because they are merely aspirations, but because they are too general and unspecified. There are good reasons for this. The APA document and the United Nations Universal Declaration of Human Rights are more like design documents than blueprints. Because they are like design documents, they must be interpreted to realize them in the concrete world of everyday life. That is, they must be interpreted to give them the teeth required. And interpretations are subject to influence from all kinds of motives. Alas, general principles may be used opportunistically as window dressing to maintain a power base or to muddle a political process. The only corrective here is to interpret them under the motive created by the end or principle of the relevant human right. It is important to note that without the principle, it is not possible to have the ethical motive. And with the ethical motive, the principle will not be misinterpreted; it will find an expression. Whatever difficulties there are with that expression will be the result of failures to appreciate the reasonable interests of others or mistakes about the effects of a particular way of instituting rights.

118

BISHOP

This is not easy work. It requires operating from the stance set out by the principles and, at the same time, thinking carefully and imaginatively about the situation. Such thinking would have to include the likely effects of nearer term policies designed to implement the principles and the way these proposals affect particularly those who might be burdened by them. This kind of thinking is easily driven out by desires to maintain power or by the press of time, or by the common enough thought that talk of principles and rights is utopian and naive. Serious commitment to our professional and ethical ends requires thinking of them from the standpoint created by their general formulation and also attending to their implementation in policies and activities in daily life. That is what makes them real.

CONCLUSION

General professional and ethical ends and principles are both rough designs for practice and standards for criticism of existing practice. I want now to conclude with some critical comments about issues that I understand have been of concern to your organization. I offer the comments as a way of thinking about the objects that might be constructed to change educational accountability and immigration policy.

Educational Accountability

In some recent circumstances, educators and schools are being held "accountable." Let us assume that standardized tests are being used, as they sometimes are, to certify that schools or programs have been accountable. Failure to achieve an appropriate score can result in the shutting down of programs, a loss of funding, or even school closure. Here accountability is viewed as one of the objects I talked about earlier. It is a set of practices through which an end is to be realized. To put it another way, it is a specification of an end that is regarded as worthy of being realized.

Now, the first thing to note about this set of practices is that it is not the kind of accountability we are familiar with. Standard cases of accountability are, for example, like that in which trustees of estates are accountable for how funds are spent or invested. Or members of Congress are understood to be accountable to their constituencies, meaning that they may reasonably be asked to explain or justify their votes. The background assumption is that trustees or representatives understand the responsibilities of their positions and willingly take them on. Keeping a ledger sheet or explaining their votes to constituencies are part of the normal responsibilities of their positions. They are, of course, subject to penalties for abusing their positions. But it is not part of the background assumptions that those occupying the positions are abusing them. Penalties are set in place for the unusual circumstance in which responsibilities are abused. In this traditional notion of accountability, a person is only responsible to explain his or her actions and give good reasons for them.

The educational accountability project that is currently being pushed seems to make an entirely different background assumption; namely, that the persons involved in the schools or programs that do not achieve appropriate scores have somehow abused their positions. Because the described practices take failure to achieve the results as justification for imposing a penalty, they presuppose that the scores are a good measure of failures to meet responsibilities and that the low scores result from abuse of responsibilities rather than some other source. In some circumstances, these assumptions may be correct, but to justify the described practice of accountability, those who want to impose it need to provide evidence that there have been abuses of responsibility on the part of educators or schools. And the only evidence that has been forthcoming is circular: that student test scores are low. Obviously this does not provide an account of why they are low but just assigns responsibility to schools and educators. I do not mean to suggest here that educators and schools are without responsibility in this setting, but I do maintain that the described model of accountability is better seen as a model for abuse of responsibility.

A second difficulty with this model is that standardized tests are questionable measures of overall educational progress. There are numerous reasons for doubts. I set aside those having to do with differences in student abilities to perform on standardized tests. My worry is about whether such tests are adequate measures of how well public educational institutions are carrying out their role in educating individuals. I assume that the central end of our public system of education is to educate individuals so that they can participate meaningfully in the political, social and economic life of their communities.
A major part of this end is to help individuals acquire what they need in the way of skills, knowledge, sensibility and imagination to be responsible for their own lives, to participate in self-government and economic life, and to enjoy their culture. Having a certain set of skills or meeting content standards is certainly not sufficient for achieving these ends, and it is not even clear that it is necessary to achieve them. Persons with favorable family backgrounds can get very low scores and yet sometimes be economically independent, participate in political life, and enjoy their culture; at the other extreme, persons with very unfavorable backgrounds can have problems and yet achieve very high scores. Someone might object that I have described the ends of public education in such a general way that it would be impossible to design a measure of them short of measuring levels of success and satisfaction over a lifetime. The main thrust of my argument is, however, that those who want to use the tests need to make a better case for the relation between test performance and achieving the broader ends of public education. I suspect that multiple measures are better at making adequate assessments of these ends and that efficiency and quantifiability are probably the main reasons to use standardized tests.

A better model would interpose another step before imposing penalties or closing schools. It would also require a different rhetoric about the accountability process and a different step in response to standardized tests. Institutions of public education involve numerous stakeholders. For one, there is the public at large; there are also persons who have trained to serve as professionals in educational systems; and, of course, there are the parents and children who are the direct recipients of the benefits of the system. The public has a large investment in the system, and those who run and work in the system are not only invested in it but also accountable to fulfill their responsibilities. Broadly, the responsibilities of professional educators are to make a good faith effort to contribute to students' acquisition of skills, knowledge, and sensitivity to participate in and enjoy their culture and, at the same time, support themselves as independent adults. If those efforts are inadequate, then the proper first step is to look for the source of the failure. The role of tests would not be to determine who is failing in order to impose penalties or decrease funding, but to function in the diagnosis of problems that is part of a continuous improvement project.

A model such as this also has the advantage of creating the social space to respect and to make use of the professionalism of educators. It starts with the rebuttable assumption that those involved in the system are conducting business in the best way they know, and it enlists them in developing an understanding of the problems and in producing solutions to them. In describing this model, I am not so much arguing that it is a solution to an actual ethical problem as I am indicating something of what I take to be the appropriate starting place for an ethical solution to problems of accountability in education.
It is an approach that also acknowledges the prior investment in ongoing educational institutions—the public investment as well as that of individuals who have trained to become educators. Finally, it is based on skepticism about whether standardized tests alone are valid measures of whether schools are meeting educational ends.

Immigration

Finally, I want to turn to an ethical issue that some members of your group have raised about the use of language tests to determine immigration status. From the point of view of a professional ethics for language testers, this seems more difficult than the problems surrounding accountability or even admissions policy. The central reason for this is that educational ends are plausibly among your shared professional ends. And you have specialized expertise in what language tests show with respect to educational ends. By contrast, you do not have direct professional interests in immigration policy, nor do you have special expertise about immigration policy. Yet tests constructed and perhaps administered by language testers have been used to determine who is entitled to legal immigration. This has two implications for you as a professional. One is that your work may
be used for purposes that you believe unethical. Second, this is not necessarily a conviction that you will share with your professional colleagues. So, it is difficult to see how it could become a principle for your profession, however strongly you feel about it.

On the other hand, you may judge that the policy of excluding persons based on the fact that they cannot pass a test of English as a foreign language is an unjust policy. That is, your reasoning may include a less than favorable assessment of the history of immigration policy. There is considerable evidence that the policies in many liberal democracies are based on motives that could not be spoken publicly. That is, the operative reasons have been ones that would not be acceptable to all citizens of a democracy, much less to those who are excluded under the policies. They have been based on wishes to exclude those who are different rather than on grounds that anyone would have reason to accept or understand as legitimate grounds of exclusion (see Dummett, 2001). Second, you may recognize that you do have specialized knowledge that is relevant to this issue because you know more than most about what the test measures. In addition, you are very likely to be in a position to assess the importance of a given level of facility with a language in building a constructive life. Then you may decide to devote yourself to developing a more fully articulated position on immigration policy and language testing. Even though it is unlikely that you would be able to get a principle adopted among the standards for your profession, you would be developing positions that are of special interest to professionals because it is their tests that are being used to implement the exclusionary policy. You might also be doing what you could about the policy because you can do it, that is, you might take it as a final end that is worth devoting yourself to for its own sake or for the sake of justice.
The articulated positions you develop are the objects through which you realize your end. The position you arrive at does take into account the persons who are affected by immigration policy. Although it may involve serious dialogue with the various stakeholders, it is not merely the outcome of that dialogue. Rather it represents your judgment about whether there are reasonable grounds for objection to the policy from any of those affected by it, particularly those excluded by it.

Finally, I want to call attention to something that I think the two previously mentioned issues show us about the idea of a professional ethics. The APA document wanted to provide adequate guidelines for most situations that psychologists confront; but neither of the above issues involves commonly occurring situations. They are like a lot of troublesome ethical problems in that they arise from some new use of a technology or from a new technology. They require us to extend ethical thinking to issues and situations that we have not confronted before. And here it does seem important that we have a firm grip on what is of ethical significance rather than a set of principles that purports to tell us how to act in frequently occurring circumstances.

REFERENCES

American Psychological Association. (1992). Ethical principles of psychologists and code of conduct. Author.
Dummett, M. (2001). On immigration and refugees. New York: Routledge & Kegan Paul.
Gauthier, D. (1986). Morals by agreement. Oxford: Oxford University Press.
Kant, I. (1996b). Toward perpetual peace. New York: Cambridge University Press.
Pacheco, A. (2002, April 10). Life under siege. New York Times, p. 27.

LANGUAGE ASSESSMENT QUARTERLY, 1(2&3), 123-135 Copyright © 2004, Lawrence Erlbaum Associates, Inc.

Stakeholder Involvement in Language Assessment: Does it Improve Ethicality?

Rama Mathew
Delhi University

The debate on language assessment ethics in recent years points to the increasing awareness among professionals of the need for ethical behavior in language assessment. What is of concern, however, is that not all testers put ethical principles into practice. One dimension that could account for this lapse in professional standards is the noninvolvement, or lack of proper involvement, of stakeholders in language assessment. This article1 examines two testing situations in India and addresses the question of whether, and to what extent, involving stakeholders in language assessment improves ethicality.

The notion of stakeholder involvement as a dimension of professional ethics has been emphasized in recent discussions on ethics and fairness in testing. To understand the nature of a stakeholder approach to language evaluation vis-a-vis the issue of ethics, we need to locate it in the larger framework of ethical theory. Rawls (1999) considered justice as a virtue of social institutions. In a well-ordered institution, there is a public understanding of what is just and unjust; that is, the principles of justice underlie a "contractarian theory," which presupposes that the persons involved in the contractual situation know about themselves and their world. Fruitful cooperation toward a common goal is possible only if all the stakeholders are sufficiently equal in power and ability to guarantee that in normal circumstances none are able to dominate the others. If, on the other hand, the stakeholders are at different levels, for example, one is a "giver" and the other a "receiver," there is an unequal relationship, and it leads to results that are not for everyone's good.

Requests for reprints should be sent to Rama Mathew, Faculty of Education, 33 Chhatra Marg, University of Delhi, Delhi-110 007, India. E-mail: [email protected]
1My sincere thanks to Antony John Kunnan and Dan Reed for their very helpful comments on an earlier version of this article.

124

MATHEW

Rawlsian principles are very much Kantian in nature. Kant's (cited in Russell, 1945) ethical system focused on ideals of universal law and respect for others as the basis of morality. All men should count equally in determining actions by which many are affected. He states emphatically that virtue does not depend on the intended result of an action, but on the principle of which it is itself a result.

Freire (1972) also opposed the banking approach to education in which the educator's role is to regulate the way the world "enters into" the students. The dichotomy between man and the world revolves around how man is merely in the world and not with the world; man is spectator, not recreator; therefore man resists dialogue between the two. Problem-posing education, on the other hand, responds to the essence of consciousness, unveils reality, rejects communiques, and embodies communication.

In a more recent discussion on social research, Cameron, Frazer, Harvey, Rampton, and Richardson (1993) identified three frameworks for conceptualizing the relationship between the researcher and the researched: ethics, advocacy, and empowerment. In their discussion based on research projects, they found the empowerment framework least problematical, although to do social research "on, for and with" participants is certainly not a simple proceeding. For this, they suggested interactive and nonobjectifying methods to gain richer insights into participants' own understanding of their behavior and to engage them in dialogue about these understandings.

Corson (1997) argued for critical realism to become a guiding philosophy for applied linguistics, and made a case for three main principles of ethics to be considered in any moral decision-making situation: the principle of equal treatment, the principle of respect for persons, and the principle of benefit maximization.
According to Corson, a critically real approach would respond to these charges by devolving its research and decision-making processes down as much as possible to the least of the stakeholders. The need to involve stakeholders particularly in language assessment has been emphasized in recent discussions on ethics and fairness in testing. Shohamy (1997) urged testers and test-users to consider ways of minimizing unethical uses of language tests and suggested that it is essential to seek more democratic models of assessment. However, Weiss (1986a) contended that although the stakeholder approach to evaluation holds modest promise, it will not ensure that appropriate and relevant information is collected, nor will it increase the use of evaluation results. Rea-Dickins (1997), although agreeing with Weiss's position, emphasized the need to problematize the role of the stakeholder in language testing.

In summary, then, how can we put these principles into practice? I would argue that we will need to consider the different aspects of involving stakeholders very seriously—the implications of their position vis-a-vis the power they wield within the structure of the institution or society, their priorities and interests, the aspect of feedback and dissemination—and shape them into a coherent "package." The package would be a result of mutual agreement among those engaged in it, keeping
in mind what counts for just and unjust behavior. If we have reasons to believe that the stakeholders are at different levels, then the responsibility of striving for equality in their status would be an additional agenda for the given testing situation. It is only when we adopt an inclusive approach that every stakeholder has an equal chance of being empowered. Regarding the methodology of arriving at the package that reflects everyone's interests, which would have to be based on universal ideals such as honesty, morality, and the like, we will need to employ interactive and nonobjectifying methods, possible only through free and open dialogue with everyone concerned. To treat some as more capable of decision making than others would be to do violence to the principle of equality and justice. Taking on everyone's agenda does not mean that testers must subordinate their own agendas. Rather, there should be ongoing negotiation to ensure that everyone's needs are addressed.

Having said this, I would now like to examine two testing situations in India—one a high-stakes achievement test at the secondary level, and the other a national proficiency test launched by a deemed university (henceforth, the Institute) for assessing the language proficiency of 16+ users of English as a second language (ESL)—to understand the nature of stakeholder involvement vis-a-vis ethical behavior. In light of the evaluation of the two projects, I would like to raise some issues that need to be addressed in our future work.

EXAMPLE ONE: THE CBSE PROJECT

The curriculum of the Central Board of Secondary Education (CBSE), a national board in India, is based on a prescribed syllabus, and the end-of-course exam tests its achievement. The Board introduced a new English curriculum at Class IX level in 1993, and the exam was revamped from a content-based (memory-based) test to a largely skill-based test. The Curriculum Implementation Study (CIS) was set up in 1993 to monitor and evaluate the implementation of the curriculum, and the first new exam took place in 1994. I was the Project Director of the study. Here I examine the different aspects of the test in relation to the involvement of stakeholders in the test.

Test Design

About 50 experienced teachers were involved in designing the curriculum, which included the test as well as teacher orientation to the new curriculum. Of these, about 15 were involved in the design of the test. Although the teachers involved in deciding on the content and design of the test wanted the exam to be fully skill based, in keeping with the communicative approach to teaching and learning, the proposal was vetoed on the grounds that the weak learners, a substantial number in CBSE
schools in the country, would find the test difficult. This was perhaps necessary, because the Board had to play its cards safely. Indeed, any innovation, to be practically feasible, cannot be drastic; and large-scale orientation to a totally different approach to teaching-testing was not an easy task. However, the Board did not visualize a gradual change to a totally skill-based test in a phased manner. More important, a dozen teachers who did not represent the entire cross-section of teachers in the country could not have voiced teachers' and students' concerns.

The CIS, which was based on a stakeholder approach to curriculum evaluation, brought into its fold many more teachers to monitor and evaluate the curriculum from the "inside" and involved students, principals, and parents in understanding how the curriculum was being transacted and how it could be improved. It adopted a teacher-as-researcher approach, enabling teachers to be circumspect and to problematize the curriculum so that it could be adapted to the differing demands of a variety of contexts. It was during the project that the test design was examined in actual situations.

Preparing for the Test

The CIS threw adequate light on the different ways in which teachers and students came to grips with the different aspects of the test, such as the test format, scoring criteria, and test difficulty. More important, the project examined the continuous assessment scheme, recommended as part of classroom-based assessment (not to count for the grades in the final exam). The following were the highlights of the findings:2

1. The wash-back effect of the paper and pencil test was clearly visible at different levels. Teachers (and students) saw rehearsing "model" test papers as the most important part of preparing for the exam. The syllabus was completed 2 to 3 months prior to the exam, after which everyone was seriously engaged in exam preparation. The importance given to the final exam was also reflected in the year-end exams from Class V upward following the same test design, test format, and scoring criteria. The continuous assessment scheme, which focused on activities such as role playing and group discussions involving listening and speaking skills meant only for classroom purposes, did not do very well in a majority of schools. The test thus became the de facto new curriculum (Shohamy, 1997) that was different in nature and scope from the official curriculum. The teacher-orientation programs conducted during CIS to help teachers understand formative as well as summative testing seemed to benefit only the more resourceful and competent teachers.

2Details of the study are available in the final report of the project (CBSE-ELT Curriculum Implementation Study, 1997).

2. Most teachers defined their role as helping even weak learners pass the exam (this is the role expected of them in all types of schools) and therefore spent a lot of time on teaching students to answer the "seen" literature section. An analysis of the answer scripts across different regions of the country revealed that the low scorers (in the range of 30%–40%) had all gotten nearly 100% scores in the literature section, which could be rehearsed/crammed to quite an extent. Therefore teachers and students had learned ways of coping with the new curriculum in minimalist ways that did not affect their traditional approaches in any significant way.

3. The study indicated that, based on the performance in the exam, if more detailed feedback could be given to students and teachers in terms of section totals, problems faced by markers, the need to develop the scoring criteria, and the descriptors, it would go a long way in improving the performance and grading, as well as the use to which test results could be put. The Board could not, however, take on this burden only for English. It meant more work and more problems.

4. The Board was skeptical about the findings from the study during the first two years: When teachers expressed dissatisfaction with the different aspects of the test, such as the rigid format of grammar tests, faulty reading comprehension questions, nonrepresentativeness of different types of reading texts, or the distorted purpose of the literature section, it was interpreted as a hasty or uninformed view because the Board felt that the new curriculum had not been given an adequate chance. As the study progressed and as more teachers, parents, and students understood the instructions from the Board, they also learned to adjust their thinking and teaching-testing to make the best of the curriculum.
When the study concluded and a detailed report was submitted, it was seen as finished; no follow-up to the study was even considered necessary because the Board saw its responsibility as one of extending the project experience to other languages, such as Sanskrit, French, Hindi, and so forth. English in their view had already been overfocused.

Discussion

The stakeholder-oriented evaluation of the curriculum, it seemed, did not yield results to the extent expected of such a marathon research effort, due to several factors. The Board up to that point was an exam board that oversaw only exam-related and certification functions—setting hundreds of papers in different subjects for use in 3,000 schools in the country and abroad, maintaining secrecy, arranging for the grading of scripts, and announcing results. Although the education officers took on extra responsibility with the curriculum project, neither the structure nor their expertise allowed them to go beyond some rudimentary work. It was a bureaucratic setup where the chairman made decisions, and discussions of academic work, such as research, training programs, and problems faced by students or teachers, were treated as agenda items at the not-so-frequent meetings. The project with its
strong stakeholder orientation had not only provided for stakeholders' views on the curriculum through meetings, orientation programs, review workshops, seminars, dissemination of reports, and so on, but it had also illuminated them within the formative evaluation framework. It had democratized the process of curriculum implementation: teachers and students made meaning of the prescribed syllabus in ways that were appropriate to different needs and pinpointed problems that could be rectified. The textbook was no longer seen as a document, frozen in time and space, that had to be covered from the first page to the last, but as something that could be juggled around to achieve the language objectives. This democratization process was clearly in conflict with the bureaucratic setup of the Board. However, the illumination that took place among the stakeholders, as mentioned previously, could itself be seen as an important and positive outcome of the stakeholder approach to evaluation.

The project enabled teachers and students not only to consciously negotiate the process of test preparation and test taking and to explore ways of coping with it, but also to go beyond it. However, the test, mandated by the Board, is in a sense fixed in stone and non-negotiable. There was thus a collision between a static given, the test (with all its problems), and a dynamic, ever-changing research approach to testing. Teachers needed support from schools, and eventually the Board, to sustain the "research approach" and momentum generated during the project. It was seen that only a few schools could absorb the innovation the way it was intended, that is, only a few schools managed to create a learner-centered classroom with a focus on meaning. Teachers in these schools were not constrained by the curriculum, and they derived satisfaction from being able to do something meaningful for themselves and for their children. They were system free, silent innovators working in isolation.
The not-so-motivated teachers, clearly a majority, employed a result-oriented approach that was least risky because the final exam was the ultimate test of teachers' and students' performance. They knew, it seemed, that the initial excitement of the project would settle down to a normal, if sometimes dull, classroom routine. The more motivated teachers were slightly frustrated and disillusioned because the process of fine-tuning classroom processes that they had experienced was not found useful. A tracer study conducted three years after the life of the project corroborated this point further (see Mathew, 2001, for more detail).

The CIS project, which served as a pivot for carrying stakeholders' voices to the Board, could not fulfill the expectations of the empowered teachers or students. This is, however, the project's perspective. According to the Board, we had overstepped our brief in involving too many teachers and students unnecessarily—they expected some factual, "value-free" (Scriven, 1986) evidence of how the curriculum was being received (not how it was taking shape) in a limited number of representative schools. That the school-type categories, for example, government, private, and others, given to us by the Board defied any rigid and neat definition did

S T A K E H O L D E R IN V O L V E M E N T IN L A N G U A G E A S S E S S M E N T

129

not appeal to them. When it was reported that quite a few teachers in a majority of the schools found it difficult to understand the grading scheme, let alone convey it to students, it was interpreted as “it is too soon to come to any conclusion,” “teachers need more time to understand it,” and “more training programs are needed.” The training programs conducted by the Board only helped teachers to narrow the scoring criteria into easily definable, sometimes distorted, categories. Clearly the agenda of the Board did not match the project’s ideology or the actual ongoing qualitative outcomes. This did not mean that the project went against the Board’s basic philosophy of curriculum renewal through teachers’ professional development and vice versa. In influential circles, the Board expressed a sense of ownership of the project, and they took pleasure in making the other national organizations that did not initiate any such reform envious of their achievement. However, there is a matter of concern that needs our immediate attention. If, as Weiss (1986b) pointed out, stakeholders are defined as all those who are affected by the assessment or evaluation, then the project should have included the Board as one of the stakeholders. It was probably the responsibility of the project to illuminate them in the process of evaluation, just as it saw as its responsibility the empowerment of teachers or students. This, I think is the biggest challenge stakeholder-oriented evaluation faces. Left to themselves, stakeholders do not want to say or do anything—it is only when they are coaxed into the role of respondents, they “will make an effort to tell what they need to know, but much of what they say is a learned, stereotypical response” (Weiss, 1986a, p. 190). If Boards would like to maintain status quo and want to hear that everything is going well, then why sponsor an evaluation study? 
It is perhaps fashionable, or a matter of prestige, to sponsor or conduct evaluation studies, but that does not mean that evaluation results, qualitative or quantitative, will be treated with any seriousness. Sponsors are tough-minded, value-free disciples (Scriven, 1986) who are threatened by interpretations. On the other hand, to do or to commission a stakeholder-oriented evaluation, regardless of the use to which it will be put, may still be better, at least for the empowered stakeholder, than not involving stakeholders at all. The second example I describe here illustrates this point.

EXAMPLE TWO: THE NELTS TEST

The National English Language Testing Service (NELTS) test was launched by the Institute in 2000 for countrywide proficiency testing of individuals in the 16+ age group. It has been administered twice, and I, along with a few other like-minded colleagues, was involved in it from its inception until a little before its first administration. I was also the Academic Coordinator of the project.


MATHEW

The NELTS test was visualized as a proficiency test for assessing the language ability of individuals at different levels, regardless of their age, background, qualifications, experience, or the geographic area they represented. It also had two other aims: to bring about curriculum reform through beneficial backwash from the tests, and eventually to replace the existing exam system in the country. In its initial formulation, the project aimed to adopt a stakeholder approach by involving various stakeholders at various levels of test development and administration: test takers, teachers, authorities concerned with syllabus-based exams, and agencies concerned with admission tests in the country.

Design of the Test

The test was, to a large extent, a response to the fact that no single instrument, independent of any prescribed curriculum, was available in the country to help with language proficiency assessment. It was therefore decided to begin with a test of reading and writing skills and to take up the testing of listening and speaking skills in a second phase. The test was to be in a communicative format because it was felt that the users of test scores (not specified) would need to know how well the test takers could read and write in actual contexts.

Trialing of Tests

Trialing was not considered necessary because, according to the Vice Chancellor (VC), the sponsor, the testing team was experienced in the field (i.e., in teacher training and materials production, though not necessarily in testing). Further, the time frame given to the team did not allow for any trialing, although the sponsor was not averse to the idea if the team wanted to somehow do it. In fact, a small fund was advanced for the purpose, but the general feeling was one of impatience and a lack of trust in the team's ability to produce a test by the agreed date. I nevertheless proceeded to do the trialing on a national scale with different representative samples. Part of this study was reported at the SCALAR 4 Conference in 2001 (Mathew, Ramadevi, Fatima Parveen, Chaturvedi, & Rama, 2000).

What Happened After Trialing?

At a meeting of the VC and the team members, it emerged that some of the members were dissatisfied with the design, specifically the absence of a section on grammar and vocabulary; because these were important areas, candidates would want to know how they had performed in them. This dissatisfaction corroborated an earlier attempt by the VC and another set of "experts" to quickly produce a set of books that would help test takers to take the test. This we had managed to abort, on the grounds that until the test was in place, producing help books would be putting the cart before the horse. At this stage of the project, no concern was expressed about the need to pilot the test with a norming sample from the intended test-taking population. This task would of course be daunting, given the range of test-taker characteristics in a country such as India in terms of socioeconomic and language background, geographic region, prior education and experience, and so forth. In fact, it was felt that the trialing exercise would be futile, as it would only prove what we set out to prove. In what was seen as a kind gesture, it was suggested that the data gathered through the first administration could be used for the analysis of the test that we so desperately wanted. Three of us on the team quit the project because we could not compromise any further on matters relating to test design, test development, and grading. Our letter emphasized this point:

There can be several models of language proficiency resulting in different test designs, each valid in its own right. However, there are compelling reasons for working within a communicative language model. Only this will ensure a beneficial backwash on the teaching and learning of English in our country, given the low standards that prevail. Further, if the NELTS test has to compete with similar tests of international standing, we need to put the test through rigorous and recursive trialing procedures at a national level before it is finalized. ... We are unable to make further compromises on the academic aspects mentioned previously, which we feel will affect the quality of the test. (March 28, 2000)

I was allowed to step down as Coordinator but was asked to continue as a member of the reconstituted team because, according to the VC, "the group will certainly benefit from your long-standing expertise in testing and evaluation," which I declined. My reasons were as follows:

As mentioned in our earlier letter and in related discussions ... our team had been working with the goal of developing a set of standardized English language tests for the country. ... As a student of testing, I will not be in a position to endorse a test unless the necessary test development procedures are employed, since the validity and reliability of the test as an instrument for making judgments about candidates' language ability across the country could be questioned. These procedures in my view are extremely crucial and cannot be compromised. (April 7, 2000)

I was allowed to quit the team. The test has been administered twice and graded. It will probably be administered again.


DISCUSSION

This exercise illustrates the effort of an evaluator or sponsor who gets a group to work on a test from his own perspective. The test has stabilized over two administrations, and the team members are appointed from the Institute faculty by rotation. On the one hand, the sponsor asserts that for our clientele in India an ordinary question paper will do (such as those used by agencies like the Union Public Service Commission, Railway Recruitment Board, and Banking Services Recruitment Board, which put together their tests in utmost secrecy with the help of "experts"). On the other hand, the Institute uses its long-standing reputation to market a test put together by the faculty that may not be valid or reliable. In fact, Jacob Tharu (personal communication, 2001) asserted that candidates who take the test, even if only to find out how good their proficiency is or to obtain a certificate from the Institute that may have no immediate practical value, need a score that is reliable, valid, and usable. To argue that no one is seriously affected by the test because it is not a high-stakes test would amount to abusing an academic effort. This knowledge, however, is available only to the testers, not to the other stakeholders. This raises some important questions: Does not the inaccessibility of information amount to a breach of contract (see the point about contract theory mentioned previously)? Do the clients know this? Is it ethical? Is it fair? Whose interests does the test serve? MacDonald (cited in House, 1990) observed that evaluations usually serve the interests and purposes of bureaucratic sponsors or an academic reference group at the expense of those being evaluated. Because ethics essentially have to do with a public system of rules and are contractual, should not the Institute have involved the stakeholders in a "free and open dialogue" before and during the testing process? There is another point we need to consider here.
The Institute is a government organization and the country's premier institution for ESL teacher education. We have all along advocated teaching methodologies, evaluation procedures, and curriculum frameworks that are novel and that help improve the standards of English teaching in the country. The Institute thus has an educational angle to all its activities. All the teachers and researchers on our programs look to the Institute for a panacea in the teaching-learning context. When a proficiency test is launched here, it is with the same expectation: a "good" test with good washback that they can take back to their own contexts. In fact, the markers for the NELTS test all come from different teaching backgrounds. The responsibility of the Institute toward teachers, students, and the entire community is therefore far more serious and complex than if the test had been launched by a private business house. This was what was initially planned for in the piloting of the test: the involvement of teachers and students in concretizing the test design, construction, format, rubrics, scoring criteria, and grading, so as to empower them to make informed decisions. Because this was not considered necessary, the lessons we learned from the trialing exercise were not incorporated into the test-development process. When we insisted that we would have to follow certain basic procedures in our test development, our VC was categorical in his view that we did not have to follow Western ways of doing things. This relativist approach, as Davies (1997) pointed out, is untenable, especially in the field of education. The issue therefore remains live, especially from an ethical point of view: Bringing about awareness among stakeholders of ethical testing practices is a difficult but important responsibility of an educational institution, one that cannot be ignored or taken lightly.

UNRESOLVED QUESTIONS

We have here two examples of evaluation that have, in a sense, been difficult to evaluate from within. Given the broad premise of stakeholder involvement, the NELTS test could be dismissed as unethical, and the CBSE test could be accepted as ethical to some extent. As Cameron et al. (1993) suggested, to do social research on, for, and with stakeholders is certainly not a simple proceeding. We need to examine the projects more closely, with all their complexities. In the case of the CBSE project, although it had the backing of a monitoring and evaluation phase, the opportunities afforded by a stakeholder-oriented approach were not fully used. By empowering an array of concerned groups to play an active part in the evaluation, the project tried to make ethics and fairness a central tenet. However, there were problems at two levels at least: first, at the level of the users of the curriculum, mainly the teachers who were instrumental in mediating it inside the classroom; and second, at the level of the Board, which initiated the change. Although several individuals (teachers, principals, students, parents) found themselves elevated to the position of "change agents," many others remained untouched by the project's intentions and influences, because the study, in spite of its magnitude, simply could not reach the entire CBSE-school community in the country. Even those teachers who were excited by the new role the project had helped them take on did not have the needed support at the school and Board levels. The others were cynical, saw the innovation as a passing phase, and waited for things to settle down. The problem at the Board level, as mentioned earlier, was one of conflict between a bureaucratic, top-down approach to curriculum renewal and the democratizing process the study initiated through the stakeholder, bottom-up approach.
This was compounded by the unequal power relationship (the Board administers an end-of-school high-stakes test that scores of students across the country take with no say in the matter) and by the hidden agenda that sponsoring an evaluation need not mean that future actions will be based on its results. Indeed, it is difficult to bring about change among many of the stakeholders because they are sympathetic to the fact that policymakers cannot please everyone, especially in a country as large and diverse as India. An unusual ethical position suggested by House (1990), however, occurs when evaluators make themselves vulnerable to those evaluated, thus redressing the balance of power between the two parties. Another question is the extent to which the developmental and dynamic nature of a large-scale study meets the expectations of a sponsor who sees value in summative evaluations and outcomes. The main funding agency (the Department for International Development, England) was at many points dissatisfied with the unintended outcomes and with the extension of the project framework to include many questions not anticipated at the start. It could not accommodate the qualitative and ethnographic orientation the study adopted and was interested in more discernible and more easily usable outcomes. Does this imply that stakeholder approaches to evaluation, to qualify as ethical and to be of practical relevance, should take a positivist stance and be summative or quantitative? Regarding the NELTS experience, I quit on ethical grounds, but was that a professionally unethical move, to watch an unethical practice from the outside, as it were? Bernard Spolsky (personal communication, 2001) commented that it is only when we are part of the evaluation process that we get a voice and can make a difference, not when we are outside it. I do not subscribe to this view, for the following reasons: Responsibilities and rights go hand in hand. To fulfill our responsibilities as testers, we need certain rights: to have adequate funding for quality test development, to be told the truth about the intended purpose of the test, to be permitted to tell the truth about all aspects of the test, to be given access to all stakeholder groups, and to express expert views on how tests affect different stakeholders. In the absence of these academic facilities, I could not fulfill my responsibilities as a tester.
That, however, does not take away my right to critique the testing context as an outsider; even if I had been part of the team, I would have considered it my right, as well as my responsibility, to evaluate it from the inside, just as I have tried to do with the CBSE study. I think that the ongoing evaluation of any testing situation from both an insider and an outsider perspective is what lends it credibility and in turn enhances ethicality. We have to be wary of the impoverished agendas that organizations have regarding fairness, reliability and validity, absence of bias, and other issues such as opportunities to learn and testing accommodations. Unless we, the members of these organizations, make systematic and ethical procedures a serious business in our testing work, it may become just a political exercise. It was some time ago that Spolsky (1981) urged all testers to be committed to the improvement of tests to the highest level of their knowledge. His plea, unfortunately still relevant today, was

to avoid mysticism, to keep testing an open process, to be willing to account for our tests, and to explain to the public and to the people we're testing what we're doing. When we're forced to keep reexplaining what we're doing, then we'll be forced, I think, to be more honest in what we do ourselves. For testing to remain open, it must be possible for the layman to understand what we're doing. (p. 20)

STAKEHOLDER INVOLVEMENT IN LANGUAGE ASSESSMENT

In conclusion, I would like to stress that although the stakeholder approach to assessment holds modest promise, there are unresolved issues that need to be addressed by carefully examining the roles of stakeholders vis-à-vis their relationships with one another and by prioritizing their concerns on the basis of the relative weight assigned to them. It is important for all the stakeholders to come together and openly agree on a common framework for evaluation. It is only when things are transparent and fair that ethicality begins to emerge. Davies (1997) rightly commented: "This is the proper role for the profession: that it states just what imposition is made on the public and why it is necessary to use the method it does" (p. 335).

REFERENCES

Cameron, D., Frazer, E., Harvey, P., Rampton, B., & Richardson, K. (1993). Ethics, advocacy and empowerment: Issues of method in researching language. Language and Communication, 13, 81-94.
CBSE—ELT Curriculum Implementation Study (1993-97): Final unpublished report. (1997). Hyderabad, India: Central Institute of English and Foreign Languages.
Corson, D. (1997). Critical realism: An emancipatory philosophy for applied linguistics? Applied Linguistics, 18, 166-188.
Davies, A. (1997). Demands of being professional in language testing. Language Testing, 14, 328-339.
Freire, P. (1972). Pedagogy of the oppressed. London: Penguin.
House, E. (1990). Ethics of evaluation studies. In H. Walberg & G. Haertel (Eds.), The international encyclopedia of educational evaluation (pp. 91-94). New York: Pergamon.
Mathew, R. (2001, May). Tracing the after-life of teacher development programs: Reopening closed chapters. Paper presented at the annual meeting of the American Educational Research Association, Seattle, WA.
Mathew, R., Ramadevi, S., Fatima Parveen, M., Chaturvedi, S., & Rama, K. (2000). Testing the tests: Results from trialling. CIEFL Bulletin, 11, 1-15.
Rawls, J. (1999). A theory of justice (Rev. ed.). New Delhi, India: Oxford University Press.
Rea-Dickins, P. (1997). So, why do we need relationships with stakeholders in language testing? A view from the UK. Language Testing, 14, 304-314.
Russell, B. (1945). A history of Western philosophy. New York: Simon & Schuster.
Scriven, M. (1986). Evaluation as a paradigm for educational research. In E. R. House (Ed.), New directions in educational evaluation (pp. 53-67). London: Falmer.
Shohamy, E. (1997). Testing methods, testing consequences: Are they ethical? Are they fair? Language Testing, 14, 340-349.
Spolsky, B. (1997). The ethics of gatekeeping tests: What have we learned in a hundred years? Language Testing, 14, 242-247.
Weiss, C. (1986a). The stakeholder approach to evaluation: Origins and promise. In E. R. House (Ed.), New directions in educational evaluation (pp. 145-157). London: Falmer.
Weiss, C. (1986b). Toward the future of stakeholder approaches in evaluation. In E. R. House (Ed.), New directions in educational evaluation (pp. 186-198). London: Falmer.

LANGUAGE ASSESSMENT QUARTERLY, 1(2&3), 137-150
Copyright © 2004, Lawrence Erlbaum Associates, Inc.

A Code of Practice and Quality Management System for International Language Examinations

Piet Van Avermaet, Henk Kuijper, and Nick Saville
ALTE Code of Practice Working Group

The Association of Language Testers in Europe (ALTE) was formed in 1990, and there are now 27 institutional members and associates representing 24 European languages. A key objective of the association is to establish professional standards for all stages of the language-testing process to provide language learners with access to high-quality examinations, and in 1994 a code of practice was published. This article discusses the ways in which ALTE members have attempted to put their code into practice, and in particular how quality management systems can help to bridge the gap between theoretical issues and practical reality. A number of key issues have arisen in carrying out this work: The first covers self-assessment as part of an ongoing process of quality improvement; the second covers the question of minimum standards and how these should be arrived at. In both cases the question of how to achieve a reconciliation between diversity and standards arises, especially in relation to the many different organizations that now make up the ALTE membership.

BACKGROUND

The Association of Language Testers in Europe (ALTE) was formed in 1990 at an inaugural meeting hosted by the Generalitat de Catalunya in Barcelona. The meeting was attended by members of a number of leading European institutions involved in the assessment of their own languages as foreign languages. Representatives from eight of these institutions who attended that meeting became the founding members of ALTE, under formal arrangements for the establishment of an association that had been agreed on in 1991.

Requests for reprints should be sent to Nick Saville, University of Cambridge, ESOL Examinations, 1 Hills Road, Cambridge CB1 2EU, UK. E-mail: [email protected]


It became clear from the start that the institutions represented in Barcelona shared common concerns in language assessment, and the establishment of a framework of levels for international language examinations was discussed at length. There was also a concern for high standards in language assessment and for fair treatment of the candidates who take the exams. It was recognized that, if the exams were to be placed on a framework of levels, they would also have to be compared in ways other than level of difficulty. This would need to cover many areas of the testing process, including item writing, administration, marking, grading, test analysis, and evaluation. In addressing these matters, it was realized that a long-term perspective was needed, and indeed these shared concerns have provided a focus for ALTE members working together over a period of more than 10 years. Since 1990 the members of ALTE have met regularly (twice a year) and have collaborated on a range of joint projects (some funded by the European Union's Lingua Bureau). There are now 27 institutional members and associates, representing 24 European languages (including widely spoken languages, such as English, French, German, and Spanish, as well as many less widely spoken languages, including those of Scandinavia and central Europe). The members now include ministries of education and other government departments, examination boards, cultural bodies, university departments, and language teaching organizations. In some cases, language testing consortia have been formed within ALTE, as, for example, in Denmark, where the Ministry of Education now collaborates with Copenhagen University and the Studieskolerne on the testing of Danish. Despite the wide diversity that now exists within the membership, concern for quality and fairness continues to be a priority, as does a focus on professional standards.
The latest joint project, which is the topic of this article, is the development of a quality management system (QMS) for the ALTE members. The nature of the organization is covered only briefly here, as it is addressed elsewhere in publicly available documents in print and electronic formats,1 but the practical difficulties of establishing quality standards across diverse contexts are dealt with in subsequent sections, and the role of powerful organizations and their influence in such cases is also touched on in this article.

DEVELOPMENT OF THE ALTE CODE OF PRACTICE

As early as 1991 it was agreed that a key objective of the association should be "to establish professional standards for all stages of the language-testing process," the aim being to provide language learners with access to high-quality examinations that would help them to meet their life goals and have a positive impact on their learning experience. Striving for fairness has been a unifying theme in the work of the association and in the outcomes of the projects carried out over the past decade. Without reference to technical jargon, most people have some idea of what fairness might mean in language examinations; for example, it is important for all candidates to be treated equally and for the results they achieve to be a fair reflection of what has been taught and learned. Many would also agree that the outcome, in terms of the certificate awarded, should be useful as a qualification and have a recognized value for study, work, or other purposes, preferably both at home and in other countries. Although fairness has always been a concern in educational contexts, in recent years the concept has also received particular attention from testing professionals, alongside familiar psychometric concepts such as validity and reliability. Varying views of what fairness means have emerged, but typically these reflect the layman's concern for the just treatment of groups and individuals in society and relate to issues such as access, equity, lack of bias, trustworthiness, and dependability. Within ALTE, the founding members agreed that it was important for both examination developers and examination users to follow an established code of practice that would ensure that assessment procedures are of high quality and that all stakeholders are treated fairly. A code of practice of this kind had to be based on sound principles of good practice in assessment that allow high standards of quality and fairness to be achieved. The discussion of what constitutes good practice within ALTE has continued since then and reflects a concern for accountability in all areas of assessment undertaken by the ALTE members. In this respect it recognizes the importance of validation and the role of research and development in examination processes.

1See the ALTE Web site (www.alte.org) and handbook (Association of Language Testers in Europe [ALTE], 1998) for more information.
In publishing its first Code of Practice (Association of Language Testers in Europe [ALTE], 1994), ALTE set out the standards that members of the association aimed to meet in producing their language exams. It drew on the Code of Fair Testing Practices in Education produced by the Joint Committee on Testing Practices, Washington, D.C. (1988), and was intended to be a broad statement of what users of the ALTE examinations should expect and of the roles and responsibilities of stakeholders in striving for fairness. The Code of Practice (ALTE, 1994) identifies three major groups of stakeholders in the testing process: the examination developers (i.e., examination boards and other institutions that are members of ALTE); the primary users, that is, the candidates, who take an examination by choice, direction, or necessity; and the secondary users, who sponsor the candidates or who require the examination for decision making or some other purpose (e.g., parents, teachers, employers, admissions officers). In addition, the Code of Practice lays down the responsibilities of these stakeholder groups in four broad areas: developing examinations, interpreting examination results, striving for fairness, and informing examination takers. An important feature is that it emphasizes the joint responsibility of the stakeholders and focuses on the responsibilities of examination users as well as examination developers in striving for fairness. In North America, in the fields of psychological and educational assessment, there has been a long tradition of setting standards, as reflected in the Standards for Educational and Psychological Testing (American Educational Research Association, American Psychological Association, & National Council on Measurement in Education, 1999). Previous editions of these standards have been around since the 1950s and have influenced language test developers and test users, especially in the United States. They are regularly referred to in the language testing literature (see, for example, Bachman, 1990, and Bachman & Palmer, 1996). The latest edition includes several new sections that were not in the earlier version (1985), including Part 2 of the volume, entitled "Fairness in Testing." The subsections in this part cover fairness in testing and test use, the rights (and responsibilities) of test takers, and the testing of individuals from different backgrounds or with disabilities. More specifically, in the field of language assessment there has been a growing interest in this topic, including the development of codes of practice and an interest in ethical concerns. The International Language Testing Association (ILTA) conducted a review of international testing standards (1995) and later published its Code of Ethics (International Language Testing Association [ILTA], 2000); this document presents a set of nine principles with annotations that "draws upon moral philosophy and serves to guide good professional conduct." In 2000, a volume of articles edited by Antony Kunnan was published in the Cambridge Studies in Language Testing series, based on the 19th Language Testing Research Colloquium and entitled Fairness and Validation in Language Assessment (Kunnan, 2000).
This volume focuses on fairness in two sections: "Concept and Context" and "Standards, Criteria and Bias." Kunnan himself has been developing a "fairness framework" that seeks to integrate the traditional concepts of validity and reliability with newer concepts such as absence of bias, fair access, and fair administration. His recent work in this area was presented at the ALTE Conference in Budapest in November 2001 (Kunnan, 2004) and formed part of a debate within ALTE on fairness and the ALTE Code of Practice (ALTE, 1994). Although the ALTE Code of Practice (ALTE, 1994) outlines its broad principles in 18 general statements, it provides little practical guidance to the practitioner on how these principles should be implemented or how standards can be set and guaranteed. In attempting to address this issue, the ALTE members produced a supplementary document entitled Principles of Good Practice for ALTE Examinations (drafted by Milanovic & Saville, 1993) that was discussed at ALTE meetings in 1992 and 1993 (Alcala de Henares, Paris, and Munich). This document was intended to set out in more detail what the principles might mean in terms of the language testing practices that ALTE members should adopt to achieve their goal of high professional standards. The approach to achieving good practice was influenced by a number of sources from within the ALTE membership (e.g., test development work being carried out by the University of Cambridge Local Examinations Syndicate [UCLES] in Cambridge, England) and from the field of assessment at large. These included the work of Bachman and Palmer (1996) on "usefulness," Messick (1989) on validity, and earlier versions of the American Psychological Association standards (e.g., 1985). ALTE members sought feedback on the document from external experts in the field, and it was discussed again in Arnhem, The Netherlands, in 1994. Although it was not published in its entirety, parts of the document were later incorporated into the Users Guide for Examiners (1997) produced by ALTE on behalf of the Council of Europe.2

ALTE QUALITY MANAGEMENT SYSTEM (QMS)

In the late 1990s ALTE turned its attention to this question by reestablishing a Code of Practice Working Group to take the project forward.3 There was a feeling that, despite the Code of Practice (ALTE, 1994) and a focus on the underlying theoretical issues, there was still no way of systematically providing evidence of standards being met, and there was some criticism that ALTE itself had made little progress in establishing standards. In setting up a working party in 2000, ALTE members were keen that the group should represent the different perspectives of the expanded ALTE membership and the varying historical, cultural, and linguistic influences. To this end, members of the Working Group were selected from ALTE member organizations to provide representation of a range of types of institution, including two examination boards, a university department, a ministry of education, and a cultural body. They also represent widely and less widely spoken languages and have varying degrees of experience in conducting language assessment on an international scale. In 2001 the Working Group met and reported on progress at the full ALTE meetings (in Perugia, May 2001, and Budapest, November 2001). At the time of writing, two substantive outcomes had been achieved: (a) a revised version of Principles of Good Practice for ALTE Examinations (2001) has been produced and is being used by members; and (b) the Code of Practice (ALTE, 1994) itself has been redesigned and expanded as a checklist to be used by members as part of a QMS. The revised version of Principles of Good Practice for ALTE Examinations (2001), based on the earlier version, has been updated and reworked in many parts.
2This document has been renamed Language Examining and Test Development and is being published by the Council of Europe to accompany the Common European Framework of Reference for Languages (Council of Europe, 2001), marking the European Year of Languages 2001.

3In the mid-1990s a number of other projects were a high priority for ALTE, including growth in membership (especially with new categories of membership) and the work on the ALTE Can Do Project, which provided an empirical basis for linking the ALTE levels to the Common European Framework.


VAN AVERMAET, KUIJPER, SAVILLE

It addresses in more detail the central issues of validity and reliability and looks at the related issues surrounding the impact of examinations on individuals and on society. This version, like the earlier drafts, draws on the revised American Educational Research Association, American Psychological Association, & National Council on Measurement in Education (1999) standards, especially in the sections on validity and reliability, as well as the work of Bachman and Palmer (1996). As a working document, it is likely that this will be revised further in light of feedback. The new dimension introduced into the work at this stage is the concept of total quality management (TQM) and quality management systems (QMS). QMS have been widely adopted in both manufacturing and service industries with the aim of improving quality to "meet customer requirements." More recently the same approach has been adopted in many educational contexts across Europe, including within national education systems and by international, nongovernmental organisations.4 Customer focus is a common theme in both the business and educational contexts, and this is the same for ALTE as examination providers. The main customers of the ALTE members are the candidates who take the examinations (primary users) and who require high-quality assessment systems to be treated fairly. By adopting the quality management approach, it is hoped that the goal of "continuous improvement" will help to ensure that standards continue to rise and the requirements of the "customers" for the exams can always be met. The adoption of a quality management approach has led to the reworking of the Code of Practice (ALTE, 1994) to reflect the practical aspects of assessment work within the ALTE membership. The aim is to establish workable procedures and programs of improvement that ultimately will be able to guarantee minimum quality standards based on the principles in the Code of Practice.
The approach to quality management that is being implemented is based on the following key concepts taken from the literature on quality management: the organization, self-assessment, and peer monitoring. In this approach, it is important to identify the roles and responsibilities of key stakeholders in the ALTE institutions and to apply the system with flexibility according to the specific features of each organization (i.e., the different ALTE members and their stakeholder groups). In this phase of the project, the Working Group has been focusing on the different organizational factors within the ALTE membership and on the range of diversity that exists. In seeking to establish standards, it is not the aim to make all ALTE members conform to the same models of assessment for all languages represented, and it is important to recognize the varied linguistic, educational, and cultural contexts within which the examinations are being developed and used. An appropriate balance is required between the need to guarantee professional standards to users and the need to take into account the differing organizational features of the ALTE institutions and the contexts in which their exams are used.

4See, for example, the work of the European Foundation for Quality Management: www.efqm.org


The Working Group recommended that all members should attempt to identify their current strengths and the areas in need of immediate improvement within their own organization. On this basis, it would then be possible to establish the desired outcomes for both short- and long-term developments. The aim should be to set minimum acceptable standards, to establish "best practice" models, and to aim at continuous improvement (i.e., to move toward best practice). It is axiomatic in this approach that improvement is always possible. Therefore it is unlikely that, at any given time, any ALTE member will meet best practice in all areas (although many, if not all, will achieve minimum acceptable standards). The aim for all members should be to continue to share expertise and gradually to raise standards over time, that is, to aim at the best practice models through an ongoing process of development. In a QMS of this kind, standards are not imposed from "outside" but are established through the mechanism of the system itself, and the procedures to monitor standards are based on awareness raising and self-assessment in the first instance. External (peer) monitoring is introduced at a later stage to confirm that the minimum standards are being met. In its current form the Code of Practice (ALTE, 1994) has been reworked to function as an awareness-raising tool at this stage of the project. The redesigned format now reflects the four aspects of the test development cycle with which all ALTE members are familiar: examination development, administration of the examinations, processing of the examinations (including the marking, grading, and issue of results), and analysis and postexamination review. These points were discussed at the ALTE meeting in Budapest in November 2001, where it was agreed to move to the next stage of applying the revised checklists.
In January 2002 all members were required to complete the revised Code of Practice checklists for at least one of their examinations and to send feedback to the Working Group for discussion and review at a Working Group meeting in March. The Working Group then reported back to the full membership in Saint Petersburg in April 2002. The Code of Practice Working Group now plays a central role within ALTE—fairness and quality assurance are high on the ALTE agenda. By proceeding in this way ALTE members are constantly aware of the different contexts in which they all work and the various backgrounds from which the members come, and ALTE is careful to respect these differences through its consultation processes and decision-making mechanisms. In the next parts of the article, two key issues that have emerged as part of this process are discussed. They relate to some of the fundamental features of a QMS as a means of improving standards. The first discussion covers self-assessment as part of an ongoing process of quality improvement; the second covers the question of minimum standards and how these should be arrived at. In both cases the question of how to achieve reconciliation between diversity and standards is discussed in relation to the many different organizations that now make up the ALTE membership.


QMS—A CONTINUOUS PROCESS OF SELF-EVALUATION AND QUALITY IMPROVEMENT

It was noted in the introduction that ALTE is now an association with members from many countries in Europe, and it is not surprising therefore that there are large differences among the members with respect to the organizational, linguistic, educational, and cultural contexts within which the examinations are developed and used. Furthermore, within the institutions themselves there are wide differences in knowledge and traditions with respect to statistical and empirical issues in assessment, such as data gathering, data analysis, equating different examinations, and so forth. In the early discussions of the Working Group, these differences were looked at from the point of view of the different organizational types and the examination systems that are currently in place. The group realized that introducing a system of quality control could be very threatening for some members who know for themselves that they do not meet high standards at the moment—particularly when compared with other members of the association. The approach to QMS that was adopted was therefore designed to lower anxiety and was meant to be a supportive tool. The aim was to allow members (a) to enhance the quality of their examinations from the perspective of fairness for the candidates; (b) to engage in negotiations with their senior management and sponsors in a process of organizational change, where necessary (e.g., to ensure that resources are made available to support ongoing improvements); and (c) to move from self-evaluation to the possibility of external verification to set agreed and acceptable standards. ALTE members have now accepted a time schedule that will lead them through the different stages of quality improvement from self-evaluation, through peer monitoring, to formal monitoring and meeting minimum standards.
A very important aspect of this schedule is that strong stress is put on the process of quality enhancement and awareness raising, and that the ultimate goal of a quality mark—or Q mark—is postponed as a longer-term objective. Involvement in the process and awareness of the importance of quality are in fact necessary preconditions to making a QMS really workable. In the self-evaluation stage ALTE members have agreed to go through the following cycle: describing their examination development process by filling in the checklists; making judgments on the aspects of their examination development by rating aspects as "in need of improvement," "adequate," or "good practice"; setting priorities for enhancing the quality of aspects that are in need of improvement; filling in the checklists again when improvements have been realized; and going through this cycle again. This approach has so far led to a growing interest and the active involvement of all ALTE members. The activities carried out have already raised awareness among ALTE members of the strong and weak points in their own examinations, and this awareness, rather than being imposed from outside, is felt by the members themselves. The completion of the revised checklists in January 2002 meant that the Working Group was able to carry out a preliminary trend analysis and a general overview of the quality of ALTE examinations as a whole. As described previously, the revised checklists exist in four main units or modules: test design and construction; administration—including the conduct of the exams; processing—including the marking, grading, and issue of results; and analysis and review of the examinations. The first trend analysis led to the following conclusions. As far as test design and construction are concerned, ALTE members in general follow standardized procedures in developing tests, mostly based on constructs of communicative language ability and through careful content descriptions of test items, tasks, and components. Despite all kinds of differences, this aspect of test construction generally meets minimum standards; that of course does not mean that improvement is not needed in some areas. The administration part reveals great organizational differences among the ALTE members, such as the number of test takers, whether the examinations are administered in-country or abroad, and so on. Many members, though, feel that their procedures have to be reconsidered and improved in the near future, and they have set themselves some short-term goals for this. Processing and analysis are the areas that many members feel are in the greatest need of improvement. In this respect ALTE can function as a forum that can provide opportunities for these areas to be discussed and improved through workshops, exchanges of methodologies, consultancies, and peer monitoring. The self-evaluation approach has led all ALTE members to agree that the future activities of members at ALTE meetings, and in workshops in between, will be centered around the issues that seem to be most in need of improvement based on the completed checklists. Furthermore, other issues arising from the Code of Practice work will continue to be addressed at future ALTE meetings.
Based on recommendations by the Working Group, the following areas have been identified:

i. The development of routine procedures for data gathering, data entry, and data analysis: (a) ALTE members will have to gather data and analyze their examinations by means of pretesting or postexamination analysis to be able to demonstrate the quality of their examinations and the fairness of decisions made based on examination results; (b) these procedures will play a role in providing information on validity and reliability in relation to the use of tests.

ii. The responsibilities of test developers for the social impacts of their tests: The discussions within ALTE as a consequence of introducing the Code of Practice have already led to discussions about the responsibilities of test developers in the following areas: (a) tests in the context of immigration, and (b) tests for citizenship.

iii. Dealing with candidates with disabilities or who require special considerations or "accommodations" in taking the examinations.


QMS AND MINIMUM STANDARDS: ISSUES OF CONTEXTUALIZATION AND DIVERSITY

From the previous discussion, it is clear that in ALTE, as in the world of testing generally, the setting of standards is considered to be very important. But, as one knows, agreeing on which standards to set and then putting them into practice can be a tricky activity. Important questions are raised relating to the operation of "power mechanisms" and whether norms can be imposed or should be introduced through a process of negotiation. From one point of view, setting standards is rather like "making rules," and therefore one has to be aware that power and influence come into play in this process. By taking their own norms as a point of reference, those who are in powerful positions are often able "to make the rules" and may exert undue pressure on others in doing so. But it could be argued that this need not be the case; as long as all the "partners" who are in the process of setting standards have the opportunity to discuss the norms and the potential differences in existing norms, each of the partners in the process can bring in his or her own institutional, national, and cultural arguments, and a whole range of considerations and possible constraints can determine the outcome. Four potential outcomes can be distinguished. One outcome of a discussion of this kind could be that partner X adopts or adapts to the norms of partner Y. A second option is partner Y adopting or adapting to the norms of partner X. A third outcome could be that both partners accept new, more neutral norms for both. And a fourth outcome of the discussion could be that the differences in norms remain but that each partner accepts these differences as meeting the necessary criteria. Each partner would maintain his or her norms as being valid for his or her own particular situation. Indeed this last outcome could have a positive effect on the future cooperation between the partners.
However, if any of the partners seeks to impose his or her norms on the others, and these norms then become the rule, one comes to the tricky part of the business. In principle, norms should be open to debate and set by consensus. Rules, however, may not be. A rule often takes on the status of something that is "set in stone" and is not up for discussion; once the rule has been made and established, it is often not felt appropriate to attack that rule. This can, of course, be considered a subversive power mechanism; we often forget that rules were set by people who at some time in the past used their own positions of power or authority to influence the outcome and used their own norms as a point of reference in setting those rules (Ullman-Margalit, 1977; Bartsch, 1987). ALTE wants to avoid a situation in which standards are set through these kinds of power mechanisms, and the work on QMS is part of the ALTE approach that seeks to prevent this from happening. There is a second reason why one has to be very cautious when setting standards. That is the attraction of those "in power" or who have the most prestige in the eyes of the other members of the group. The "attraction mechanism" (Bourdieu, 1991) works as follows. If one supposes, as a member of the group, that one can gain (at least symbolically) by adapting to rules that have been set out by the most powerful members of the group, there is a temptation to do so. One may try to bridge the gap between "us" and "them," despite the institutional constraints or different cultural contexts that may exist. If, however, one supposes that the imposed standards or rules cannot be met, and one believes that by trying to bridge the gap one may lose out (again possibly in a symbolic way), one may be inclined to praise and defend one's own norms. Maintaining one's norms can be very good, as has already been said. However, if this is an irrational reaction against something one cannot achieve, the case is less obvious. In that case constructive discussion can be nipped in the bud. This process of attraction to the powerful elite has to be avoided. Being a member of an organization such as ALTE could potentially reinforce this mechanism of "attraction of the powerful members." But the Working Group within ALTE, in developing a QMS, has attempted to take these aspects into consideration. As stated before, the differences among the ALTE members are large with respect to organizational, linguistic, educational, and cultural contexts. For example, an international examination board such as UCLES has a long history, conducts English as a foreign language exams all over the world, and has many thousands of candidates a year. The Luxembourg examination board, by contrast, has only a limited number of candidates, who take the exams within the country itself. There are also huge differences in knowledge and tradition with respect to all aspects of the examination cycle, including statistical and empirical issues, such as data gathering, data analysis, the equating of different examinations, and so forth.
It makes no sense for the Luxembourg examination board to feel attracted to many of the standards of major examination boards (such as Cambridge ESOL [English for Speakers of Other Languages] or CITO [Netherlands National Institute for Educational Measurement]), and it would not gain from doing so. But, despite these kinds of differences, like all the members of ALTE, it shares a commonly felt need for fairness in its examination systems and recognizes that sound principles must underpin its work. If it were to be the case that the cultural, logistic, psychometric, processing, and other aspects of the larger institutions were to function as the only point of reference for the QMS, such a system would become threatening for those members who know for themselves that they will not (currently or ever) be able to meet those standards. But even for members who feel "attracted" to the norms of the larger institutes and can meet them, adopting or adapting to their norms might have negative effects were the specific characteristics of their own institution to be affected. ALTE members do not wish to fall into the trap of believing that standards should be determined or imposed only by those who have the symbolic power and that the others have to be "attracted" by this and should try to achieve these standards at all costs. To avoid this, a lot of emphasis has to be put on the step-by-step development of QMS and the concept of continuous improvement. On the other hand, at the end of that process one is striving to ensure that minimum standards will be in place based on the Code of Practice (ALTE, 1994) and a shared understanding of sound principles of good practice. These will be standards that have been selected and determined on the basis of in-depth discussions with all members, where every member's context has been taken into account. Of course institutions that are able to and want to go beyond these minimum standards will do so and will strive for best practice models. However, a quality label can be achieved when the exams of an institution meet the minimum standards. The consequence is that every member, after a period of self-assessment and peer monitoring, should be able to meet these minimum standards, and individual members have the opportunity to go beyond that minimum level. In this way, the variation that exists among all the ALTE members, including the different cultural, national, and institutional contexts in which they each work, can be taken into account and acknowledged. Some examples can clarify this perspective. One aspect of the Code of Practice (ALTE, 1994) for external exam providers concerns the roles of examiners and teachers. Good practice would suggest that, for public examinations such as those offered by ALTE members, a teacher of a candidate should not also be the examiner. However, some smaller partners in ALTE sometimes have only a few candidates to examine in a country where there is also only one qualified teacher. For example, in 2001 the Certificate of Dutch as a Foreign Language (CNaVT) had only one candidate in Argentina. In such a case, one could argue, it is inevitable that the teacher should also be the examiner if the exam is to take place at all.
If the minimum standard were to be that the examiner can never be the teacher, the CNaVT would never be able to reach that standard—not because it does not want to meet it, but because of its own institutional constraints. If the CNaVT, however, were to try to meet that standard regardless, the benefits would not be in proportion to the costs. This standard would not take into account the limitations of that partner. The minimum standard here can be that every member of ALTE has to strive for the test to be administered in a fair way by a person who is beyond all suspicion of bias or malpractice. There may be many suitable checks that could be carried out to ensure this. Another example concerns reliability, difficulty, and discrimination, all considered to be important features of tests, which the test providers in ALTE should estimate for the components of their examinations. However, some countries do not have a tradition of systematically gathering item-level data to carry out statistical analyses; perhaps they should, but the reality is different. Some smaller members of ALTE do not have a specialist psychometric department within their organization to carry out this kind of work on a routine basis. This does not mean that these features are not recognized as important and to be accounted for in some way. A minimum standard for reliability should be that reliability is estimated and reported in an appropriate way. How this is done may depend on the potential of the organization to carry out data collection and analysis. If institutions have the possibility to do more advanced analyses, this of course can only be encouraged and should be an objective. A third example is about the pretesting of exams. Larger institutions often have the possibility to do pretesting and analysis every time they construct a new examination paper. They have enough staff and financial resources to do it, and there is a target group of candidates that is large enough to make up an appropriate sample of candidates for the pretesting. Smaller members often do not have these possibilities. A minimum standard cannot be that pretesting is a precondition, for example, to investigate item characteristics and to ensure that "differences in performance are related to the skills under assessment rather than irrelevant factors." A minimum standard in this case should be that adequate procedures have to be provided and described to ensure that differences in performance are related to the skills under assessment rather than irrelevant factors. This may be done, for example, at different stages of the examination cycle, starting with careful test design and task construction and then carrying out appropriate postexam analysis and grading procedures. These are a few examples of how minimum standards can be set and, at the same time, respect for the variation among the members maintained.

CONCLUSION

The ALTE Code of Practice (1994) has been incorporated into a QMS, and thus it is developing from a theoretical document into a very practical tool for enhancing the quality of the ALTE examinations. Self-evaluation and peer monitoring turn out to be crucial for real acceptance of the QMS by all ALTE members. This is a necessary precondition for continuing the process into the stage of setting minimum standards that are not just externally imposed and threatening, but accepted by the ALTE members as necessary to have examinations that are fair for the test takers. The QMS functions as a tool for members to enhance the quality of their examinations from the perspective of fairness for the candidates. It can also function as a tool for discussions and debates among partners about the quality and fairness of the different procedures and steps in the running of exams. And finally it can also be a tool to open up discussions and negotiations with the funders of examination providers. It is often increased funding and other aspects of organizational change that will lead to greater possibilities of increasing quality and raising standards. From the discussion of these issues, it is clear that the world of language testing is about far more than abstract concepts. Above all it is about people and, in the case of ALTE, the full range of stakeholders that make up the ALTE "organizations" and constituencies of test users. In testing, as in the world at large, one sometimes looks through "tinted glasses" and sees only the concepts and constructs. People are too easily ignored or even condemned when only the concept is seen—for example, focusing on reliability or validity as concepts, on immigrants as a concept, or even on standards as a concept. These concepts are often approached as monolithic and static systems, and even in this era of postmodernism, one can forget the people behind the concepts. When thinking about testing, dealing with test construction, setting standards, and so on, a perspective has to be taken that incorporates and takes into account the dynamics of cultural and social diversity. It is also clear that testing is often about power and politics, and to some extent this will always be true—for example, when a government or other body sets up a testing policy and gives a mandate to a test developer to produce a test. This is true of test construction, of the use and misuse of tests, of their interpretation and misinterpretation, and of their social consequences. One has to be aware of the power mechanisms that are in play at every step that is taken. From a perspective of fairness one has to try to avoid falling into these traps, as ALTE is trying to do with the setting of minimum standards.

REFERENCES

ALTE. (1994). The ALTE Code of Practice. Retrieved from http://www.alte.org/quality_assurance/index.cfm
American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (1999). Standards for educational and psychological testing. Washington, DC: American Educational Research Association.
Association of Language Testers of Europe. (1994). Code of practice. Cambridge, England: Author.
Association of Language Testers of Europe. (1998). Handbook of language examinations and examination systems. Cambridge, England: Author.
Association of Language Testers of Europe, & Council of Europe. (1997). Users guide for examiners. Cambridge, England: Association of Language Testers of Europe.
Bachman, L. F. (1990). Fundamental considerations in language testing. Oxford, England: Oxford University Press.
Bachman, L. F., & Palmer, A. (1996). Language testing in practice. Oxford, England: Oxford University Press.
Bartsch, R. (1987). Norms of language: Theoretical and practical aspects. New York: Longman.
Bourdieu, P. (1991). Language and symbolic power. Cambridge, England: Polity.
Council of Europe. (2001). Common European Framework of Reference for Languages: Learning, teaching, assessment. Cambridge, England: Cambridge University Press.
International Language Testing Association. (1995). Task Force on Testing Standards. Melbourne, Australia: Author.
International Language Testing Association. (2000). Code of ethics. Vancouver, Canada: Author.
Joint Committee on Testing Practices. (1988). The Code of Fair Testing Practices in Education. Washington, DC: Author.
Kunnan, A. J. (Ed.). (2000). Fairness and validation in language assessment. Cambridge, England: Cambridge University Press.
Kunnan, A. J. (2004). Test fairness. In M. Milanovic & C. Weir (Eds.), European language testing in a global context (pp. 27-48). Cambridge, England: Cambridge University Press.
Messick, S. (1989). Validity. In R. Linn (Ed.), Educational measurement (3rd ed., pp. 13-113). New York: ACE/Macmillan.
Milanovic, M., & Saville, N. (1993). Principles of good practice for ALTE examinations. Manuscript in preparation.
Ullman-Margalit, E. (1977). The emergence of norms. Oxford, England: Clarendon.

LANGUAGE ASSESSMENT QUARTERLY, 1(2&3), 151-160
Copyright © 2004, Lawrence Erlbaum Associates, Inc.

The Role of a Language Testing Code of Ethics in the Establishment of a Code of Practice

Randy Thrasher

Okinawa Christian Junior College

This article examines the relation between a language testing code of ethics and a code of practice by discussing the development of the Japan Language Testing Association (JLTA) draft Code of Practice. It claims that the relation between the two sorts of codes is not as straightforward as the authors of the International Language Testing Association (ILTA) Code of Ethics seem to assume. But it argues that a code of ethics is useful in deciding to whom a code of practice should apply and in justifying the inclusion of the various elements in the code of practice. One reason the JLTA had for demanding that our Code of Practice apply to both high- and low-stakes testing is rooted in the ILTA Code of Ethics: the demand that all test takers be treated as fairly as possible. It is also claimed that the purpose of a code of practice, in Japan at least, is not to discipline members of our professional organization but to hasten the day when testing practices that treat test takers unfairly are replaced by those that a consensus of professional language testers believe are fairer.

In this article I would like to explore the relation between a code of ethics and a code of practice, using the ILTA Code of Ethics (ILTA Code of Ethics, 2000) and the first draft of a code of practice (JLTA Code of Practice, 2001) being developed by the Japan Language Testing Association (JLTA). I believe that this relation is a crucially important issue for ILTA. As is well known, ILTA had tried for several years to draft a language testing code of practice but was unable to reach agreement. So the decision was made to draft a code of ethics first and later try to establish a code (or codes) of practice. At the Language Testing Research Colloquium (LTRC) 2000, in Vancouver, British Columbia, Canada, the draft ILTA Code of Ethics was presented and accepted by the membership. Soon after, at the direction of then ILTA president Alan Davies, two committees were set up: one under the chairmanship of Lyle Bachman, and the other, set up by the JLTA, with me as chair.

Requests for reprints should be sent to Randy Thrasher, Okinawa Christian Junior College, 111 Onaga, Nishihara-cho, Okinawa 903-0207, Japan. E-mail: [email protected]

RELATION BETWEEN A CODE OF ETHICS AND A CODE OF PRACTICE

It may seem at first glance that the connection between a code of ethics and a code of practice is straightforward: the code of ethics states the broad ethical demands, and the code of practice spells out in practical terms how those ethical demands are to be satisfied in the day-to-day work of language testers. This certainly was the view of the writers of the ILTA Code of Ethics (2000). They stated that

The Code of Ethics is instantiated by the Code of Practice (currently under preparation by ILTA). Although the Code of Ethics focuses on the morals and ideals of the profession, the Code of Practice identifies the minimum requirements for practice in the profession and focuses on the clarification of professional misconduct and unprofessional conduct. (ILTA Code of Ethics, p. 15)

The authors clearly believed that the role of a code of practice was to state the practical implications of a code of ethics. And I must admit that, for a few of the elements in the ILTA Code of Ethics (2000), it could be claimed that there is a need to spell out the demands of the code in practical terms in a code of practice. The first principle is one such element. It states that

Language testers shall have respect for the humanity and dignity of each of their test takers. They shall provide them with the best professional consideration and shall respect all persons' needs, values, and cultures in the provision of their language testing services. (ILTA Code of Ethics, p. 15)

It could be argued that developing tests that are valid and reliable is necessary to "provide the best professional consideration." But it is much harder to see how Principle 5, "Language testers shall continue to develop their professional knowledge, sharing this knowledge with colleagues and other language professionals" (p. 19), could be spelled out in a code of practice. In fact, there seems to be no need to do so. And because the annotation to Principle 4, "Language testers shall not allow the misuse of their professional knowledge or skills, in so far as they are able" (p. 18), indicates that this principle is not concerned with the misuse of test results so much as with prohibiting language testers from participating in unethical or immoral activities or projects, it is also difficult to see how this principle could be instantiated in a code of testing practice.

If it is the case, as I have tried to point out, that there is no simple relation between a code of ethics and a code of practice, we are faced with several possibilities. We can decide that there is no useful relation between the two sorts of codes and go ahead and try to build a code of practice without reference to the ILTA Code o f Ethics (2000). Or we can assume that such a relation exists and begin to search for the ways the two types of codes are connected. It is this second stance that I adopted in trying to develop a code of practice that would fit the situation in Japan. In part this decision was based on my belief that there should be a connection and in part on the great need to justify the code of practice that we were trying to build.

ROLE OF A CODE OF ETHICS

From the JLTA's experience of trying to develop a code of practice, I have found two areas in which the ILTA Code of Ethics (2000) has been important in our work. It has made us take a particular stance both on whom the code should apply to and on how it might be structured. And it has provided us with a way of justifying the elements of the code that we have proposed. Let me deal with each of these in turn.

To Whom a Code of Practice Should Apply

A code of ethics should be universal, at least within the field it is designed to cover. But the creation of a universal code of practice may not be possible. Most of the discussion as to why such a universal code may or may not be possible has focused on cultural differences. This is certainly an issue in Japan. But there is another aspect of the universality question. It can be seen most clearly in the different approaches of the National Council on Measurement in Education (NCME) and the Joint Committee on Testing Practices of the American Psychological Association (APA). The NCME states, "The purpose of the Code of Professional Responsibilities in Educational Measurement (Schmeiser, Geisinger, Johnson-Lewis, Roeber, & Schafer, 1995) is to guide the conduct of NCME members who are involved in any type of assessment activities in education" (p. 1). The APA, however, takes a different approach:

The Code of Fair Testing Practices in Education states the major obligations to test takers of professionals who develop or use educational tests. The Code is meant to apply broadly to the use of tests in education (admissions, educational assessment, educational diagnosis, and student placement). The Code is not designed to cover employment testing, licensure or certification testing, or other types of testing. Although the Code has relevance to many types of educational tests, it is directed primarily at professionally developed tests such as those sold by commercial test publishers or used in formally administered testing programs. The Code is not intended to cover tests made by individual teachers for use in their own classrooms. (Joint Committee on Testing Practices, 1988, p. 9)

Clearly the APA is taking a more narrowly focused approach, and there are good practical reasons for this stance. But this approach raises an ethical issue. The APA stance (Joint Committee on Testing Practices, 1988) gives the impression that only test developers creating what the authors of the code call "professionally developed tests" need worry about building measurement devices that meet the ethical standards set by the association.

The first draft of the JLTA Code of Practice (Thrasher, 2001) attempts to avoid both the problem of appearing to claim that only certain test developers need follow ethical standards and the practical difficulty of producing a code that fits all language testing situations. It does so by stating rules that apply to all language testing and then stating additional rules (or stricter versions of the general rules) to be followed in commercial and other high-stakes testing situations.

There are two reasons for demanding that the code apply to every testing situation. The first is ethical. All test takers, whether facing a high- or low-stakes test, should be treated as fairly as possible. The amount of evidence needed to demonstrate the validity of a classroom quiz will be less than that required of a commercially available examination used to make significant decisions in the lives of the test takers. The degree of rater reliability acceptable in the grading of an in-class essay does not need to be as high as that of the Test of Written English (TWE, published by the Educational Testing Service). But it does not follow from this that classroom teachers should not be concerned about validity and reliability.

The other reason for demanding that the code of practice cover all language testing is a practical one. If we are going to upgrade testing practice (in Japan, at least), we must get all test-writing teachers and the whole test-taking public to understand what good testing practice is.
In Japan, it is usually the case that classroom teachers write not only the tests for their classes but also those used for high-stakes decisions such as entrance to their school. Therefore, we do not have a chance of changing the quality of entrance examinations unless we first develop a better understanding of good testing practice among classroom teachers. We cannot expect them to wear an ethical testing hat when developing entrance exams if they have not already learned to wear that hat in writing the tests they use in their own classrooms. The same argument can be made from the students' point of view. We need to help them understand what good testing practice is and have them begin to demand that the tests they must take reflect such practice. There is little that test takers can do about the high-stakes tests they must take, but they can speak up when classroom tests seem lacking in validity or reliability. The identity of entrance exam writers is a closely guarded secret, but students know whom to complain to about poor classroom tests.

JLTA Draft Code of Practice1

For these reasons I have proposed a six-part structure for the JLTA Code of Practice: (a) basic considerations for good testing practice in all situations, (b) responsibilities of test designers and test writers, (c) obligations of institutions preparing or administering high-stakes exams, (d) obligations of those preparing and administering commercially available exams, (e) responsibilities of users of test results, and (f) special considerations.

The first section contains elements such as the following basic considerations for good testing practice in all situations:

1. The test developer's understanding of just what the test, and each subpart of it, is supposed to measure (its construct) must be clearly stated.
2. All tests, regardless of their purpose or use, must be valid and reliable to the degree necessary to allow the decisions based on their results to be fair to the test takers.
3. Test results must be reported in a way that allows the test takers and other stakeholders to interpret them in a manner that is consistent with their meaning and degree of accuracy.

The second section includes responsibilities of test designers and test writers:

1. A test designer must begin by deciding on the construct to be measured before deciding how that construct is to be operationalized.
2. Once the test tasks have been decided, their specifications should be spelled out in detail.
3. The work of the item writers needs to be edited before the items are pretested. If pretesting is not possible, the items should be analyzed after the test has been administered but before the results are reported. Malfunctioning or misfitting items should not be included in the calculation of individual test takers' reported scores.
Although each element of the ILTA Code of Ethics (2000) begins with "language testers" and thereby seems to be directed only to individual test writers and developers, the third section of the draft JLTA code is aimed specifically at institutions. The reason for laying part of the responsibility for good testing practice on institutions stems from the experience of many JLTA members of being the only testing professional on a committee set up to write a high-stakes test. We realize that in arguments in such committees we need to be able to say that good testing practice is the responsibility not only of the test writers but also of the institutions that develop, administer, and use the test results for important decision making.

1This article presents only examples from the JLTA Code of Practice. The complete draft of this code is available on the JLTA website at http://www.avis.ne.jp/~youichi/COP.html

Obligations of Institutions Preparing or Administering High-Stakes Exams

Responsibilities to test takers and related stakeholders:

Before the test is administered. The institution should provide all potential test takers with adequate information about the nature of the test; the construct (or constructs) the test is attempting to measure (ideally, this should include any evidence and arguments showing that the test tasks in fact measure what they are claimed to measure); the way the test will be graded; and how the results will be reported.

At the time of administration. The institution shall provide facilities for the administration of the test that do not disadvantage any test taker. The administration of the test should be uniform, assuring that all test takers receive the same instructions, time to do the test, and access to any permitted aids. If something occurs that calls into question the uniformity of the administration of the test, the problem should be identified, and any remedial action to be taken to offset the negative impact on the affected test takers should be promptly announced.

At the time of scoring. The institution shall take the steps necessary to see that each test taker's exam paper is graded accurately and the result correctly placed in the database used in the assessment. There should be ongoing quality control checks to assure that the scoring process is working as intended.

Other considerations. If a decision must be made on candidates who did not all take the same test or the same form of a test, care must be taken to assure that the different measures used are in fact comparable. Equivalence must be demonstrated statistically. If more than one form of the test is used, interform reliability estimates should be published as soon as they are available.

The decision to include a section directly addressed to developers of commercially available tests reflects the need to have some influence on the huge language testing business in Japan. It is these tests that set the public's understanding of what a test is, how tests should be administered, and how test results should be reported. Unless the companies producing such tests can be convinced to use good testing practices, the public will continue to believe that the practices these companies presently use are good (or, at least, acceptable) testing procedures.
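The demand above that equivalence across test forms be demonstrated statistically can be illustrated with one classical approach, linear equating, which maps scores from one form onto the scale of another by matching means and standard deviations. The sketch below is only illustrative; the score data are invented, and operational equating would use far larger samples and more sophisticated designs:

```python
import statistics

def linear_equate(y_score, form_x_scores, form_y_scores):
    """Map a Form Y score onto the Form X scale by matching the two
    forms' means and standard deviations (classical linear equating):
    x = mean_X + (sd_X / sd_Y) * (y - mean_Y)."""
    mean_x = statistics.mean(form_x_scores)
    sd_x = statistics.pstdev(form_x_scores)
    mean_y = statistics.mean(form_y_scores)
    sd_y = statistics.pstdev(form_y_scores)
    return mean_x + (sd_x / sd_y) * (y_score - mean_y)

# Hypothetical data: Form Y ran somewhat harder than Form X, so a raw
# 50 on Form Y corresponds to a higher score on the Form X scale.
form_x = [55, 60, 65, 70, 75]
form_y = [45, 50, 55, 60, 65]
print(linear_equate(50, form_x, form_y))
```

The point of the example is the obligation it encodes: a raw score of 50 means different things on the two forms, and a decision maker comparing candidates across forms must apply some such statistically justified transformation rather than compare raw scores directly.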

Obligations of Those Preparing and Administering Commercially Available Exams

In addition to the obligations placed on any test designer and on those preparing high-stakes examinations, developers and sellers of commercially available examinations must:

1. Make a clear statement as to which groups the test is appropriate for and for which groups it is not appropriate.
2. Publish validity and reliability estimates for the test, along with sufficient explanation to allow potential users to decide if the test is suitable in their situation.
3. Report the results in a form that will allow the test users to draw correct inferences from them and make them difficult to misinterpret.
4. Refrain from making any false or misleading claims about the test.

The fifth section of the proposed code deals with the use of test results. The reason for its inclusion is to focus on the importance of using test results properly (as Messick, 1996, has pointed out, the crucially important element in test validity is the inferences drawn on the basis of the test results) and to help test-using institutions both improve their test-based decision making and put pressure on test providers to give their clients the sort of results needed to make meaningful decisions.

Responsibilities of Users of Test Results

Persons who use test results for decision making must:

1. Use results from a test that is sufficiently reliable and valid to allow fair decisions to be made.
2. Make certain that the test construct is relevant to the decision to be made.
3. Clearly understand the limitations of the test results on which they will base their decision.
4. Take into consideration the standard error of measurement (SEM) of the device that provides the data for their decision.
5. Be prepared to explain and provide evidence of the fairness and accuracy of their decision-making process.
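The SEM in item 4 (in classical test theory, the standard error of measurement) quantifies the uncertainty attached to any single observed score. A minimal sketch of how a decision maker might allow for it, assuming the classical formula SEM = SD * sqrt(1 - reliability) and made-up test statistics:

```python
import math

def score_band(observed, sd, reliability, z=1.96):
    """Approximate 95% confidence band around an observed test score,
    using the classical standard error of measurement:
    SEM = SD * sqrt(1 - reliability)."""
    sem = sd * math.sqrt(1 - reliability)
    return observed - z * sem, observed + z * sem

# Hypothetical test: SD = 15, reliability = .91, so SEM = 4.5 points.
low, high = score_band(100, 15, 0.91)
print(f"True score plausibly between {low:.1f} and {high:.1f}")
```

The practical implication for item 4 is that two candidates whose scores differ by less than the width of such a band should not be treated as reliably different, especially near a cut score.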

The last section of the code is an et cetera category that attempts to address issues that did not seem to fit nicely into the other sections.

Special Considerations

In norm-referenced testing. The characteristics of the population on which the test was normed must be reported so that test users can determine if this group is appropriate as a standard to which their test takers can be compared.

In criterion-referenced testing. (1) The appropriateness of the criterion must be confirmed by experts in the area being tested; (2) because correlation is not a suitable way of determining the reliability and validity of criterion-referenced tests, methods appropriate for such test data must be used.

In computer-adaptive testing. (1) The sample sizes must be large enough to assure the stability of the Item Response Theory (IRT) estimates; (2) test takers and other stakeholders must be informed of the rationale of computer-adaptive testing and given advice on test-taking strategies for such tests.
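For the criterion-referenced case, one family of methods appropriate for such test data is decision-consistency statistics: how often two administrations classify the same test taker the same way relative to a mastery cutoff, rather than how highly the two sets of scores correlate. A minimal sketch, with invented scores and an invented cutoff:

```python
def classification_consistency(scores1, scores2, cutoff):
    """Decision-consistency statistics for mastery/non-mastery
    classifications across two administrations of a criterion-referenced
    test: raw proportion agreement (p0) and Cohen's kappa, which
    corrects p0 for the agreement expected by chance."""
    m1 = [s >= cutoff for s in scores1]
    m2 = [s >= cutoff for s in scores2]
    n = len(m1)
    p0 = sum(a == b for a, b in zip(m1, m2)) / n
    # Chance agreement computed from the marginal mastery rates.
    r1, r2 = sum(m1) / n, sum(m2) / n
    pc = r1 * r2 + (1 - r1) * (1 - r2)
    kappa = (p0 - pc) / (1 - pc) if pc < 1 else 0.0
    return p0, kappa

# Hypothetical scores from two administrations, mastery cutoff of 60.
p0, kappa = classification_consistency([80, 55, 90, 40], [78, 60, 92, 35], 60)
print(p0, kappa)
```

The design choice matters: a correlation can be high even when many test takers fall on different sides of the cutoff across administrations, which is exactly the error a mastery decision cares about.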

CODE OF ETHICS AS JUSTIFICATION OF POINTS IN A CODE OF PRACTICE

The motives of the builders of a code of practice can be called into question, just as Beyerstein (quoted in Boyd & Davies, 2001, p. 10) claimed that "codes of ethics exist primarily to make professionals look moral." But there are legitimate motives for trying to establish a code of practice, and these motives are rooted in the ILTA Code of Ethics (2000). It seems to me that we need such a code because present practices treat test takers unfairly. That is, they do not show "respect for the humanity and dignity of each ... test taker" (Principle 1 in the ILTA Code of Ethics, 2000), and what is happening (in Japan, at least) does "advance purposes inimical to ... test takers' interests" (first annotation to Principle 4). The second annotation to Principle 6 indicates the societal responsibility of our profession to work for beneficial change: "Language testers develop and exercise norms on behalf of society" (p. 20). And the third annotation to Principle 7 states the goal of a code of practice: to ensure "that language testing test takers have available to them the best possible testing service" (p. 21). I believe that we can reach such a goal if we put the test taker at the center of our concerns. By demanding ethical treatment of test takers, we can not only justify the rules we put into a code of practice but also move toward the day of making the best possible testing service a reality.

BROADER ISSUES

In the various discussions of the development of a code of good testing practice, three major questions have emerged. One is the practical issue of what should be included in such a code. Another is how to enforce such a code, and a third concerns the possibility of a universal code of practice. This last issue is one that I feel strongly about, but I must admit I am not in a position to do more than state my belief that a universal code should be possible. However, I believe that my experience in helping develop a code of practice for the JLTA (Thrasher, 2001) allows me to say a bit more about the first two issues.

The issue of what to include in a code of good testing practice is often seen as the question, Can language testing professionals agree on what should go into such a code? Because my experience in the field dictates a "no" answer to this question, I believe we have to settle for broad consensus rather than 100% agreement. And we must also give up the idea that a code can be produced that we will be able to set in stone. A code of practice will always be a work in progress. The code will have to be revised with the deepening of our knowledge of testing, developments in statistics, and our experience with the use of the code as originally stated. The authors of the ILTA Code of Ethics (2000) made the same point for their code, and a code of practice will probably need to be updated even more often than a code of ethics.

The issue of the enforcement of a code of practice is inevitably linked to what one sees as the purpose or function of such a code. Some seem to view a code of practice as similar to a legal code and expect that violators can be punished for unprofessional conduct.
It is impossible to tell for sure, but it appears to me that this sort of thinking may be behind the words in the ILTA Code of Ethics: "While the Code of Ethics focuses on the morals and ideals of the profession, the Code of Practice identifies the minimum requirements for practice in the profession and focuses on the clarification of professional misconduct and unprofessional conduct" (2000, p. 15). I am not convinced that this should be the proper role of a code of practice in our field, but even those who would like to have a code that could be used in this way must admit that it is not possible in our present situation. Language testers are not licensed, and associations such as ILTA and the JLTA, as they are presently constituted, have few if any procedures for disciplining members. Probably the most important reason a code of practice cannot serve as a means of disciplining test developers and users is that a large number of the developers and the bulk of the users are not members of any professional testing association. This leaves us with no alternative but to present the code of practice as the consensus view of professional language testers and urge test developers and users to follow it.

CONCLUSION

The attempt by the JLTA (Thrasher, 2001) to draft a code of good testing practice based on the ILTA Code of Ethics (2000) has raised important issues concerning the connection between such codes. I believe that it has shown that a code of practice is not simply the spelling out of each principle of the code of ethics. But there is, I have argued, a strong and necessary connection between the two sorts of codes. In the case of the JLTA Code of Practice (2001), the ILTA Code of Ethics has provided both the rationale for including the various points of the code and the moral ground for demanding that the code of practice apply to all language testing, not just high-stakes exams. But I have also pointed out that the development of the JLTA Code of Practice, and considerations of how it can be used in this society, argue against the view reflected in two statements in the prologue to the ILTA Code of Ethics indicating that failure to live up to the letter of such codes should result in what are called "serious penalties" (p. 14). Our experience in Japan has pushed us to take the position that the role of both the code of ethics and, particularly, the code of practice should be primarily educational: an indication that a testing practice has the seal of approval of language testing professionals. I believe that it would be wise for ILTA to monitor the level of acceptance of the JLTA Code of Practice to see if some statements in the prologue of the ILTA Code of Ethics might need to be modified.

REFERENCES

Boyd, K., & Davies, A. (2001). Doctors' orders for language testers: The origin and purposes of ethical codes. Language Testing, 19, 296-322.
ILTA Code of Ethics. (2000). Language Testing Update, 27, 14-22.
Japan Language Testing Association. (2003, January 20). JLTA Code of Practice. Retrieved January 28, 2004, from http://www.avis.ne.jp/~youichi/COP.html
Joint Committee on Testing Practices. (1988). Code of Fair Testing Practices in Education. Washington, DC: Author.
Messick, S. (1996). Validity and washback in language testing. Language Testing, 13, 241-256.
Schmeiser, C., Geisinger, K., Johnson-Lewis, S., Roeber, E., & Schafer, W. (1995). Code of Professional Responsibilities in Educational Measurement. Washington, DC: National Council on Measurement in Education.
Thrasher, R. (2001, October). Ideas for a JLTA code of testing practice. Paper presented at the Japan Language Testing Association annual meeting, Tokyo, Japan.

LANGUAGE ASSESSMENT QUARTERLY, 1(2&3), 161-176
Copyright © 2004, Lawrence Erlbaum Associates, Inc.

Using the Modern Language Aptitude Test to Identify a Foreign Language Learning Disability: Is it Ethical?1

Daniel J. Reed

Indiana University and Second Language Testing, Inc.

Charles W. Stansfield

Second Language Testing, Inc.

In this article we identify and discuss five ethical concerns related to the use of the Modern Language Aptitude Test (MLAT) in the diagnosis of foreign language learning disability (FLLD). In the introduction, we provide some background information on the MLAT, describe its use in learning disability (LD) assessment, and introduce the 5 ethical concerns that we have identified. We also elaborate on the definitions of language aptitude and learning disability to facilitate the subsequent discussion of the ethical issues. In the second part of the article, we examine the 5 ethical concerns in some detail. In the process, we observe that resolution of these types of problems is a complex matter that requires the understanding of viewpoints in a surprisingly large number of specialized areas, including FL education, LDs, clinical neurology, school psychology, and law. In our conclusions, after summarizing our findings related to the 5 ethical issues, we suggest that use of the MLAT in the identification of individuals with a FLLD is ethical if adequate safeguards are in place and if relevant professionals approach the task in a discerning way.

Requests for reprints should be sent to Charles Stansfield, Second Language Testing Inc., 10713 Mist Haven Terrace, N. Bethesda, MD 20852. E-mail: [email protected]

1In this article, we accept a priori the existence of individual differences in aptitude and therefore in the rate of language learning and in the probability of success at language study. Indeed, the identification of individual differences is the basis for nearly all measurement, including measurement of achievement, proficiency, personality, attitudes, and aptitudes. If a special aptitude for language learning exists, we believe it is worthwhile to construct a test to measure it. Misuses of any test are possible, and some potential misuses of the MLAT are explored here. However, the potential for misuse does not make it unethical to develop a test or to publish it. Prohibiting the development of a test because someone might misuse it would constitute a form of prior restraint. It would place an unreasonable restriction on a number of basic freedoms, including freedom of research, freedom of publication, and freedom of teaching. In addition, the benefits of the test would be lost to those who would benefit from the information it provides. Although the use of tests is at the center of discussions of ethics in testing, ethical concerns about test use do not negate the presence of individual differences. If individual differences are present, then it is the task of measurement specialists to measure them accurately, and it is incumbent on test users to learn how to interpret scores appropriately. It is the failure to do so that is unethical.

The MLAT is a well-known, commercially available assessment instrument that measures a person’s probable ability to learn a foreign language (FL). Harvard psychologist John Carroll and linguist Stanley Sapon developed it during a 5-year research project in the 1950s. The MLAT is based on experimental tests tried out on about 5,000 persons. It was normed on about 2,900 students in U.S. high school, college, and government language programs. Some of the results of the field test and norming administrations are reported in Carroll (1981) and in the MLAT manual (Carroll & Sapon, 2002). The general conclusion from the data gathered during the development process was that the MLAT is a potentially very useful instrument for predicting success at FL learning, especially if other relevant considerations, such as motivation, attitude, effort, and quality of instruction, are taken into account. For more than 40 years, the main uses of the MLAT have been in the selection of students for FL study and in the placement of students in programs that have curricular options such as streaming and matching. These uses are described in some detail in the MLAT manual (Carroll & Sapon, 2002). Psychologists recognized the potential diagnostic value of the test early on because different parts of the test deal with different aspects of language ability, such as phonetics, phonology, grammar, and rote memory. Guidance counselors have used scores from different parts of the test in building profiles of students’ strengths and weaknesses and in advising them how best to attack their curriculum (cf. Ehrman, 1996; Wesche, 1981). One practical way to use MLAT scores is to construct “expectancy tables” based on locally established norms, an approach advocated in the MLAT manual (Carroll & Sapon, 2002, pp. 15-16). Table 1, taken from the MLAT manual, illustrates one form of expectancy table. 
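Mechanically, an expectancy table of this kind is a cross-tabulation of locally chosen score bands against course-grade outcomes, converted to percentages. A minimal sketch in Python: the score bands follow Table 1, but the student records here are invented for illustration and are not data from the MLAT manual:

```python
from collections import Counter, defaultdict

def expectancy_table(records, bands):
    """Cross-tabulate raw-score bands against course grades.

    records: (raw_score, grade) pairs from a local norming sample.
    bands: (label, low, high) score intervals.
    Returns {band_label: {grade: percent receiving that grade}}.
    """
    counts = defaultdict(Counter)
    for score, grade in records:
        for label, low, high in bands:
            if low <= score <= high:
                counts[label][grade] += 1
                break
    return {label: {g: round(100 * c / sum(gc.values()))
                    for g, c in gc.items()}
            for label, gc in counts.items()}

# Hypothetical local records: (MLAT raw score, first-year grade).
records = [(105, "A&B"), (110, "A&B"), (92, "C"), (75, "D&F"),
           (64, "D&F"), (68, "D&F")]
bands = [("100+", 100, 999), ("88-99", 88, 99),
         ("70-87", 70, 87), ("69 and below", 0, 69)]
print(expectancy_table(records, bands))
```

Because the percentages come from local data, the resulting table answers the locally relevant question: of students who scored in this band here, what grades did they actually go on to earn?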
Although the number of cases in this table is small, it is clear that students scoring 69 and below have virtually no chance of earning a better grade than D or F in first-year Spanish. Two related questions that are addressed in this article are as follows: Is it ethical to force children to take a FL course knowing that they most likely will fail it? Also, is it ethical to fail to assess FL aptitude when diagnosing foreign language learning disability (FLLD)? Relatively recently, there has been an increase in the use of the MLAT in the diagnosis of individuals who claim to have a learning disability (LD) that severely impairs their learning of FLs. Contingent on the outcome of such a diagnosis is eligibility for accommodations in FL courses and, in some cases, outright waivers from the FL requirement. Ethical concerns regarding this high-stakes use of the

USING THE MODERN LANGUAGE APTITUDE TEST

TABLE 1a
Illustrative Expectancy Table Showing Relation Between MLAT Total Scoresb and Course Grades

Boys (N = 38) (r = .78)
                          % Receiving Each Grade
Raw Score        N     A & B      C      D & F
100+            11        73     27          0
88-99            7        29     57         14
70-87           12        17     42         42
69 and below     8         0      0        100

Girls (N = 32) (r = .73)
                          % Receiving Each Grade
Raw Score        N     A & B      C      D & F
100+            17        82     12          6
88-99            3         0     33         67
70-87            8        25     38         38
69 and below     4         0      0        100

Note. Criterion: First-year grades in Spanish, Grade 11.
aAppears as TABLE 9 in the MLAT manual. bMLAT administered at beginning of course.

MLAT are the subject of this article. We have identified five associated ethical questions:

1. Is it ethical to grant some students a waiver of the FL requirement, although others are forced to take a FL against their will?
2. Is it ethical to use the test to determine whether a FL requirement waiver should be granted when a student can intentionally fail the test?
3. Is it ethical to force students with a FLLD to take and fail a FL one or more times before granting a waiver?
4. Is it ethical to exclude a measure of language aptitude when a student is being evaluated for learning disabilities, particularly if the student is petitioning specifically for an exemption from a FL requirement?
5. How can the test publisher market and sell the test in an ethical way, knowing that its use is open to misuse, or even abuse?

Central to a discussion of these questions is an understanding of two terms that appear in them. One such term is language aptitude. Carroll (1981) explained the idea of language aptitude with several interesting statements. He stated that

In approaching a particular learning task or program, the individual may be thought of as possessing some current state of capability of learning that task. ... [and] ...

REED AND STANSFIELD

That capability is presumed to depend on some combination of more or less enduring characteristics of the individual. (p. 84)

Thus, aptitude as conceived by Carroll “does not include motivation or interest; these latter aspects have to be separately evaluated” (p. 84). In his writings Carroll stated that language aptitude is “relatively fixed” and “relatively hard to modify.” Furthermore, he suggested that foreign language aptitude is not exactly the same as what is commonly called “intelligence,” not even “verbal intelligence,” for foreign language aptitude measures do not share the same patterns of correlations with foreign language achievement as intelligence and academic ability measures have. (p. 86)

In addition, Carroll emphasized that “people differ widely in their capacity to learn foreign languages easily and rapidly” (p. 97) and that “aptitude should be defined in terms of prediction of rate of learning” (p. 91). On the basis of extensive correlational and factor analyses, Carroll (1981) proposed that FL aptitude consisted of four components: phonetic coding ability, grammatical sensitivity, rote memory, and inductive language learning ability. The first three of these components are tapped by the MLAT. Extensive norming and validity data for the test are presented in the MLAT manual (Carroll & Sapon, 2002) based on administrations of the test to roughly 2,900 students in U.S. high schools, colleges, and universities. Carroll (1981) summarized the validity data as follows: The predictive validity coefficients for foreign language aptitude batteries in representative samples are typically in the range .40 to .60 against suitable criterion measures of success in foreign language attainment, such as final course grades, objective foreign language attainment tests, or instructors’ estimates of language learning ability. ... It can be said, in fact, that foreign language success is more easily and better predicted, on the basis of aptitude test scores, than most other types of achievement. (p. 96)
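The validity coefficients Carroll cites are Pearson correlations between aptitude scores and a criterion measure such as final course grades. As a rough illustration (the data below are invented for demonstration and are not Carroll's), such a coefficient can be computed directly:

```python
import statistics

def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    mx, my = statistics.mean(x), statistics.mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

# Hypothetical data: MLAT raw scores and final course grades on a 0-4 scale.
mlat = [55, 62, 70, 78, 84, 90, 97, 105]
gpa = [1.0, 2.0, 1.7, 2.3, 3.0, 2.7, 3.3, 4.0]

r = pearson_r(mlat, gpa)  # the predictive validity coefficient for this sample
```

In operational use the criterion would be gathered at the end of the course for a full norming sample, and the resulting coefficient interpreted against the .40 to .60 range Carroll reports.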

Furthermore, although some may think of the MLAT as related to the “audio-lingual” method of language training, Ehrman (1998), in a study involving about 1,000 students in government language programs, showed that the level of validity was about the same whether the method of language learning was “audio-lingual” or one of the more contemporary and communicative approaches.

Another term that is important to understand is learning disability. We have somewhat less to say about this notion, partly because we are not experts in the field of learning disabilities or special education, and partly because there is a lack of consensus about many aspects of the notion, including how many learning disabilities exist, what their basis is, and how their presence in an individual can be properly identified and diagnosed. However, according to Kavale (1993), there is at least a fair amount of agreement on some components of learning disabilities, “one of which is the presence of academic deficits (e.g., in reading, writing, math), which are the most overt manifestations of underlying information-processing problems” (p. 520). Cohen (1983) stated that “it is a diagnosis primarily made by exclusion and discrepancy” and explained that

Most clinicians and researchers accept the notion that there are neuropsychologically based cognitive deficits that interfere with learning. To call such a deficit a learning disability, it is essential to determine that the following factors are not causing the cognitive deficit(s): psychological conflict, mental retardation, inadequate educational opportunities, environmental disadvantage, sensory impairment, or neurological disease. (p. 178)

From our own perspective, it is important not to rush to judgment about the presence of a learning disability when a student fails to achieve as expected. The failure could also be due to a lack of interest in the subject or even a negative attitude toward the subject, a lack of motivation, personal or financial problems, dislike of the teacher, study habits unsuited to the discipline, or a variety of other causes. Thus, the documentation of a learning disability can and should be an extensive process that incorporates results from tests of general intelligence, academic achievement, and processing skills as well as a comprehensive personal history that covers a wide range of background information that might be relevant to the clinical assessment of an individual. When we look at samples of university policies in the next section, we note that a measure of language aptitude is sometimes included in this process, and sometimes it is not.

FIVE ETHICAL QUESTIONS REGARDING THE USE OF THE MLAT IN THE DIAGNOSIS OF FOREIGN LANGUAGE DISABILITY

Is It Ethical to Grant Some Students a Waiver of the FL Requirement, Although Others Are Forced to Take a FL Against Their Will?

This ethical problem is created by the very idea of granting an accommodation, a substitute, or a waiver of the FL requirement by an institution of higher education (IHE). An accommodation is any special service or assistance that the institution may provide to the student who has a FLLD. Scott and Manglitz (2000) described the kinds of accommodation that may be provided. These include a special section of a FL that covers one semester of course material during two semesters, providing for self-paced instruction and for special tutoring to those students with a FLLD. Students with problems hearing or processing spoken language can be counseled to take Latin, or they can be offered a section that de-emphasizes the spoken language. A substitute course or courses may be allowed if the IHE does not provide adequate accommodations to effectively address the student’s educational needs. Some of the benefits gained from FL study may also be gained from the substitute courses. A waiver is simply the exemption of the student from meeting the FL requirement.

From an ethical perspective, the awarding of a substitute or a waiver may sometimes be discriminatory because it means that all students are not held to the same requirements, and not all get to benefit equally from the IHE’s academic requirements. Substitutes or waivers may not be in the student’s interest because the student will lose some or all of the benefits of FL study. Accommodations ensure the inclusion of all students in the full academic program, including its requirements, and they ensure that all students derive its benefits. However, many students who receive the substitute or waiver do not complain about not being included because most would rather receive a substitute or waiver than be forced to meet the academic requirement.

In the United States, an answer to this ethical concern is found in federal and state law. Title 5, Section 504 of the Rehabilitation Act of 1973 (Sparks & Javorsky, 2000) requires institutions of higher education to address the needs of applicants and students with disabilities. The legislation goes beyond physical disabilities.
For instance, regarding academic requirements it says, “an institution shall make such modifications to its academic requirements as are necessary to ensure that such requirements do not discriminate or have the effect of discriminating, on the basis of handicap, against a qualified handicapped applicant or student.” A learning disability may be considered a handicap. The federal government defines a learning disability as a disorder in one or more of the basic psychological processes involved in understanding or in using language, spoken or written, that may manifest itself in an imperfect ability to listen, think, speak, read, write, spell, or to do mathematical calculations, including conditions such as perceptual disabilities, brain injury, minimal brain dysfunction, dyslexia, and developmental aphasia. (U.S. Office of Education, 1977, p. 65083)

As can be seen, many of the skills and processes included in the definition relate to language. As a language aptitude measure, the MLAT claims to measure the examinee’s ability to learn and process language. Although the MLAT does not purport that only certain individuals can learn a language, it does claim that differences in language learning success are generally linked to differences in language learning aptitude. Prior to 1990, few universities interpreted the Rehabilitation Act of 1973 as requiring them to provide special accommodations to students who claim a FLLD.

This is because the 1973 act only applied to programs in institutions of higher education that received direct federal support. However, the Americans With Disabilities Act (ADA) of 1990 extended the impact of this earlier act by elevating the requirement for accommodations to the level of a civil right, at least in the workplace and in workplace training, and even a general education has been liberally interpreted by some as training for the workplace. The Individuals With Disabilities Education Act (IDEA) of 1997 further reinforced the ADA. IDEA provides federal support to help institutions meet the expenses incurred in providing accommodations for students ages 3 to 21 who fall into any of 13 categories of disability. One of these categories is “specific learning disabilities.” Although this support goes mainly to elementary and secondary schools, students who have received it are rapidly entering universities, where they are demanding similar accommodations. Parents and advocates of students with learning disabilities tend to believe that college students with a learning disability should enjoy the same rights to adequate and appropriate accommodations as they did while enrolled in school at the secondary level. This federal law, combined with the growing demands of students with disabilities and their advocates, who are often well educated, wealthy, and politically powerful, provides an answer to the first ethical question. Secondary-level students with bona fide and verifiable learning disabilities have a legal right to an appropriate accommodation. At the tertiary level, the same conclusion is held by many, but not all. As a result, if an appropriate accommodation cannot be provided, then the IHE may grant the student permission to take an alternate or substitute course or courses, or the IHE may grant a waiver. The legal requirements, combined with the threat of legal action, remove the concern from the realm of ethics.
Ethics and law are widely recognized as two different realms. For example, courts do not concern themselves with ethics, only with determining if an individual or an organization has complied with the law. So, what at first glance may appear to be a legitimate ethical question has been trumped by legal requirements and by the fear of legal action. Yet in spite of the effect of legal requirements and the fear of litigation, there are alternatives to course substitutions and outright waivers.

Is It Ethical to Use the Test to Determine Whether a FL Requirement Waiver Should Be Granted When a Student Can Intentionally Fail the Test?

It is axiomatic in testing that to obtain the best possible estimate of the examinee’s true score or ability level, the examinee must give his or her best effort. However, when students hope to be classified as FLLD, they may be motivated to do poorly on the test.


A number of safeguards may be employed to avoid this problem. Goodman, Freed, and McManus (1990) suggested

One means of securing full cooperation is to inform students before the test that because the examiner is searching for a particular profile of strengths and weaknesses unknown to the examinee, the most likely means of revealing a disability is to do one’s best. (p. 133)

Because the student does not know what profile the examiner is looking for, he or she may be fearful of intentionally choosing the wrong answers. In fact, the search for patterns, or consistency across measures, is a common strategy of psychologists administering the test. For example, at the University of Pennsylvania the MLAT is used but not required. One pattern an LD specialist in a disabilities services office might look for is difficulty understanding spoken language, combined with evidence of auditory processing weakness as revealed by the Woodcock-Johnson Psycho-Educational Battery. Low scores on Parts 1 and 2 of the MLAT might reinforce this pattern. An important preventative measure is the interview by a counselor or psychologist, which might occur either before or after the MLAT is administered. Any possible motivation to intentionally fail might be revealed to the counselor. The counselor could even ask the student directly if he or she did his or her best on the test. In addition, the counselor can obtain the relevant language learning history on the student, including the names of the FL teachers that he or she has had. The counselor can then contact them for a written or verbal evaluation of the student’s attitude, effort, study habits, and evidence of any learning disability. Another preventive measure might be to remind the student either before the test or during the interview of the benefits of FL study. These are generally considered to be the development of some proficiency in a FL, an understanding of the history, geography, and culture of the people who speak that language, a better understanding of the grammar of one’s own language, and an increased ability to acquire a FL in the future. Other possible advantages, which examinees could be reminded of, may include an edge in the job market or in getting into a good graduate program. 
A long-term solution to the problem of students intentionally failing the MLAT might involve “early detection.” A language aptitude measure such as the MLAT-Elementary, a version of the MLAT for children in grades 3 to 6, could be given routinely to children who exhibit language-learning difficulties while in elementary school. Because language aptitude is supposedly fairly stable throughout one’s life, a measurement obtained early would be, in some sense, good for life, although more realistically it would provide corroborating evidence for a low MLAT score earned as an adult.


Before being considered for a waiver, most schools require that the student be able to document a history of failures in FL study, with no successes (although performing adequately in other academic areas). Goodman et al. (1990) reported that at the University of Pennsylvania

All students without a documented history of prior learning disability must pursue the foreign language requirement (four courses and a proficiency exam). Only after failure in a course and on the recommendation of the instructor are students eligible for foreign language exemption review. (p. 140)

The intent of this policy leads us directly into ethical concern number 3.

Is It Ethical to Force Students With a FLLD to Take and Fail a FL One or More Times Before Granting a Waiver?

Schwarz (1997) and others have implied that some LD students are being treated unfairly by being required to try, and fail, language courses before being considered for a waiver. This is an especially common concern at universities, which do not have to waive the FL requirement or allow course substitutions if the faculty body that sets academic standards considers the requirement to be an essential part of the academic curriculum (Guckenberger et al., 1996). Complicating this issue is a lack of confidence in the whole process of diagnosing LDs. Recall that one reason that students are forced to try a language course is to corroborate the initial diagnosis by demonstrating failure. At Yale, the MLAT is considered “one more perspective” (Yale, personal communication, April 22, 2002), but not necessarily a central piece of the diagnosis. A low MLAT score alone is not considered strong evidence of a FLLD. Sparks and Javorsky (2000) strongly voiced their opinion that “such uses of the MLAT are psychometrically indefensible” (p. 647). We agree that the MLAT should not be used in absolute isolation, but the other factors to consider do not necessarily have to include demonstration of multiple failures. There are many sources of evidence that can be brought to bear on each case. For example, the “Learning Disability Documentation Guidelines” at the Columbia University Office of Disability Services are quite detailed in requiring standardized scores to document aptitude (i.e., a “complete intellectual assessment is required ... preferably utilizing the Wechsler Adult Intelligence Scale-Revised”; Columbia University Office of Disability Services, 2004), developed abilities (e.g., the Woodcock-Johnson Psycho-Educational Battery Revised), processing skills, a detailed clinical summary, and the MLAT.
Other schools, such as Dartmouth, Yale, and the University of Pennsylvania, have similarly detailed policies. Finally, the construction of a personal history of relevant information, including statements by former FL teachers, can also obviate the need to force a college student to take and fail a FL one or more times.


Pending developments that might eventually engender greater confidence in language aptitude tests and in the overall diagnosis of FLLD, schools might consider the deeper intent of their FL requirements. The University of Wisconsin-Madison seems to have a sensible solution in offering a “foreign language substitution package ... [that] ... is designed to fulfill the faculty’s intention in requiring foreign language as a part of the college curriculum. Specifically, the courses in the Foreign Language Substitution Package provide—as does the foreign language requirement—students information about the structure of a language as well as the literature and culture of the people using that language” (University of Wisconsin-Madison, Undergraduate Catalogue 1999-2001).

In any event, the previous discussion presents several solutions to ethical concern number 3. Yet the important issue of valid assessment remains, and that brings us to ethical concern number 4.

Is It Ethical to Exclude a Measure of Language Aptitude When a Student Is Being Evaluated for LDs, Particularly If the Student Is Petitioning Specifically for an Exemption From a FL Requirement?

Ethical concern number 4 confronts the validity issue. First, we must sort out the confusion that exists between other learning disabilities that impair FL learning, such as auditory processing weakness (e.g., as measured by the Woodcock-Johnson Psycho-Educational Battery) or dyslexia, and what might be thought of as a FLLD in its own right, one related to Carroll’s notion of language aptitude. In the case of the former disabilities, it is questionable whether it is appropriate to give the MLAT, which was designed to assess FL aptitude, not other, broader kinds of aptitude or disability. However, clearly the case has been made that there is a special cognitive basis to the language aptitude construct, and therefore a special FLLD almost certainly does exist. If that reasoning is accepted, then it follows that a measure of language aptitude is crucial in the diagnosis of a FLLD. One could even argue that it is unethical to assess FLLD without using an accepted measure of FL learning aptitude. Intelligence measures and other kinds of cognitive measures would not be good substitutes, although they might be used as additional evidence to provide a more thorough diagnostic evaluation. A summary statement in the MLAT manual (Carroll & Sapon, 2002) reads, “Research on the MLAT has several times made it possible to compare the validity of the MLAT with intelligence tests or with other predictors of success in foreign language learning. The comparison has nearly always favored the MLAT” (p. 25). Data reported by Sparks and Javorsky (2000) are also consistent with the proposal that there is a specific disability related to FL learning. They reported that not
all students with the LD label have difficulty in FL classes, and somewhat conversely, not all students who have trouble in FL classes have a learning disability (but they do tend to have lower native language skills and lower MLAT scores; see Sparks & Javorsky, 2000, p. 646). This proposal, that there is a disability specific to FL learning, could potentially be countered by the suggestion that because language aptitude is relative (i.e., people learn at different rates), a low aptitude score does not really represent an inability to learn in the sense of real disability, which is characterized by an impairment so severe that it blocks learning. In fact, a quote from the MLAT manual itself might be cited as support for that argument: The MLAT does not claim to say whether an individual has a “language block” or some inherited disposition or trait which will prevent him or her from learning a foreign language. As far as is known, any individual who is able to use his mother tongue in the ordinary affairs of everyday life can also acquire similar competence in a second language, given time and opportunity. (Carroll & Sapon, 2002, p. 23)

The previous paragraph is an argument against the idea of a FLLD. However, the fact that language aptitude is a relative concept does not really distinguish it from other disabilities. People who are legally blind are not necessarily completely blind. People who are hearing impaired are not necessarily completely deaf. Actually, the suggestion that everyone can learn given enough time has pretty clear implications both for accommodations (i.e., give them more time and support and pace the instruction) and for substitutions and waivers (i.e., if it’s not possible to provide additional time or support, then grant the waiver or a substitution package). For several years now, the Public Service Commission of Canada has used the MLAT to place French language learners into “streams” that progress at various rates (Wesche, 1981). It appears that all students can succeed if placed in appropriate tracks. Obviously, that is a fairly ideal situation. At many schools the curriculum is more rigid; therefore, counselors and advisors should take the “speed” of the curriculum into account when deciding whether to grant accommodations, substitutions, or waivers. For example, at a highly competitive school, such as Yale, language classes proceed at a very rapid pace. It therefore would not seem reasonable to force a student who scored at the 5th percentile (i.e., in the bottom 5% relative to other students in the United States) on the MLAT to try to keep up with FL study at Yale. In sum, the concept of FLLD needs to take into account both a student’s ability to learn and the typical pace of the instruction that the learner will be offered. Furthermore, although there are still questions about the appropriateness of using a language aptitude test to identify a FLLD, so too are there serious doubts about failing to administer such a measure. There is a clear need for additional research

on FLLD and for dissemination of the results. We suggest that the test publisher has a role to play here, which brings us to our final ethical concern.

How Can the Test Publisher Market and Sell the Test in an Ethical Way, Knowing That Its Use Is Open to Misuse, or Even Abuse?

Most tests, particularly off-the-shelf tests, are open to abuse to varying degrees. When it comes to the identification of a disability for purposes of obtaining a waiver, the MLAT is no different from other tests used in the LD arena. We believe that the publisher of a test used for this purpose must be sensitive to the ethical dilemma that the use presents and that the publisher must be proactive in addressing these ethical concerns. If the publisher acts accordingly, then the ethical burden is alleviated. Here are some of the measures taken by Second Language Testing, Inc. (SLTI) to prevent misuse and to ensure the most appropriate use of the MLAT.

Test security. When an off-the-shelf test is used for moderate- or high-stakes purposes, test security becomes a major concern. Not only will professionals want to purchase it, but examinees will too, to learn how to score high or low on it, depending on what their goal may be. For the publisher, there are far more opportunities to sell to examinees than to professionals who want to make a legitimate use of the test. In the case of the MLAT, we require all purchasers to identify their professional affiliation and to sign a pledge that they will maintain test security and use the test in accordance with the American Psychological Association standards. Whenever we are uncertain about a new purchaser, we e-mail or call the purchaser requesting information on why they want to purchase the test. We have returned several checks as well as payments made over the Internet because it became clear that the person purchasing the test was an examinee who wanted to “prepare” for its formal administration. Although we believe that examinees have a right to basic information about a test they are about to take, we do not believe that they have a right to study the test prior to its administration. Therefore, we have placed a sample test consisting of 25 items on our Web site so that examinees wanting information on the test can see the nature of the items and sections. As a result, they have no legitimate excuse or need to purchase the actual test.

Research on test use. The publisher of a test whose use is subject to abuse carries a special burden to keep up with what users are doing to identify and prevent inappropriate or unethical use of the test. Even though the MLAT is a very small testing program, we periodically survey the users to better understand how they use the test, the other measures they use with it, and the actions that are taken as a result of performance on the test. We maintain a file on each user in which we
describe what we know about how they use the test. We periodically review these files to synthesize information, and we sometimes call users and interview them over the phone to learn what they are doing and to engage them in a discussion of good test interpretation and practice.

Research on legal requirements. Not infrequently, we receive phone calls from potential test users about legal requirements, which we attempt to answer objectively and to the best of our ability. To better counsel potential test users, we study the literature on legal requirements and the guidance published by government agencies. We find this literature hard to understand, as do the test users, for many reasons. For example, terminology is not used consistently, tests cited are not always familiar or explained, and diagnostic procedures are not uniform or are not always explicit. The lack of clarity and specificity in the legislation is one reason why there is considerable disagreement over what should be done after a probable FLLD is identified. Also, the guidance published by government agencies and LD advocates may be somewhat in conflict.

Information dissemination. We believe that a publisher of a test in the LD arena has a special obligation to educate test users about the issues surrounding the uses of the test. We also believe that a test publisher should keep up with research on the test being done in the field. For these reasons, we scan books and research journals to identify relevant studies and reviews of the literature. We use this information to revise the test manual periodically. SLTI republished the MLAT in 1999. Then in 2000 we published our first revision of the manual. The most recent revision was published in 2002 (Carroll & Sapon, 2002). When a new manual is published, we put this information on our Web site so that those with the earlier edition can obtain the most current edition. We also give serious, objective presentations at conferences and professional meetings. We hope these presentations not only inform test users but also stimulate a discussion of appropriate and inappropriate uses of the test in the field. We keep the Web site updated as well, so that users can go there for the latest information on special issues and concerns.
Multiple measures. As previously indicated, we support the use of multiple measures to construct a diagnostic profile of the examinee, rather than reliance on the MLAT as the sole indicator of a disability. In addition to aptitude tests, these should include a personal history and letters from teachers.

Alternatives. We support the use of accommodations and substitution packages over outright waivers of the FL requirement.

Qualifications. We try to emphasize the need for qualified individuals to be involved in the interpretation of test scores, and for those involved in deciding what
to do with test scores to be familiar with the literature on the test and on the construct it measures, so that they will make informed decisions regarding examinees. In the LD field, it is very important that the evaluation of the student be complete and that the interpretations and consequences of the scores be the result of reasonable, rational, and discerning thought by relevant professionals.

CONCLUSIONS

In considering the use of the MLAT to identify a FLLD, we posed five questions that relate the test to ethical concerns. In our approach to each concern, we have attempted to explain the issues surrounding the concern and then reach some resolution pertaining to the test, its publisher, the examinee, the test administrator, and the test score user who may be involved in setting policy and making individual recommendations. Regarding the first question, we noted that ethics and law are different realms. Therefore, what is legal is not always what is ethical, yet for most users what is legal weighs far more heavily than what is ethical. Regarding the concern that students can deliberately fail the test, we described strategies and procedures that would lessen the likelihood that this will happen and counteract the effect if it did happen. Particularly important are a face-to-face interview and a personal history of problems with FL courses and language-related tasks. A long-term preventative measure is the early detection of a FLLD through the administration of a FL learning aptitude test, such as the MLAT-Elementary or the Pimsleur Language Aptitude Battery, at different points during a student’s educational career. This provides a means of constructing a history of low scores on aptitude measures over a long period of time. Such a history could greatly improve the credibility of decisions made about individuals who claim a FLLD. Forcing a student to take and fail a FL course one or more times is criticized by advocates for the learning disabled, yet required by policymakers at many institutions. We noted that there are effective alternatives to this policy, including the use of accommodations and substitutions. We have included considerable discussion of validity in this article because validity is central to the discussion of appropriate use of a test.
In that sense, we suggest that it may often be unethical to evaluate whether a student has a FLLD without administering a FL aptitude measure. Without the inclusion of such a measure in the diagnosis, some non-FLLD students may be falsely identified as FLLD, and therefore exempted from the benefits of pursuing the FL, whereas some truly FLLD students will be denied the accommodations, substitutions, or waivers they desperately need to graduate. Because the stakes are high, the measures used must be valid.

USING THE MODERN LANGUAGE APTITUDE TEST

Perhaps the toughest ethical question asks whether a publisher should publish the MLAT knowing that its use is open to abuse. Here we outlined a variety of measures that help ensure the validity of scores and the appropriate application and interpretation of the instrument. Among these are serious efforts to maintain test security, to stay abreast of research on the test as well as of legal requirements, and to disseminate objective information on the test through multiple means. We also strongly encourage the use of multiple measures or indicators along with the MLAT, the use of alternatives to waivers, and careful decision making by qualified and relevant professionals.

We note that our goal of informing good test use is not easy to meet, as we are dealing with an interdisciplinary matter that involves many specialized domains. We need to know enough about the related areas to communicate with a wide range of professionals—guidance counselors, school psychologists, clinical neurologists, school administrators, FL educators, parents, and students. We encourage additional research on language aptitude, as well as on its relation to other factors that influence language learning. We also encourage the development of other language aptitude measures.

In closing, we believe that the notion of a FLLD is defensible, and that the MLAT contributes essential information to its diagnosis. We further believe that, although research and validation efforts should be ongoing, if the individual diagnosis is comprehensive and discerning, and if all of the safeguards we recommend are in place, then use of the MLAT to aid in determining if a student has a FLLD is both ethical and appropriate.

REFERENCES

Americans with Disabilities Act of 1990, 42 U.S.C. § 12101 et seq.
Carroll, J. B. (1962). The prediction of success in intensive foreign language training. In R. Glaser (Ed.), Training research and education (pp. 87-136). Pittsburgh, PA: University of Pittsburgh Press.
Carroll, J. B. (1981). Twenty-five years of research on foreign language aptitude. In K. Diller (Ed.), Individual differences and universals in language learning aptitude (pp. 83-118). Rowley, MA: Newbury House.
Carroll, J. B., & Sapon, S. M. (2002). Modern Language Aptitude Test manual. North Bethesda, MD: Second Language Testing, Inc.
Cohen, J. (1983). Learning disabilities and the college student: Identification and diagnosis. In M. Sugar (Ed.), Adolescent psychiatry: Developmental and clinical studies (Vol. 9, pp. 177-198). Chicago: University of Chicago Press.
Columbia University Office of Disability Services. (2004). Learning disability documentation guidelines. Retrieved May 3, 2004, from http://www.health.columbia.edu/ods/pdfs/ldguides.pdf
Ehrman, M. (1996). Understanding second language learning difficulties. Thousand Oaks, CA: Sage.
Ehrman, M. (1998). The Modern Language Aptitude Test for predicting learning success and advising students. Applied Language Learning, 9, 31-70.

REED AND STANSFIELD

Goodman, J., Freed, B., & McManus, W. (1990). Determining exemptions from foreign language requirements: Use of the Modern Language Aptitude Test. Contemporary Educational Psychology, 15, 131-141.
Guckenberger et al. v. Trustees of Boston University et al., 974 F. Supp. 106 (D. Mass. 1996).
Individuals with Disabilities Education Act, 20 U.S.C. §§ 1400 et seq.
Kavale, K. (1993). How many learning disabilities are there? A commentary on Stanovich’s “Dysrationalia: A new specific learning disability.” Journal of Learning Disabilities, 26, 520-523.
Section 504 of the Rehabilitation Act of 1973, 29 U.S.C. § 794.
Schwarz, R. L. (1997). Learning disabilities and foreign language learning: A painful collision. Retrieved from http://www.ldonline.org/ld_indepth/foreign_lang/painful_collision.html
Scott, S. S., & Manglitz, E. (2000). Foreign language learning and learning disabilities: Making the college transition. Retrieved from http://www.ldonline.org/ld_indepth/foreign_lang/their_world_2000.html (Reprinted by permission from Their World, National Center for Learning Disabilities.)
Sparks, R., & Javorsky, J. (2000). Section 504 and the Americans with Disabilities Act: Accommodating the learning disabled student in the foreign language curriculum (an update). Foreign Language Annals, 33, 645-654.
United States Office of Education. (1977). Definition and criteria for defining students as learning disabled (Federal Register, 42:250, p. 65083). Washington, DC: U.S. Government Printing Office.
University of Wisconsin-Madison. (2001). Foreign language: Substitutions for students with certain disabilities. In Undergraduate Catalogue 1999-2001. Retrieved May 3, 2004, from http://www.wisc.edu/pubs/home/archives/ug99/101ettsci/geninfo.html
Wesche, M. (1981). Language aptitude measures in streaming, matching students with methods, and diagnosis of learning problems. In K. Diller (Ed.), Individual differences and universals in language learning aptitude (pp. 119-154). Rowley, MA: Newbury House.

LANGUAGE ASSESSMENT QUARTERLY, 1(2&3), 177-193
Copyright © 2004, Lawrence Erlbaum Associates, Inc.

Ethical Considerations in the Assessment of the Language and Content Knowledge of U.S. School-Age English Learners

Alison L. Bailey and Frances A. Butler

University of California, Los Angeles, and The National Center for Research on Evaluation, Standards, and Student Testing

Within the broad context of school accountability, issues related to the equitable inclusion of English learners (ELs) in mandated content assessments present formidable challenges to educational professionals. We examine current U.S. assessment policies in terms of basic ethical considerations and educational value. We then propose that effective inclusion efforts must begin with equitable exposure to and learning of the academic language required in the classroom and on content tests. Reviewing previous studies of academic language (AL), we begin to place AL within an evidentiary-based framework for the development of AL proficiency assessments that will help determine whether EL students have the requisite language skills for taking content tests.

Requests for reprints should be sent to Alison L. Bailey, Graduate School of Education, University of California, Los Angeles, CA 90095. E-mail: [email protected]

1English learner is used here to refer to a student who is in the process of acquiring English but who is not yet sufficiently proficient in English to participate in mainstream classes without English language support. These students are often referred to as limited English proficient (LEP) students or English language learners (ELLs) in the literature.

This article addresses the issue of ensuring the fair assessment of content knowledge acquired by students as they learn to speak, read, and write in their second language—English. We first review studies that have investigated the language factors and potential ethical issues related to the inclusion of English learners (ELs)1 in mandated assessments in U.S. public schools. Although these studies share the same objective of informing educational policymakers about the equitable inclusion of both EL students and English-proficient students on content-area assessments, we suggest a change of approach that will require future state inclusion efforts to
begin by first ensuring equitable exposure to and learning of academic language (AL) and, second, by being held accountable for these efforts through devising academic language proficiency (ALP) assessments to help gauge student readiness for mainstream instruction and content-area testing.

The scope of this assessment issue is best illustrated through data for EL students in Pre-Kindergarten (Pre-K) through Grade 12 in U.S. schools. There were 4.58 million Pre-K-12 public school students designated LEP in the 2000 to 2001 school year. This constitutes an increase of 105% over the prior decade and amounts to just over 9.6% of the total public student enrollment nationally. California has the largest number of EL students with approximately 1.5 million, or 34% of the total U.S. EL population; the state with the next largest EL population, Texas, reports 570,000 EL students (Kindler, 2002). By all accounts, recent immigration statistics suggest this increase will continue apace (Carnevale & Fry, 2000), with 22% of the nation’s school-age population expected to be the children of immigrants by 2010 (Fix & Passel, 1994).

The overarching purpose of our research initiative at the Center for Research on Evaluation, Standards, and Student Testing (CRESST) is to determine the validity of administering large-scale content assessments in English to this growing number of students, thereby helping to assure the equitable assessment of all students in the United States, an agenda recommended to test developers and educational researchers nearly a decade ago (e.g., LaCelle-Peterson & Rivera, 1994). But it is one that remains unresolved and critical today given the No Child Left Behind (NCLB) Act of 2001, which requires states to measure both the content knowledge and English language development of EL students.
The notion of ethics in this assessment arena is primarily articulated as equitable treatment of EL students, given our increasing knowledge, as well as remaining gaps in knowledge, of the complexities involved in assessing students as they simultaneously acquire English language skills and learn content-area material. Specifically, equitable treatment must be realized by assuring the construct validity of both language and content assessments. The former requires assessing the relevant language proficiency for academic settings, and the latter requires assessing content knowledge in an accessible way. We also invoke other notions of ethical behavior in assessment practices, such as fiscal responsibility and humane treatment of individuals, when we consider the logic of implementing statewide assessment of students in a language they have yet to master. These considerations are consistent with the validity framework of Messick (1989), which addresses both the construct validity facets of testing and consequential facets such as the social ramifications of testing.

Previous research relevant to the efforts to ensure fair testing of EL students has taken place in four distinct areas: (a) attempts to determine the validity of using testing accommodations (e.g., extra time) with EL students (Abedi, Courtney, & Leon, 2001; Abedi, Lord, Hofstetter, & Baker, 2000; Butler & Stevens, 1997;
Castellon-Wellington, 1999; Rivera & Stansfield, 2001), (b) the language demands placed on students by content assessments (Bailey, 2000a; Butler, Stevens, & Castellon-Wellington, 1999), (c) the mismatch between the language of language assessments and the language of content assessments, which calls into question the utility of existing language assessments for predicting student readiness to perform in content areas (Butler & Castellon-Wellington, 2000; Stevens, Butler, & Castellon-Wellington, 2000), and recently, (d) efforts to characterize the language demands of the mainstream classroom in terms of teacher talk and textbook language (e.g., Bailey, Butler, LaFramenta, & Ong, 2001; Cazden, 2001; Reppen, 2001; Schleppegrell, 2001).

This article first addresses approaches to EL inclusion currently being practiced in K-12 assessment systems. We examine these approaches in terms of ethical considerations and question their educational value. Next, we propose that future inclusion efforts must begin with equitable exposure to, and learning of, not only content-area material but also the AL required in mainstream schooling. Finally, we argue for the placement of AL within a broader evidentiary-based framework for use in AL proficiency assessment specifications and prototypes.

CURRENT APPROACHES TO ACHIEVING FAIR ASSESSMENT

To date, most concern for the fair assessment of EL students has focused on what can be done when EL students are faced with being assessed in English on their content knowledge. States have a number of potential options available to them: (a) to include all students, both EL and English proficient, on standardized tests and report all scores; (b) to exclude or exempt some EL students from taking the test based on a range of criteria; (c) to include ELs in the tests but exclude their scores from reports; or (d) to provide some form of language support through the use of test accommodations. When Rivera, Stansfield, Scialdone, and Sharkey (2000) surveyed U.S. states on their practices, the states reported using just options (a), (b), and (d). Just two states, California and Ohio, required all students to take state assessments in the grades for which they are mandated (e.g., California, Grades 2-12); 46 states allowed exemptions based on a range of criteria (e.g., Texas, a state that, among other exemption criteria, includes level of literacy and oral language proficiency in English or Spanish or both, and amount of time spent learning English in schools); and 37 states allowed for the use of accommodations (e.g., New York, extra time with no limit). Clearly there is a lack of uniformity in inclusion policies at all levels. Duran, Brown, and McCall (2002) pointed out, moreover, that the Rivera et al. (2000) classifications may no longer be accurate, given that policies continue to differ from year to year and definitions
used to identify the EL population vary from state to state. Indeed, the authors reported on a study of the Oregon Statewide Assessment System, which came into effect as early as 1991, that suggests to us that at least one state may be officially exercising a form of option (c). Oregon policies prohibit the inclusion of student scores in performance reports if those scores are obtained under modified assessment conditions (i.e., changes to the test administration or presentation that are thought to create invalid scores). We now turn to the ethical concerns that are specific to each of the options introduced previously.2

Blanket Inclusion

Programs that serve EL students should account for the degree of student learning, as well as provide feedback to teachers about the success of their teaching practices. However, to ask an EL student with limited English proficiency to take a content-area examination in English could lead to a variety of outcomes, all with questionable ethics. First and foremost, it should be obvious, we feel, that forcing students to take a test in a language they do not know very well is a questionable exercise that may produce invalid results that provide little information about the content areas being tested. Such a mandate can be construed as unethical for a number of reasons, not the least of which is the reckless use of public monies needed to pay for this testing initiative, as well as the waste of human resources (both student and teacher), such as time and effort in administering, completing, and scoring.

Inclusion means we are faced with the issue of what an EL student’s score actually means. Is it a measure of the student’s content-area knowledge or the student’s knowledge of English, or, more specifically, the type of English commonly used in academic situations? The constructs being measured in the case of EL students are, it is claimed, likely different from the constructs being measured for English-proficient students taking the test (LaCelle-Peterson & Rivera, 1994). We have sufficient evidence now that EL performance on content-area test items systematically differs from that of English-only or English-proficient students and that the differential is greater for language-dependent subject areas, such as language arts and social studies, than it is for subject areas such as mathematics (e.g., Abedi & Leon, 1999; Abedi, Leon, & Mirocha, 2000; Butler & Castellon-Wellington, 2000; MacSwan & Rolstad, 2003).
Moreover, analysis of the language demands of the test items themselves has shown a correlation between high-demand items and the size of the differential between EL and English-proficient student performance (Bailey, 2000a). Specifically, there is a larger performance differential on discourse-rich mathematics items in contrast with mathematics items that make use of formulas and little prose. This provides support for the notion that the differential performance of EL and English-proficient students is more an issue of insufficient language knowledge than insufficient mathematics knowledge.

These studies all lend credence to the claim that the constructs being measured by content-area tests are different for EL students than for English-proficient students, and it is therefore invalid to conclude that their performance on these assessments is in any way meaningful (Duran et al., 2002). To know this and to continue using such assessment systems with EL students in U.S. schools can be deemed unethical by the standards recently invoked by testing professionals. For example, Hamp-Lyons (1997) pointed out that although test developers cannot reasonably be held responsible for all consequences of misuse or misinterpretation of the tests they develop, we as a field can be held responsible for all those of which we are aware. Indeed, the Standards for Educational and Psychological Testing, published jointly by the American Educational Research Association (AERA), the American Psychological Association (APA), and the National Council on Measurement in Education (NCME; 1999), provide a comprehensive set of guidelines for test development and use, and they form the basis of recent efforts by a number of agencies and organizations providing standards for assessment practice in K-12 education (e.g., CRESST) and, more specifically, the testing of individuals of diverse linguistic backgrounds (e.g., Council of the Great City Schools & National Clearinghouse for English Language Acquisition & Language Instruction Educational Programs, 2002).

2For a fuller discussion of state inclusion policies and definitions of terminology used in those policies, see Rivera et al., 2000; also see National Research Council, 1999, for a description of district-level policies on exemptions and accommodations.
The CRESST Standards for Educational Accountability Systems (Baker, Linn, Herman, & Koretz, 2002), for example, take the joint AERA, APA, and NCME (1999) standards as a given and offer guidelines for good testing practices, one of which pertains directly to assessment practices with EL populations that may differ from the general student population, on which tests are typically normed. Following the CRESST standards, assessments used with a specific population—in this case, EL students—should be normed on students who are representative of the EL students who will be expected to take the tests. This step is a crucial validity component of any test development process and is absolutely necessary to ensure that fair and equitable decisions can be made about the educational programming provided to EL students.

Other ethical considerations of including EL students in content-area assessments remain, of course. Students may often be negatively impacted emotionally and educationally by the experience of trying to perform on an impossible task. There are countless anecdotes in the literature and popular press of the kinds of upset and trauma students can suffer as a result of an inclusion practice such as being forced to take a test in a language they are still learning. As one principal dramatically sums up in a Washington Post report of Maryland’s efforts to increase the number of ELs taking statewide assessments, “In third grade, 10% of the kids got

straight zeros” (Schulte, 2002, p. T03). Unfortunately, this type of outcome can lead to frustration on the part of EL students (Duran et al., 2002) and, like school district personnel, EL students and their families may very well interpret their performance as a true reflection of their content-area knowledge.

Finally, not only is inclusion unfair to the individual EL students taking the tests, but the practice of including their scores in aggregate reporting of school performance to school districts and state departments of education is perceived to be unfair to the other students and teachers at these schools—consequently a misguided practice that LaCelle-Peterson and Rivera (1994) warned against years ago. The feeling that a school’s test scores suffer as a result of EL inclusion may lead to the unfair blaming of EL students when state and district sanctions are imposed on a low-performing school. As Schulte (2002) reported, “Many teachers... worry that declining scores as more students are tested will only breed resentment for non-English-speaking children in the school community” (p. T03). The ethical deployment of sanctions raises an important issue if sanctions are to be determined in part by the questionable inclusion of test scores from EL students. Moreover, accountability policies that use rewards and sanctions are made all the more vulnerable by potentially unfair practices when we consider the recent linkage of teacher merit pay to increases in student achievement scores (Helfand, 2000).

Unlike the established procedures for administering and scoring most other high-stakes tests, such as the Scholastic Assessment Test used for college entrance and the Test of English as a Foreign Language in the college-level English language assessment arena, K-12 assessment has allowed classroom teachers to proctor and administer examinations, often with little or no training for ensuring consistent administration procedures.
In some cases (e.g., K-12 English language assessments), teachers have even scored the assessments without the necessary training to ensure reliability. The potential for inconsistencies that threaten test validity is thus already large, but the potential for unethical practices becomes so much the greater when the same teachers have a bias built in by the recent merit pay policies.

Exemption Practices

Perhaps the fairer option then is to exempt EL students from taking content-area tests or to exclude them altogether while the assessment instruments we have available are (a) unable to discriminate between a demonstration of students’ content knowledge and students’ English language ability, and (b) administered and interpreted in potentially biased ways. In the past, many states and school districts excluded EL students from taking standardized content tests (August & Lara, 1996). But issues of ethical practice surface here too: How will those teaching EL students be held accountable? How will we measure student growth and performance for diagnostic use?

Even if the exemption and exclusion of EL students were clearly fair alternatives, which the preceding questions show them not to be, these options will no longer be available to school districts. The situation has altered with recent changes to federal law that require the inclusion of EL students in new mandated assessment systems. The NCLB Act of 2001 increases school accountability by reauthorizing Title I of the Elementary and Secondary Education Act of 1965 (see NCLB Act, 2001), which provides local educational agencies resources to improve instruction in schools in high-poverty areas. The NCLB Act requires states to implement an accountability system that includes annual reading and math assessments aligned to state academic standards for all students in Grades 3 to 8. Schools that fail to make “adequate yearly progress” (p. 23) are subject to corrective action and restructuring measures. This ruling also pertains to all EL students, who must be assessed in these subject areas “in a valid and reliable manner” (p. 28), including “in the language and form most likely to yield accurate data” (p. 28), unless they have attended school in the United States for 3 or more consecutive years, at which time EL students must be assessed in English.3 Moreover, ELs are affected by Title III, the English Language Acquisition Act (NCLB Act, 2001b), which replaces the former Bilingual Education Act, Title VII. States must establish achievement objectives for the acquisition of English language proficiency and assess all EL students (NCLB Act, 2001b, p. 296).

If EL students must be included in content-area assessments, what would make their inclusion meaningful, fiscally responsible, unbiased, and therefore ethical, given what we now know about the questionable validity of interpreting current test results? One alternative being practiced in some states is the use of test accommodations, to which we now turn.

Test Accommodations

Specifically, this approach has been implemented with changes to the content-area assessments themselves (e.g., language simplification), or to the testing situation (e.g., extra time), in an attempt to achieve a valid evaluation of EL students’ content knowledge. When last surveyed (1998-1999 school year), 37 U.S. states allowed accommodations (Rivera et al., 2000). The most popular accommodations involved modifying the classroom routine in terms of setting and timing. Setting accommodations are practices that change the environment in which the test is given (e.g., small group administration and preferential seating). Similarly, timing or scheduling accommodations include allowing students extra time, longer breaks, and testing over several days. A small number of states offered a bilingual or native language option, and just two states modified their content-area assessments to provide a sheltered-English version. (For a detailed report, see Rivera et al., 2000, and for a review of issues, see Kopriva, 2000.)

3A local education authority may argue for an exception on a case-by-case basis for an extension of native language or accommodation use up to 2 additional years (NCLB Act, 2001a, p. 28).

Recent research has investigated the role of test accommodations in creating a fair measure of EL students’ content-area knowledge. The results of the studies have been mixed, with, on the one hand, some evidence that student performance is not significantly improved by the use of specific accommodations (e.g., Abedi et al., 2001; Castellon-Wellington, 1999) and, on the other, evidence showing modest but significant effects (e.g., Abedi, Lord, & Hofstetter, 1998; Abedi, Lord, & Plummer, 1997). The interaction of language proficiency level with choice of accommodation has been difficult to address in the accommodation research because of a lack of language tests that can help determine the type of accommodation that might be most beneficial for students at different proficiency levels. Of particular concern with this type of intervention is the possibility that the use of accommodations may alter the construct being assessed by giving unfair advantage to those students who receive an accommodation over those who do not. In some instances (e.g., provision of a glossary plus extra time), English-proficient students benefit from accommodations more than their EL counterparts do (Abedi, Lord, et al., 2000; Olson & Goldstein, 1997). Evidence of this type raises issues about the validity of assessment results when accommodations are used selectively with EL students, and it becomes an ethical consequence when attempts to level the playing field for EL students knowingly disadvantage English-proficient students.
As Heubert (2001) warned in other areas of educational policy (i.e., retention), “like physicians who pledge to ‘do no harm,’ we should avoid practices that research shows to be counterproductive.” Given that each of these policy approaches outlined previously has a drawback that at the very least compromises test validity, the need for an alternative approach to assessment of EL students remains.

A PROPOSED ALTERNATIVE: TEACHING AND TESTING ACADEMIC LANGUAGE

We make the assumption that equitable inclusion in content-area assessments begins with equitable exposure to and learning of the type of language that will ultimately be required of students on those assessments. Although the accommodation approach can serve a role in current practice in a triage fashion, we argue for a new approach that equips ELs with the appropriate language abilities to take tests without the need of accommodations. This should ultimately serve as the goal in the longer term strategies of educators who are addressing the issues of EL content-area assessment. Although 23 U.S. states monitor progress in English as an alternative accountability practice when EL students have been in the country for fewer than 3 years, it is unclear to what extent the English language instruction and accompanying English language tests focus on the AL necessary for content-area performance. Research into AL thus has important implications for the development of ALP tests that can be used to screen students to determine if they have sufficient and relevant English language abilities before they even attempt to take content-area assessments. In addition, for students who have some proficiency in English but who have not yet reached a sufficient level to handle content-area assessments without language support, ALP tests could be used to help determine the appropriate content-area test accommodations for individual ELs. Indeed, we have suggested elsewhere that the assessment of ALP could have an impact on the use of accommodations with EL students “in the creation of a set of principled procedures for implementing those accommodations” (Bailey & Butler, 2002, p. 34).

In sum, an accountability system that attempts to be socially responsive must pay attention to both the evidential basis (e.g., construct validity) and the consequential basis, that is, test interpretation and test use that can have social consequences for test takers (Messick, 1989). Furthermore, the primary goal with ELs must be the acquisition of academic English; en route to that goal, if test accommodations are used, their use should be systematic and appropriate.

Defining Academic Language

Beginning with Cummins’ (1980) introduction of Basic Interpersonal Communicative Skills (BICS) and Cognitive Academic Language Proficiency (CALP), discussions of AL have helped focus attention on the complexities of language use in academic settings K-12. The distinction between BICS and CALP contrasts communication skills, acquired and used in everyday interactions, with the language proficiency acquired and used in the context of the classroom. A student who is academically proficient in a language (first or second) can use global and domain-specific vocabulary, language functions, and discourse structures in one or more fields of study to acquire new knowledge and skills, interact about a topic, or impart information to others.

According to Chamot and O’Malley (1994), the functions of AL (e.g., being able to explain, describe, contrast, etc.) are “the tasks that language users must be able to perform in the content areas” (p. 40). They are the range of communicative intents for which a teacher or student may use language in the classroom. Solomon and Rhodes (1995) took a sociolinguistic view that defines AL in terms of register. Johns (1997) identified a register of English used in professional books and characterized by the specific linguistic features associated with academic disciplines. Short (1994) documented the range of language functions found in social studies classes, including explanation, description, and justification functions. Similarly, Bailey et al. (2001) and Bailey, Butler, Borrego, LaFramenta, and Ong (2002) found that the language encountered in upper elementary science


BAILEY AND BUTLER

classrooms required students to comprehend language that was organized for specific purposes, namely explanations, descriptions, comparisons, and evaluations. Gibbons (1998) focused on the intertextual nature of classroom language and emphasized the importance of language use across the skill areas—the need to integrate oral and print language skills in classroom activities. The integration is important because oral and print language have different characteristics and consequently different demands (Schleppegrell, 2001). Cummins (2000) revisited the BICS-CALP distinction and stressed the multidimensional nature of AL. Scarcella (in press) took a broad view of AL, building on Kern’s (2000) model of academic literacy, which includes linguistic, cognitive, and sociocultural-psychological dimensions. More specifically, at the lexical level, Cunningham and Moore (1993) indicated that the classroom is the setting for the acquisition of academic vocabulary, vocabulary that students might not otherwise be exposed to. Stevens, Butler, and Castellon-Wellington (2000) identified three categories of words: (a) high-frequency general words, those words used regularly in everyday contexts; (b) nonspecialized academic words, those academic words that are used across content areas; and (c) specialized content-area words, those academic words unique to specific content areas. The nonspecialized language that cuts across content areas is a form of AL that is not specific to any one content area (e.g., represent) but is nevertheless a register, or a precise way of using language, that is often specific to educational settings. Specialized content-specific language includes the conceptual terminology of, for instance, social science (e.g., antitrust; Criscoe & Gee, 1984). Bailey et al. (2001) and Bailey et al. (2002) found that teachers rarely highlighted academic vocabulary during elementary science lessons, but overt instruction of specialized vocabulary occurred more often than overt instruction of nonspecialized vocabulary, and it frequently took the form of giving examples. Bailey (2000b, in press) pointed out that even the most common words can take on a very precise and possibly unfamiliar meaning in an academic context—for example, the use of the preposition by to mean according to, as in the sentence I want you to sort these beans by color. Indeed, the science classroom observations (Bailey et al., 2001, 2002) reported previously revealed terms that have both precise scientific meanings and nonacademic usage. Gibbons (1998) noted that students’ familiarity with everyday language should be seen as a conduit for developing “the unfamiliar registers of school” (p. 99): when a teacher introduces a more scientific term for a word a student has used to talk about an observation or phenomenon, the two co-construct concepts, allowing the student to acquire more specific scientific terminology in the process. AL also implies ability on the part of students to express knowledge by using recognizable verbal and written academic formats. For example, students must learn acceptable, shared ways of presenting information to the teacher so that the teacher can successfully monitor learning. Indeed, the opportunity to display

U.S. SCHOOL-AGE ENGLISH LEARNERS


knowledge is also an important feature of the classroom (Cazden, 2001) and may prove as crucial to students as the opportunity to learn. Moreover, AL use is often decontextualized, in that students do not receive aid from the immediate environment to construct meaning. There is little or no feedback on whether they are making sense to the listener or reader, so students must monitor their own performance (spoken or written) based on abstract representations of others’ knowledge, perspectives, and informational needs (e.g., Menyuk, 1995; Snow, 1991).

Although the work cited here has done much to raise awareness about the construct of AL, there has not been a general, systematic approach to operationalizing AL for broad use in curriculum and language test development. We propose a research framework that will facilitate the articulation of the AL construct from an empirical base. This approach should allow for the specificity needed to produce test specifications for ALP assessments and detailed curricular guidelines for teaching AL. Figure 1 schematizes evidentiary bases for the characterization of AL and highlights the place of ALP assessment in a socially responsive accountability system. We adopted this particular framework partly in response to recent calls for the explicit articulation of evidence used in educational research (National Research Council, 2002), and partly because it clearly addresses the different contexts within which AL arises. We see a test of ALP as an intermediate step between the English language development (ELD) tests currently available and widely used

FIGURE 1 Evidentiary bases for the development of academic language proficiency (ALP) assessments.


(e.g., Language Assessment Scales; Duncan & De Avila, 1990) and content-area assessments.4 Inclusion of this additional step in the assessment of ELs should provide for a fairer accountability system with this population of students. The justification for this intervening assessment stems from limitations of ELD assessments that focus, to a large degree, on social uses of English and not on the more scholastic use of English encountered in academic settings. As a consequence, we argue, there is little utility in using student scores on existing ELD assessments to gauge whether EL students are ready for redesignation to the mainstream classroom or whether their English is sufficiently proficient for them to meaningfully perform on content-area assessments. This is not to minimize the continued importance of measuring the development of student social uses of English, especially because the social uses may form a foundation on which students build ALP.

States have perhaps demonstrated their recognition of this mismatch or shortfall in the usefulness of existing ELD assessments by making performance on a language arts subsection of content-area assessments part of their redesignation criteria. This practice, however, hinges on circular logic—some states and districts are apparently using a content-area subtest, such as reading, to help determine if EL students are indeed ready to take the same content-area test that includes the subtest. Although such subtests might in fact better reflect the type of language students must be able to understand and use than existing ELD tests, they were not developed to assess ELD and have not been normed on EL students. Furthermore, it is not good testing practice to use a test as its own predictor. Thus, a test expressly developed to measure growth in the kinds of comprehensive English language skills needed for school success would seem a more logical approach to the fair assessment of EL students.
As an initial step toward the development of such a test, we have identified six potential sources of evidence of AL to serve as bases for defining the construct. These sources of evidence are arrayed in clockwise order around the ALP test starting in the upper right corner of Figure 1. The first circle represents existing evidence of the range and number of lexical, syntactic, and discourse demands students encounter on content-area tests (Bailey, 2000a; Stevens et al., 2000), as well as evidence of the less demanding language found in ELD tests (Stevens et al., 2000). The three circles at the bottom of the figure represent work currently in progress that focuses on the implicit and explicit language expectations made in the content-area standards of four key states (California, Texas, Florida, and New York), English as a second language (ESL) standards (Teachers of English to Speakers of

4The content-area assessments need not be limited to norm-referenced assessments but could be criterion-referenced (e.g., performance or portfolio) assessments— types of assessment that may prove to have similar or unique (i.e., even greater) AL demands.


Other Languages [TESOL], 1997), as well as content-area standards published by national organizations (e.g., the National Science Teachers Association). An important caveat here is that any effort to align an ALP assessment to such standards is undermined in the absence of data to support the choice of content coverage and levels of difficulty that they adopt. The justification for using standards that are not empirically based, both in our own work and in general, has thus emerged as a concern. For example, McKay (2000) called for a sound theoretical base for the construction of standards in the ESL arena. In spite of this caution, we have included analyses of standards documents because the standards are aspirational in nature, representing a desired level or type of instruction and student knowledge base, if not actual classroom practice. Indeed, standards are an important piece of current educational reform in the United States and thus must be considered in any attempt to revise existing policy and practices.

The upper left circle represents the need to document the linguistic expectations mainstream teachers have for their EL students. This additional information will help determine the language demands, and the time frame for acquisition, that students face in the opinions of the teachers who teach them (Hicks, 1994). The remaining circle represents the need to characterize the language embedded in oral and written classroom discourse, particularly the description of AL in mainstream content-area classes that can serve as a baseline for AL expectations. As our future research builds on the foundation we have outlined here, it will be possible to identify AL features that cut across content areas and those that are specific to a particular content area.
Development of language test specifications that focus on both oral and written AL will serve the long-term goal of developing a test framework grounded in empirical data, culminating in ALP prototypes for use in the meaningful and fair measurement of the English language proficiency of ELs.

IMPLICATIONS FOR BROADER EDUCATIONAL APPLICATION

Implicit in the framework outlined previously is the belief that AL development is, or should be, within the purview of content-area teachers as well as language arts teachers and, in the case of ELs, ESL specialists. Therefore, knowledge of ESL teaching techniques could be helpful to content teachers in instructing students in the acquisition of AL. Indeed, researchers have noted elsewhere that mainstream teachers could and should have exposure to ESL techniques when teaching content-area classes (e.g., Kaufman, 1997), if not a greater knowledge of language development in general (Wong Fillmore & Snow, 2000). Chamot and O’Malley (1994) provided a specific curricular approach to combining English language development with other content-area instruction in the classroom. In this regard, the ESL Standards for Pre-K-12 Students (TESOL, 1997) can also serve as a guide to


content-area teachers who seek to explicitly integrate language into content instruction. Although content teachers have national and state content standards available to them, those standards rarely provide guidance regarding language development and use because they assume student proficiency in English. Thus, as we move to engage content-area teachers in their students’ development of AL, the TESOL ESL standards can serve as a critical link between language and content. Moreover, because AL can be conceived of as a “second language” for many school-age children, both those who speak a native language other than English and those who speak a dialect other than the mainstream dialect (Bailey & Butler, 2002; Corson, 1997; Kuehn, 2000), this effort should not pertain exclusively to the instruction and assessment of EL students. Indeed, it is unlikely that any child receives full exposure to academic registers other than in the formal setting of the school—hence the importance of exploring the teaching of AL explicitly to all students.

ACKNOWLEDGMENTS

The authors thank Malka Borrego and Danna Schacter for their assistance in the preparation of this article. The work reported herein was supported under the Office of Bilingual Education and Minority Languages Affairs and the Educational Research and Development Centers Program, PR/Award Number R305B60002, as administered by the former Office of Educational Research and Improvement, U.S. Department of Education. The findings and opinions expressed in this report do not reflect the positions or policies of the National Institute on Student Achievement, Curriculum, and Assessment, the former Office of Educational Research and Improvement, or the U.S. Department of Education.

REFERENCES

Abedi, J., Courtney, M., & Leon, S. (2001). Language accommodations for large-scale assessment in science. Los Angeles: University of California, Center for Research on Evaluation, Standards, and Student Testing.
Abedi, J., & Leon, S. (1999). Impact of students’ language background on content-based performance: Analyses of extant data. Los Angeles: University of California, Center for Research on Evaluation, Standards, and Student Testing.
Abedi, J., Leon, S., & Mirocha, J. (2000). Examining ELL and non-ELL student performance differences and their relationship to background factors: Continued analyses of extant data. In The validity of administering large-scale content assessments to English language learners: An investigation from three perspectives (pp. 3-49). Los Angeles: University of California, Center for Research on Evaluation, Standards, and Student Testing.
Abedi, J., Lord, C., & Hofstetter, C. (1998). Impact of selected background variables on students’ NAEP math performance (CSE Tech. Rep. No. 478). Los Angeles: University of California, Center for Research on Evaluation, Standards, and Student Testing.
Abedi, J., Lord, C., Hofstetter, C., & Baker, E. (2000). Impact of accommodation strategies on English language learners’ test performance. Educational Measurement: Issues and Practice, 19(3), 16-26.
Abedi, J., Lord, C., & Plummer, J. (1997). Final report of language background as a variable in NAEP mathematics performance (CSE Tech. Rep. No. 429). Los Angeles: University of California, Center for Research on Evaluation, Standards, and Student Testing.
American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (1999). Standards for educational and psychological testing. Washington, DC: American Educational Research Association.
August, D., & Lara, J. (1996). Systematic reform and limited English proficient students. Washington, DC: Council of Chief State School Officers.
Bailey, A. L. (2000a). Language analysis of standardized achievement tests: Considerations in the assessment of English language learners. In The validity of administering large-scale content assessments to English language learners: An investigation from three perspectives (pp. 85-115). Los Angeles: University of California, Center for Research on Evaluation, Standards, and Student Testing.
Bailey, A. L. (2000b, Fall/Winter). Learning to read makes language learners of us all. Center X Forum, 1(1), 1, 9. Retrieved May 1, 2002, from University of California, Los Angeles, Graduate School of Education Web site: http://www.centerx.gseis.ucla.edu/forum/
Bailey, A. L. (in press). Academic language: Its conceptualization, operationalization and use in the assessment of ELL students. In T. Wiley & K. Rolstad (Eds.), Rethinking school language. Mahwah, NJ: Lawrence Erlbaum Associates, Inc.
Bailey, A. L., & Butler, F. A. (2002). An evidentiary framework for operationalizing academic language for broad application to K-12 education: A design document. Los Angeles: University of California, Center for Research on Evaluation, Standards, and Student Testing.
Bailey, A. L., Butler, F. A., Borrego, M., LaFramenta, C., & Ong, C. (2002, Summer). Towards the characterization of academic language. Language Testing Update, 31, 45-52.
Bailey, A. L., Butler, F. A., LaFramenta, C., & Ong, C. (2001). Towards the characterization of academic language. Los Angeles: University of California, Center for Research on Evaluation, Standards, and Student Testing.
Baker, E., Linn, R., Herman, J., & Koretz, D. (2002). Standards for educational accountability systems (Policy Brief No. 5). Los Angeles: University of California, Center for Research on Evaluation, Standards, and Student Testing.
Butler, F. A., & Castellon-Wellington, M. (2000). Students’ concurrent performance on tests of English language proficiency and academic achievement. In The validity of administering large-scale content assessments to English language learners: An investigation from three perspectives (pp. 51-83). Los Angeles: University of California, Center for Research on Evaluation, Standards, and Student Testing.
Butler, F. A., & Stevens, R. (1997). Accommodation strategies for English language learners on large-scale assessments: Student characteristics and other considerations (CSE Tech. Rep. No. 448). Los Angeles: University of California, Center for Research on Evaluation, Standards, and Student Testing.
Butler, F. A., Stevens, R., & Castellon-Wellington, M. (1999). Academic language proficiency task development process. Los Angeles: University of California, Center for Research on Evaluation, Standards, and Student Testing.
Carnevale, A. P., & Fry, R. (2000). Crossing the great divide: Can we achieve equity when generation Y goes to college? Princeton, NJ: Educational Testing Service.
Castellon-Wellington, M. (1999). The impact of preference for accommodations: The performance of English language learners on large-scale academic achievement tests (CSE Tech. Rep. No. 524). Los Angeles: University of California, Center for Research on Evaluation, Standards, and Student Testing.
Cazden, C. (2001). Classroom discourse: The language of teaching and learning (2nd ed.). Portsmouth, NH: Heinemann.
Chamot, A., & O’Malley, J. (1994). The CALLA handbook: Implementing the cognitive academic language learning approach. Reading, MA: Addison-Wesley.
Corson, D. (1997). The learning and use of academic English words. Language Learning, 47, 671-718.
Council of the Great City Schools & National Clearinghouse for English Language Acquisition & Language Instruction Educational Programs. (2002, March). Assessment standards for English language learners (Draft executive summary). Washington, DC: Author.
Criscoe, B., & Gee, T. (1984). Content reading: A diagnostic/prescriptive approach. Englewood Cliffs, NJ: Prentice Hall.
Cummins, J. (1980). The construct of proficiency in bilingual education. In J. Alatis (Ed.), Georgetown University Round Table on Languages and Linguistics: Current issues in bilingual education, 1980 (pp. 81-103).
Cummins, J. (2000). Language, power and pedagogy: Bilingual children in the crossfire. Clevedon, England: Multilingual Matters.
Cunningham, J., & Moore, D. (1993). The contribution of understanding academic vocabulary to answering comprehension questions. Journal of Reading Behaviors, 25, 171-180.
Duncan, S., & De Avila, E. (1990). Language Assessment Scales Reading and Writing Component, Forms IA and JA. Monterey, CA: CTB/McGraw-Hill.
Duran, R., Brown, C., & McCall, M. (2002). Assessment of English-language learners in the Oregon statewide assessment system: National and state perspectives. In G. Tindal & T. Haladyna (Eds.), Large-scale assessment programs for all students: Validity, technical adequacy, and implementation (pp. 369-392). Mahwah, NJ: Lawrence Erlbaum Associates, Inc.
Fix, M., & Passel, J. (1994). Immigration and immigrants: Setting the record straight. Washington, DC: Urban Institute Press.
Gibbons, P. (1998). Classroom talk and the learning of new registers in a second language. Language and Education, 12, 99-118.
Hamp-Lyons, L. (1997). Washback, impact, and validity: Ethical concerns. Language Testing, 14, 295-303.
Helfand, D. (2000, March 4). Merit pay proposed for LA teachers. Los Angeles Times, pp. A1, A19.
Heubert, J. (2001, June). Alumni award acceptance speech, Harvard Graduate School of Education, Cambridge, MA.
Hicks, D. (1994). Individual and social meanings in the classroom: Narrative discourse as a boundary phenomenon. Journal of Narrative and Life History, 4, 215-240.
Johns, A. (1997). Text, role, and context: Developing academic literacies. Cambridge, England: Cambridge University Press.
Kaufman, D. (1997). Collaborative approaches in preparing teachers for content-based and language-enhanced settings. In M. A. Snow & D. Britton (Eds.), The content-based classroom: Perspectives on integrating language and content (pp. 175-187). New York: Longman.
Kern, R. (2000). Notions of literacy. In R. Kern (Ed.), Literacy and language teaching (pp. 13-41). New York: Oxford University Press.
Kindler, A. (2002). Summary report of the survey of the states’ limited English proficient students and available educational programs and services, 2000-2001. Washington, DC: National Clearinghouse for English Language Acquisition.
Kopriva, R. (2000). Ensuring accuracy in testing for English language learners. Washington, DC: Council of Chief State School Officers.
Kuehn, P. (2000). Academic language assessment and development of individual needs: Book 1. Boston, MA: Pearson.
LaCelle-Peterson, M., & Rivera, C. (1994). Is it real for all kids? A framework for equitable assessment policies for English language learners. Harvard Educational Review, 64(1), 55-75.
MacSwan, J., & Rolstad, K. (2003). Linguistic diversity, schooling, and social class: Rethinking our conception of language proficiency in language minority education. In C. Paulston & G. Tucker (Eds.), Essential readings in sociolinguistics (pp. 329-340). Oxford, England: Blackwell.
McKay, P. (2000). On ESL standards for school-age learners. Language Testing, 17, 185-214.
Menyuk, P. (1995). Language development and education. Journal of Education, 177, 39-62.
Messick, S. (1989). Validity. In R. Linn (Ed.), Educational measurement (3rd ed., pp. 13-103). New York: Macmillan.
National Research Council. (1999). High stakes: Testing for tracking, promotion, and graduation. Washington, DC: National Academy Press.
National Research Council. (2002). Scientific research in education. Washington, DC: National Academy Press.
No Child Left Behind. (2001a). No Child Left Behind, Title 1: Improving the academic achievement of the disadvantaged. 107th Congress, 1st Session, December 13, 2001. (Printed version prepared by the National Clearinghouse for Bilingual Education). Washington, DC: George Washington University, National Clearinghouse for Bilingual Education.
No Child Left Behind. (2001b). No Child Left Behind, Title 3: Language instruction for limited English proficient and immigrant students. 107th Congress, 1st Session, December 13, 2001. (Printed version prepared by the National Clearinghouse for Bilingual Education). Washington, DC: George Washington University, National Clearinghouse for Bilingual Education.
Olson, J., & Goldstein, A. (1997). The inclusion of students with disabilities and limited English proficiency students in large-scale assessments: A summary of recent progress (NCES 97-482). Washington, DC: U.S. Department of Education, National Center for Education Statistics.
Reppen, R. (2001). Register variation in student and adult speech and writing. In S. Conrad & D. Biber (Eds.), Variation in English: Multi-dimensional studies (pp. 187-199). Harlow, Essex, England: Pearson Education.
Rivera, C., & Stansfield, C. (2001, April). The effects of linguistic simplification of science test items on performance of limited English proficient and monolingual English-speaking students. Paper presented at the annual meeting of the American Educational Research Association, Seattle, WA.
Rivera, C., Stansfield, C., Scialdone, L., & Sharkey, M. (2000). An analysis of state policies for the inclusion and accommodation of English language learners in state assessment programs during 1998-1999. Arlington, VA: George Washington University, Center for Equity and Excellence in Education.
Scarcella, R. (in press). Key issues in accelerating English language development. Berkeley, CA: University of California Press.
Schleppegrell, M. (2001). Linguistic features of the language of schooling. Linguistics and Education, 12, 431-459.
Schulte, B. (2002, February 28). Md. officials want ESOL students to take statewide tests. The Washington Post, p. T03.
Short, D. (1994). Expanding middle school horizons: Integrating language, culture, and social studies. TESOL Quarterly, 28, 581-608.
Snow, C. (1991). Diverse conversational contexts for the acquisition of various language skills. In J. Miller (Ed.), Research on child language disorders (pp. 105-124). Austin, TX: PRO-ED.
Solomon, J., & Rhodes, N. (1995). Conceptualizing academic language (Research Rep. No. 15). Santa Cruz: University of California, National Center for Research on Cultural Diversity and Second Language Learning.
Stevens, R., Butler, F. A., & Castellon-Wellington, M. (2000). Academic language and content assessment: Measuring the progress of ELLs (CSE Tech. Rep. No. 552). Los Angeles: University of California, Center for Research on Evaluation, Standards, and Student Testing.
Teachers of English to Speakers of Other Languages. (1997). ESL standards for pre-K-12 students. Alexandria, VA: Author.
Wong Fillmore, L., & Snow, C. (2000). What teachers need to know about language. Washington, DC: ERIC Clearinghouse on Languages and Linguistics. Retrieved August 1, 2000, from http://www.cal.org/ericcll

LANGUAGE ASSESSMENT QUARTERLY, 1(2&3), 195-204
Copyright © 2004, Lawrence Erlbaum Associates, Inc.

Filmic Portrayals of Cheating or Fraud in Examinations and Competitions Victoria Byczkiewicz

California State University, Los Angeles

Cheating manifests in creative ways in examinations and competitions in educational settings and is justified by numerous rationales or codes of (im)morality. Through the analysis of 3 filmic representations of such instances, this article examines ethical problems that arise in educational contexts and deals explicitly with notions of right conduct, fairness, and justice. The potential for conflict between the teacher-as-moral-agent’s abstract, absolute beliefs and the complexities presented by the ethical dilemma in its immediate context is explored. It was concluded that teachers’ decisions ultimately hinge on personally held beliefs regarding social and political inequities and that an educator’s professional and personal identities are inseparable.

Every educationist’s sense of right and wrong is deeply rooted in culturally or personally based beliefs. It follows that few practical solutions to ethical dilemmas are found with reference to the generic prescriptions of the Code of Ethics of the Education Profession (CEEP) (Strike & Soltis, 1998) or in the static, brittle pages of countless case studies. Cheating, the object of this discussion, manifests in various predictable, although also cunning and surprising, forms. It occurs in degrees of relative severity, from the uncalculated impulsive glimpse an uncertain student guiltily casts toward his neighbor’s test article, to acts of minor plagiarism symptomatic of academic indolence or poor study skills, to carefully premeditated acts of fraudulence committed without compunction. There is most often no blanket prescriptive response to any one of these manifestations—no manual, program, or party line to serve as a convenient guide. To fully comprehend the decisions that teachers face in the private domain of their classrooms, the moral agent is best off

Requests for reprints should be sent to Victoria Byczkiewicz, TESOL Program, Charter College of Education, California State University, Los Angeles, 5151 State University Drive, Los Angeles, CA 90032. E-mail: viktoria@sbcglobal.net


BYCZKIEWICZ

investigating “belief or faith rather than ... pure logic” (Johnston, 2003, p. 9). Dramatizations of ethical problems (as in films) that typically surface in educational settings resonate with a broad audience and therefore serve well in exploratory and instructional contexts that seek to probe the nature of morality (Light, 2003; Marshall, 2003).1

THE SCOPE: CHEATERS, THE EMPEROR’S CLUB, AND STAND AND DELIVER

With this framework in mind, in this article I compare three popular films to examine portrayals of cheating and fraud in examinations and competitions: Cheaters (2000), directed by John Stockwell; The Emperor’s Club (2002), directed by Michael Hoffman; and a few scenes from Ramon Menendez’s Stand and Deliver (1988), based on a true tale from Los Angeles’s east side. All of these films deal explicitly with notions of right conduct, fairness, or justice in an educational context and specifically with issues of cheating or fraud in assessment or competitions. They kindle our awareness of how assessment repeatedly emerges as the main academic practice around which critical ethical problems revolve. Both Cheaters and The Emperor’s Club highlight obvious ethical conundrums and juxtapose stark clashes in the teacher-protagonists’ fundamental personal values. They illustrate that even though “certain beliefs may be absolute,” solutions to complex moral issues are “fundamentally dependent on context” (Johnston, 2003, p. 7), as both teachers resort to violating their personal codes of ethics in the unexpected situations in which they find themselves. Stand and Deliver boldly indicts the Educational Testing Service (ETS, Princeton, NJ), the well-known developer and administrator of the SAT, for alleged racist suppositions and the unbridled power of its actions.

EDUCATORS COMPROMISE THEIR ETHICAL STANDARDS

In Cheaters (the title of which makes explicit its subject), a group of Polish American teens from working-class backgrounds unites in an overt effort to win a statewide academic contest by cheating. Although their coach, Dr. Plecki (played by Jeff Daniels), entertains an idealistic self-image as mentor and has already invested impressive energy in arousing his otherwise lackadaisical students’ morale to fairly achieve competitive test results, things go terribly awry when he is just as easily swayed to cheat as the student who procures an illicit copy of the forthcoming exam. For a mere $20, all unanimously agree that this is “the real deal.” In all fairness, Plecki does not join his students in reaching this conclusion without a fair amount of tortured inner grappling and flagellant incredulity at his own wayward proclivity. Initially, Plecki had loftily planned to transmit as much factual knowledge as his students could digest in a rigorous regimen of morning, noon, and night cram sessions. Yet an untoward reversal of personal moral standards in the face of temptation leads him to convince a skeptical student that cheating is indeed acceptable when it promises to get one ahead. Plecki’s fixation on results rather than process epitomizes his moral decline. He even advocates mindful cheating (“If you’re gonna cheat, cheat smart”; Stockwell, 2000).

Conversely, throughout The Emperor’s Club, set in the 1970s at an exclusive young men’s preparatory academy, the venerable instructor Mr. Hundert (played by Kevin Kline) spouts ambiguous platitudes such as “It is not living that is important, but living rightly” (Hoffman, 2002). These postulations reverberate in his own actions as well as those of some of his more virtuous students. It happens that the most incorrigible student, one saucy Sedgewick Bell, seduces Mr. Hundert into compromising his own moral standards and those of the academy by manipulating him into bending the rules in his favor. Bell, who shows academic promise but needs special attention, becomes Mr. Hundert’s official reform project. Failing to adhere to the rigid deontological theory of duty and obligation that he preaches, Hundert elevates Bell’s true academic rank to qualify him as a finalist in the esteemed annual Mr. Julius Caesar competition, in the process demoting the better, truly deserving student to fourth place.

¹A priori discussions of hypothetical scenarios with which the educator may potentially be confronted are imperative to the conscientious preparation of professional educators in that they incite reflective—and reflexive—processes that encourage professional and personal maturation (Stoehr, 2002).
The viewer’s sympathies extend to Hundert in that his manipulation of the assessment results is naively intended to bolster the academic zeal of a youth perceived to be faltering under enormous paternal pressure; however, from the viewer’s vantage point this breach constitutes an ethically questionable move that unleashes a series of unfortunate ramifications. This is the film’s pivotal event, the finale of which is carried out decades later to further underscore the futility of Hundert’s impulsively conceived project. Having proven himself to be capable of consciously compromising his standards in favor of an outcome he can only hope will pan out (i.e., Bell’s moral reform), Hundert reveals that he too is fallible, composed of contradictions, and thereby utterly human.

Quite similar to, yet by leagues different from, Plecki, Hundert receives a rude awakening during the Mr. Julius Caesar competition when he notices Bell peeking at an elaborate system of notes pinned within the folds of his toga. On discreet conveyance of this revelation to the seated headmaster, Hundert is bluntly advised to ignore it. His witnessing of hypocrisy blatantly sanctioned by the institution for the sake of salvaging the status quo (Bell’s powerful, redneckish senator father is among the audience) is a slap in the face. Still, Deepak, the truly meritorious student, wins the competition, as cream inevitably rises to the top, with sly Bell coming in an undeserved second. Twenty-five years later, representatives of Bell’s company offer a handsome donation to the academy in exchange for Hundert’s hosting a friendly rematch. Once again, come the event, Bell employs his cunning to cheat in the competition, this time in technically sophisticated fashion. However, Hundert outfoxes him by posing a perfectly legitimate yet unorthodox question to which all of Bell’s classmates know the answer but Bell, who had failed to pay attention in class, does not. His cheating has backfired this time and is exposed to all.

A parallel can be drawn between the characters of Cheaters’ learned yet fallen Dr. Plecki and The Emperor’s Club’s conniving student-cum-businessman Bell, neither of whom is especially compelled to observe steadfast moral principles. Both men appear to be bereft of conscience; in Bell’s case greed, and in Plecki’s a twisted sense of justice, betrays an obsession with ends, not means, at the expense of an honest or virtuous learning process. Both men pridefully flaunt their moral detachment in making decisions based solely on the projected consequences. This obtuse lack of conscience, feigned or not, contrasts starkly with the reflexive remorse displayed by Hundert who, although he also proves himself capable of a moral slip amid the delusion that his exemplary, doting behavior could supersede the influence of Bell’s depraved father, is vindicated in the end when Bell’s classmates pay tribute to him in a sentimental show of appreciation. (Their gesture prompts him to forgive himself for his imperfection and perceived failures.)

PROCESS ORIENTATION OR DEONTOLOGY AS MORAL COMPASS

Lying, cheating, and fraud are indubitably the most frequent transgressions committed by students, teachers, administrators, and parents alike, usually in pursuit of selfish interests with materialistic or shortsighted designs. “People do what they need to do to get what they want” (Hoffman, 2002), seethes Bell, referring beyond the boyish games of the academy to the no-nonsense business of life. Cheaters reveals the same (im)moral principle in operation in the character of Plecki, who abandons all semblance of moral rectitude in favor of the fleeting thrill of victory in competition. Even prior to his moral fall from grace, during a pep talk early in the film Plecki declares “I want you to know what it feels like to win” (Stockwell, 2000), echoing Milton’s Satan, who proclaimed that it was “better to reign in Hell than to serve in Heaven.” In an opposing sentiment, Shumway (2003) observed that as “language teachers, we issue an invitation to our students and provide a mechanism for imagining others and thereby for escaping a particular aspect of Dante’s version of Hell” (p. 160). By Hell he refers to self-obsession—the fuel of the prize-minded—and its path toward total emotional isolation; in issuing an invitation to escape that Hell, the mindful moral teacher—and teacher of morality—should endeavor to demonstrate how actions create reactions throughout a chain linking all relations.

Cheating and fraud are motivated by competitive sensibilities and practices in addition to social and political inequities. As a microcosm of the world at large, the classroom functions as a laboratory for experimenting with future survival strategies and should therefore encourage an atmosphere of cooperation and collaboration. Instead, schools serve as training grounds for the cutthroat, competitive culture of capitalism, preparing students for its attendant anxieties (Ollman, 2002). Academic examinations and competitions carry irrevocable consequences, and as such, the fear of failure leads to the oft-irresistible temptation to cheat or to commit other questionable maneuvers to manipulate examination results. Humiliation is another tactic employed to prepare students for the harsh realities that await them in the workplace or other social institutions, where intimidation and abuse pound less powerful members of society into complacently submissive roles. If these films are at all instructive, then this standpoint must be considered.

Cheaters and The Emperor’s Club both suggest that right action or character is hardly assured by staunch allegiance to aphorisms or ideals. In The Emperor’s Club, Hundert’s deepest conviction that it is his duty to mold the character of his students turns out to be overly ambitious and clumsily outmoded from the perspective of an irreverent youth bred to seek wealth and power, and ultimately it leads him to stray from his inner sense of duty. Although teacher and student represent character polar opposites, each strives toward his favored identity: Hundert as a pillar of moral certitude, parsimony, and diligence; Bell as a success in the world of material allure.
“A man’s character is his fate” (Hoffman, 2002), Hundert’s resounding mantra, rings absolutely meaningless to Bell, who makes a mockery of his teacher in the end by revealing that it was not nostalgic hankering that had motivated the lavish class reunion but rather a clever utilitarian ploy to announce his bid for a seat in the U.S. Senate while basking in the admiration of his teenage consorts. Over the years Bell has burrowed deeper into Dante’s Hell, immersed in aspirations of ascent, still scowling 25 years after his initial exposure to the academy’s creed of altruism and philanthropy, “I’ll worry about my contribution later” (Hoffman, 2002). Indeed, Bell’s nature, marred by hypocrisy, remorseless lies, and skill at usury, becomes his fate, and these are the answers to Hundert’s resounding question: “How will history remember you?” (Hoffman, 2002).

Consequentialist and nonconsequentialist perspectives are two competing ethical approaches that double as theories of social relations. A consequentialist would accord greater significance to the utilitarian outcomes of his actions than to a sense of unwavering obligation to obey universal (Kantian) principles, whereas a nonconsequentialist would consider duty, obligation, and principle the more compelling motivation in making ethical choices. Strike and Soltis (1998) pointed out that a “good consequentialist is not simply interested in producing any results that are intrinsically good. Consequentialists are interested in maximizing the good, that is, producing the most good” (p. 12). They endeavor to weigh the likely consequences of actions to determine a balance in the distribution of relative good in search of the “best set of consequences” (p. 12), termed benefit maximization. This model describes neither Bell, whose chief aim is to maintain a superficial veil of personal power, nor Plecki, who abandons all scruples, although both choose utilitarian courses.

Both Hundert and Plecki act on impulse to protect short-term, emotionally driven interests. Myopic and conflicted, Hundert amends his conception of duty to step in as the ingrate Bell’s futile savior; driven by wrathful vengeance to outsmart the oppressive upper classes, Plecki decisively eschews any consideration of right and wrong. Because neither man seriously contemplates his actions or possesses the omniscience to forecast the eventual impact on everyone affected, their respective missions incur major casualties—integrity, honor, and justice, to name a few. But these characters, similar to almost any educator prone to making daily gut-level decisions based on his experiential relations to students coming from diverse social, political, and historical realities, are difficult to pin down. On the flip side, Strike and Soltis (1998) claimed that, unlike the hypothetically more benevolent, future-oriented consequentialists, nonconsequentialists retain a memory of the past and seek to set things right:

History is particularly important when we need to consider the remedy for prior injustice. It often seems morally appropriate to give someone something that they do not otherwise deserve if their current lack is a consequence of some prior injustice. (p. 56)

Plecki’s disillusionment with the American Dream and his subsequent lashing out against the system are readable within this context, as is Hundert’s instinctive urge to step in for Bell’s generally absent father. Both teachers use their students’ pasts as justification for their actions, but those actions do not yield maximally beneficial consequences for anyone involved.

HUMILIATION TACTICS IN THE SPHERE OF TEACHING AND TESTING

I have thus far kept quiet about Stand and Deliver, a film of a somewhat different ilk with a different set of moral problems. Here, high school math teacher Jaime Escalante (played by Edward James Olmos) performs the magnificent feat of training East Los Angeles students from the barrio to master the Advanced Placement (AP) calculus examination. They proceed to achieve the highest scores in the district, but ETS later charges them with cheating without clear evidence of such an act, and their test scores are canceled. The two ETS officers who visit the school to meet the principal and students authoritatively assert that the students have cheated. They fail to produce any evidence to substantiate this assertion but later point out to the teacher that ETS’s analysis of the students’ patterns of errors, together with their overall high scores, marks them as cheaters. They deny that ethnicity and the school’s location in a socioeconomically disadvantaged area are factors behind the accusation. Accustomed to bias and certain of their ability, the students opt to retake the exam and again achieve exceptional scores that indict ETS for its previous actions. What is questionable about ETS’s actions is that it offers credible evidence to no one, thereby violating professional standards of responsibility in such matters, and that it denies the test takers the opportunity to see the evidence, the right to appeal, and the means to seek redress and remedy.

GROUP DYNAMICS

After Plecki’s students achieve remarkably laudable scores on the Illinois state exam, the fallen professor, ominously echoing Franklin’s Poor Richard’s Almanack, warns that “the only way three people can keep a secret is if two of them are dead.” Quite prescient a remark, for soon enough the fractured group suffers the embittered dissension of Irwin Flickas who, ego bruised at having been relegated to a nonfocal position, retaliates by finking to the media. A shared sense of identity, based on mutual past, present, or future experience, creates a foundation of solidarity that lends support to mutual aims. (Think of the rabble-rousing members of the Dead Poets Society; Weir, 1989.) In analyzing the political underpinnings of interpersonal relations, specifically the individual actions of key protagonists Hundert, Plecki, and Bell, it is useful to consider the delicate dialectical social interplay that motivates moral agency:

Whether we like it or not, many people believe that they act regardless of the claims of the interests of others at all. Other people take the groups they belong to ... to be fundamental to the question of who they claim to be and sufficient as providing good reasons for how they act in the world toward others. (Light, 2003, p. 7)

Recall Flickas, who violates the group’s pact of silence in Cheaters. Possibly overwhelmed by conscience, probably acting out of spite, he nevertheless commits an act of bravery in daring to stand alone; those who adopt Plecki’s retaliatory stance likely feel at ease in that none of them is individually responsible for the fiasco. Siding with the group enables the denial of personal responsibility and justifies the “immoral” action. It is common knowledge that in many social and cultural contexts group loyalty is valued above individuality and that in most group dynamics a dominant individual exercises control. Bell momentarily wrests this advantage away from Hundert, long enough to achieve his ends. Yet in Hundert’s defense, bear in mind that political identities seldom remain uniform or stationary, much as the interpersonal dynamic in the teacher-student relation is ever shifting.

CLASS CONFLICT AND SOCIAL AND POLITICAL INEQUITIES

In Cheaters, the lack of equitable school funding in socioeconomically opposite Chicago neighborhoods looms in the background as a source of social and political inequity and personal injury. Light (2003) referred to this narrative element as “background spatial politics” (p. 79). Compared to its affluent rival, the cheaters’ Steinmetz High is ill-equipped, a condition of inherent injustice that galvanizes Plecki’s and his students’ actions. A universal American dictum—that cheating on exams is inexcusably wrong—is openly flouted and even legitimated by placing the blame on economic disparities. The disadvantaged youth comport themselves as victims given license to break the rules, fully blessed by Plecki, who explains to his horrified mother, already distraught that he has betrayed his students in his presumed role as a teacher of morality, that he summarily rejects his father’s “bullshit immigrant notion that if he worked hard enough all his dreams would come true” (Stockwell, 2000). Plecki’s cynical view bears no resemblance to the faith and optimism of Hundert, who never discards his belief in basic moral principles. Says Plecki of having stolen the exam, “I look at what we did as a kind of civil disobedience” (Stockwell, 2000).

Back at The Emperor’s Club, a tacit moral code would reject such an outright act of defiance. Although Bell certainly cannot claim to be a victim in the same sociopolitical sense as the teens at Steinmetz High, in the world he inherits from his father, where rule breaking is tantamount to playing the game, Bell assumes the identity of a cunning, conniving player early on. In this respect, Hundert perceives Bell as a victim of emotional and paternal neglect; Senator Bell has only performed his duty in priming his son to one day walk in his shoes.
These three films underscore the primacy of class conflict in the academic milieu: Hundert versus his privileged student in The Emperor’s Club; disadvantaged students versus affluent ones in Cheaters; stereotypically underachieving East Los Angeles AP students versus the ETS bureaucracy in Stand and Deliver. This raises the question of whether transcending economic class presents an academic aspirant’s most impenetrable barrier. For the viewer, the closet transgressions of the upper-crust set in The Emperor’s Club or Dead Poets Society are more readily palatable than the blatant defiance of the street-savvy youth portrayed in Cheaters. In Stand and Deliver, the students twice rise to the challenge presented by ETS and undeniably deliver, proving that its essentially racist accusations of underhandedness are false.

CONCLUSIONS

To revisit the underlying issue in Cheaters from another angle, in the United States unequal access to a decent education remains a grave and embarrassing contemporary social ill. Kunnan’s (2003) definition of fairness, consisting of two main components, social equity and equal validity, sheds light on this problem. To achieve social equity in our educational systems, especially in testing, students must first have unfettered, equal access to learning opportunities. Lacking equal access, the economically disadvantaged students and their teacher at Steinmetz High rightly assume that test score interpretations of the Illinois statewide competition will not have equal validity for all test takers; their discarding of conventional notions of morality therefore seems to them a justifiable act of civil disobedience. Historical and personal experience has demonstrated to them that longstanding inequities unequivocally impede long-term academic success. The overt privileges enjoyed by the affluent preclude fairness and hence justice for all. Arguably (and optimistically), the students at Steinmetz High are less motivated by the shallow aim of competition than by the desire to subvert the course of historical outcomes and to seek justice for substandard educational opportunities.

It is reasonable to suggest that cheating on a high-stakes examination in America comes about as a result of a fiercely competitive culture in which opportunities are unevenly distributed. The underprivileged are likely to be dominated by those who score successfully on exams, and exam success invites the possibility of a share of the power and prestige in a highly hegemonic nation. When high-stakes testing is seen for what it really is—“a means of political control” (Kunnan, 2003, p. 13; see also Ollman, 2002)—then it is easy to sympathize with the Steinmetz High group’s conscious choice to cheat to advance their lot. Teachers, too, have flawed characters and can make irreparably poor decisions.
Johnston (2003) wrote that teachers should regard “what we do in classrooms (and outside of them) [as] fundamentally rooted in the values we hold and the relation we hold with our students” (p. 5). The films on which I have focused this discussion offer some enlightenment—and some confusion—as to how one might approach questions that arise concerning lying, cheating, and fraud in testing and competitive environments. I conclude that because educators’ professional identities are inseparable from their personal identities, whether or not they choose to acknowledge this, we must trust in the spontaneity of our reactions and the fundamental values that we dearly uphold in openly apprehending the uniqueness of each scenario with which we are challenged.

REFERENCES

Hoffman, M. (Director). (2002). The Emperor’s Club [Motion picture]. United States: Universal Studios.
Johnston, B. (2003). Values in English language teaching. Mahwah, NJ: Lawrence Erlbaum.
Kunnan, A. (2003). Fairness and ethics in language assessment: Course readings: TESL 567A. Los Angeles: California State University.

Light, A. (2003). Reel arguments: Film, philosophy, and social criticism. Boulder, CO: Westview Press.
Marshall, E. O. (2003). Making the most of a good story: Effective use of film as a teaching resource for ethics. Teaching Theology and Religion, 6, 93-98.
Musca, T. (Producer/Writer), & Menendez, R. (Writer/Director). (1988). Stand and Deliver [Motion picture]. United States: Warner Bros. Pictures.
Ollman, B. (2002, October). Why so many exams? A Marxist response. Retrieved September 11, 2003, from http://www.pipeline.com/~rgibson/whyexams.html
Shumway, N. (2003). Ethics, politics, and advocacy in the foreign language classroom. In P. C. Patrikis (Ed.), Reading between the lines: Perspectives on foreign language literacy (pp. 159-168). New Haven, CT: Yale University Press.
Stockwell, J. (Writer/Director). (2000). Cheaters [Motion picture]. United States: Home Box Office.
Stoehr, K. (2002). Film and knowledge: Essays on the integration of images and ideas. Jefferson, NC: McFarland.
Strike, K., & Soltis, J. (1998). The ethics of teaching (3rd ed.). New York: Teachers College Press.
Weir, P. (Director). (1989). Dead Poets Society [Motion picture]. United States: Touchstone Pictures.